FONETIK 2005
The XVIIIth Swedish Phonetics Conference
May 25–27, 2005
Department of Linguistics
Göteborg University
Preface
This volume contains the contributions to FONETIK 2005, the Eighteenth Swedish Phonetics Conference, organized by the Phonetics group at Göteborg University on May 25–27, 2005. The papers appear in the order in which they were presented at the conference.
Only a limited number of copies of this publication have been printed, for distribution among the authors and those attending the conference. Electronic versions of the contributions are available at:
http://www.ling.gu.se/konferenser/fonetik2005/
We would like to thank all contributors to the Proceedings. We are also indebted to
Fonetikstiftelsen for financial support.
Åsa Abelin
Jonas Lindh
Previous Swedish Phonetics Conferences:

1986  Uppsala University
1988  Lund University
1989  KTH Stockholm
1990  Umeå University (Lövånger)
1991  Stockholm University
1992  Chalmers and Göteborg University
1993  Uppsala University
1994  Lund University (Höör)
1995  (XIIIth ICPhS in Stockholm)
1996  KTH Stockholm (Nässlingen)
1997  Umeå University
1998  Stockholm University
1999  Göteborg University
2000  Skövde University College
2001  Lund University
2002  KTH Stockholm
2003  Umeå University (Lövånger)
2004  Stockholm University
Contents

Dialectal, regional and sociophonetic variation

Phonological quantity in Swedish dialects: A data-driven categorization
  Felix Schaeffler  1

Poster session

Phonological interferences in the third language learning of Swedish and German (FIST)
  Robert Bannert  75

Word accents over time: comparing present-day data with Meyer's accent contours
  Linnea Fransson and Eva Strangert  79

Effects of stimulus duration and type on perception of female and male speaker age
  Susanne Schötz  87

Speech perception

Prosodic features in the perception of clarification ellipses
  Jens Edlund, David House, and Gabriel Skantze  107

Speech production

Vowel durations of normal and pathological speech
  Antonis Botinis, Marios Fourakis and Ioanna Orfanidou  123

Acoustic evidence of the prevalence of the emphatic feature over the word in Arabic
  Zeki Majeed Hassan  127

Closing discussion

Athens 2006 ISCA Workshop on Experimental Linguistics
  Antonis Botinis, Christoforos Charalabakis, Marios Fourakis and Barbara Gawronska  131

Author index  139
Phonological quantity in Swedish dialects: A data-driven categorization
Felix Schaeffler

Abstract
This study presents a data-driven categorisation (cluster analysis) of 86 Swedish dialects,
based on durational measurements of long and
short vowels and consonants. The study reveals
a clear geographic distribution that, for the
most part, corresponds with dialectological descriptions. For a minor group of dialects,
however, the results suggest mismatches
between the quantity system and the observed
segment durations. This phenomenon is discussed with reference to a theory of quantity
change (Labov 1994).
Introduction
Phonological quantity in Standard Swedish is
usually described as being complementary:
Long vowels in closed syllables are followed
by short consonants, while short vowels are always followed by a long consonant or a consonant cluster.
This modern system has developed from a
quantity system with independent vowel and
consonant quantity, where all four possible
combinations of long and short segments (VC,
V:C, VC: and V:C:) existed. The modern system evolved through shortening of V:C: and lengthening of VC structures. Not all dialects of Modern Swedish have completed this change. Some dialects kept the four-way distinction. This applies to a group of dialects in the Finnish-Swedish region and in Dalarna in Western
Sweden. Another group of dialects abandoned
V:C: successions but kept VC structures. This
has mainly been reported for large parts of
Northern Sweden, but also for some places in
Middle Sweden.
There are, thus, today at least three different
quantity systems in the dialects of Modern
Swedish: 4-way-systems (VC, V:C, VC: and
V:C:), 3-way-systems (VC, V:C and VC:) and
2-way-systems with complementary quantity
(V:C and VC:).
Data-driven categorisation
The method of choice in this study was a hierarchical cluster analysis with Euclidean distance.
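As an illustration of the method, a hierarchical clustering of dialects by their mean segment durations might look like the following sketch. The duration values, the two-cluster cut and the Ward linkage choice are invented for illustration; they are not the study's actual data or settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy mean durations (ms) per dialect; columns: V:, V, C:, C
# (values invented -- the study's 86-dialect data set is not reproduced here)
durations = np.array([
    [180, 90, 160, 70],    # dialect A
    [178, 92, 162, 68],    # dialect B
    [140, 95, 200, 110],   # dialect C
    [142, 93, 198, 112],   # dialect D
])

# agglomerative clustering on Euclidean distances between duration profiles
Z = linkage(durations, method="ward", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```

Inspecting the dendrogram built from `Z` (e.g. with `scipy.cluster.hierarchy.dendrogram`) is the visual step the paper refers to when choosing the number of clusters.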
Results
The visual inspection of the cluster dendrogram suggested a four-cluster solution as appropriate for the current analysis. Additionally, the parameter η² was calculated, which is usually used in analyses of variance to describe the amount of explained variance, but has also been suggested as a criterion for the evaluation of different cluster solutions (Timm, 2002). The four-cluster solution led to an η² value of 0.68. For comparison, the three-cluster solution showed an η² of 0.32, and a ten-cluster solution an η² of 0.86. The large increment in η² from the three- to the four-cluster solution, followed by comparatively low increments, supported the four-cluster solution.
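The η² criterion can be computed directly as the ratio of between-cluster to total sum of squares. The sketch below shows the calculation on invented duration profiles; the values do not correspond to the study's data.

```python
import numpy as np

def eta_squared(data, labels):
    """eta^2 of a cluster solution: between-cluster SS / total SS,
    i.e. the share of variance that cluster membership explains."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    grand_mean = data.mean(axis=0)
    ss_total = ((data - grand_mean) ** 2).sum()
    ss_between = sum(
        (labels == k).sum() * ((data[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
        for k in np.unique(labels)
    )
    return ss_between / ss_total

# two well-separated toy clusters of duration profiles (V:, V, C:, C)
X = [[180, 90, 160, 70], [182, 92, 158, 72], [178, 88, 162, 68],
     [140, 95, 200, 110], [142, 97, 198, 112], [138, 93, 202, 108]]
print(round(eta_squared(X, [0, 0, 0, 1, 1, 1]), 2))  # → 0.99
```

For tight, well-separated clusters η² approaches 1; comparing the increment between successive solutions is the criterion used above.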
Durational characteristics of the clusters
Figures 2 and 3 show the distributions of the four segment durations across the four clusters in the form of box-plots. Figure 2 shows the distribution of V: and V durations, Figure 3 that of C: and C durations.
The Finnish cluster (4) is separated from the
other clusters by longer V: and V durations and
shorter C durations. C: durations in the Finnish
cluster (4) are close to the ones in the Northern
cluster (3), but clearly longer than those in the
Southern clusters (1) and (2). Consequently, the
Northern cluster (3), as well, is separated from
the Southern clusters (1) and (2) by longer C:
durations. However, the Northern cluster (3) shows rather long C durations, a clear difference from the short C durations in the Finnish cluster (4).
Geographic distribution
Figure 1 shows a stylised map of Sweden and
the Swedish parts of Finland. The colour-coding and the numbers show the geographic distribution of the four clusters.
The clusters show a clear geographic distribution. Cluster (4), n=7, separates all dialects
on the Finnish mainland from the rest of the
dialects. Cluster (3), n=23, is mainly restricted
Table 1: V:/C and V/C: duration ratios per cluster.

          V:/C    V/C:
Cl. (1)   1.15    0.53
Cl. (2)   1.13    0.57
Cl. (3)   1.00    0.41
Cl. (4)   1.83    0.44
Figure 2: V and V: durations per cluster. Grey: short vowels, white: long
vowels.
Figure 3: C and C: durations per cluster. Grey: short consonants, white: long
consonants.
Figure 4: V:/V and C:/C ratios per cluster. Grey: vowel ratios, white: consonant ratios.
Discussion
The cluster analysis showed results similar to
the analysis presented in Schaeffler (2005).
There are three main groups with a clear geographic distribution: A Finnish cluster, which
includes all dialects on the Finnish mainland, a
Northern cluster, mainly concentrated in Northern Sweden from Lappland to Gästrikland, and
two Southern clusters.
The consideration of two Southern clusters instead of one was motivated by a major increase in the value of η². The durational characteristics, however, do not present much support
for such a partitioning. All segments in cluster
(1) are longer than in cluster (2), which leads to
very similar segment ratios (see table 1 and figure 4). This, together with the lack of a clear
geographic distribution, suggests that speech
rate effects might be responsible for the difference.
The geographic distribution corresponds
with the traditional descriptions of the Swedish
dialects as outlined in the introduction. 4-way
distinctions are mainly found in the Finnish region, 3-way distinctions frequently in the
Northern regions and 2-way distinctions in the
Southern Swedish areas (see e.g. Wessén 1969,
Riad 1992). In Schaeffler (2005), the observed
durational differences were attributed to these
functional differences. In 4-way distinctions,
where V:C and VC: sequences contrast with VC
and V:C:, clear durational distinctions of vowels and consonants are expected. This corresponds with the observed durations for the
Finnish region. All other dialects, however,
show rather low consonant ratios, while a clear
durational difference between the vowels is
maintained.
A further aspect of the results deserves attention: The geographic distribution resulting
from the cluster analysis is almost too clear-cut.
According to dialectological descriptions, some
dialects in the Finnish cluster (4) show 3-way
systems (Ivars 1988), comparable to those in
the Northern Swedish regions. In spite of these
functional congruences between parts of Northern Swedish and Finland-Swedish, the segment
durations differ. There is, however, strong evid-
References
Bortz J. (1999) Statistik für Sozialwissenschaftler. 5th edition. Berlin: Springer.
Gordon A.D. (1999) Classification. 2nd edition. Boca Raton: Chapman & Hall.
Ivars A.-M. (1988) Närpesdialekten på 1980-talet. Std. i nordisk fonologi, volume 70.
Labov W. (1994) Principles of Linguistic Change, Vol. I. Cambridge: Blackwell.
Riad T. (1992) Structures in Germanic Prosody. Stockholm: Univ.
Schaeffler F. and Wretling P. (2003) Towards a typology of phonological quantity in the dialects of Modern Swedish. Proc. 15th ICPhS, Barcelona, 2697-2700.
Schaeffler F. (2005, forthcoming) Drawing the Swedish quantity map: from Bara to Vr. Proc. Nordic Prosody IX, Lund.
Strangert & Wretling (2003) Complementary quantity in Swedish dialects. Proc. Fonetik 2003, Umeå.
Timm N.H. (2002) Applied Multivariate Analysis. New York: Springer.
Wessén E. (1969) Våra folkmål. Lund: C E Fritzes Bokförlag.
Abstract
This paper presents main results from a Ph.D. thesis on sociolinguistic variation among students in an upper secondary school in Alingsås, a town of 25,000 northeast of Göteborg. Phonological variants are found to be associated with traditional local dialect, regional and supraregional standard, Göteborg vernacular, and general and Göteborg youth language.
Correlations with demogeographical areas generally show a pattern running from southwest to northeast (along the E20 highway and the railway from Göteborg). One area does not fit into the continuum: Sollebrunn (NW of Alingsås), where particularly female informants tend to use standard forms and innovations to a surprisingly high extent. Gender is the second most important social factor, but it operates in different ways: there are major differences from one social group to another when it comes to expressing gendered identity through linguistic means.
Method
The informants were categorized according to social variables representing different aspects of background and identity: gender, type of study programme (vocational, intermediate, preparatory for university), demogeographical area (based on the extent of urbanization in the five municipalities), Alingsås neighbourhood (divided on the basis of socio-economic factors), and lifestyle, based on a two-dimensional mapping (concerning taste, leisure, mobility, plans for the future, etc.). The lifestyle analysis both complements and includes traditional sociolinguistic variables.3
Eight linguistic variables were analyzed extensively: four phonological, two lexical and two morpho-phonological. Instances of variants in the recorded interviews were counted manually, and frequencies of variants were correlated statistically with social variables on a group level. Examples from analyses of three phonological variables will be used in the following discussion.
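The group-level correlation step can be illustrated with a simple contingency test. The counts and the chi-square choice below are illustrative assumptions, not the thesis's actual figures or its statistical procedure.

```python
from scipy.stats import chi2_contingency

# invented variant counts: rows = two demogeographical areas,
# columns = standard variant vs. local variant
counts = [[40, 10],   # area A: mostly standard
          [15, 35]]   # area B: mostly local

chi2, p, dof, expected = chi2_contingency(counts)
print(dof, p < 0.05)  # a clear area effect on variant choice
```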
Introduction
This article presents some of the core results from my Ph.D. thesis (Grönberg 2004), a study of sociolinguistic variation among students from five municipalities, all attending an upper secondary school in Alingsås, a town of 25,000 northeast of Göteborg, Sweden.1
The main aim of the thesis was to study covariation between linguistic variation and social identity, and to relate it to language and dialect change. A number of questions were raised, of which the following will be discussed here:
- To what extent does linguistic variation depend on the informant's orientation towards the place where (s)he lives or other places?
- How do the findings from the upper secondary school in Alingsås differ from results from comparable informants in Göteborg?
- Which social factors are most important for linguistic choices?
Material
The material studied consists of tape-recorded interviews with 97 students at the Alströmergymnasium, which at the time of recording in
[Figures: percentage distributions of variants by area (Alingsås, Herrljunga, Lerum, Gräfsnäs, Sollebrunn, Göteborg) for three phonological variables: diphthongized/standard/lowered U, fricativized/standard/lowered I/Y, and closed/open R.]
Discussion
Which social factors are most important for linguistic choices? Is it possible to identify groups
References
Bourdieu, Pierre (2002) [1984]. Distinction. A Social Critique of the Judgement of Taste. Reprint. London: Routledge & Kegan Paul Ltd.
Dahl, Henrik (1997). Hvis din nabo var en bil. En bog om livsstil. København: Akademisk Forlag A/S.
Grönberg, Anna Gunnarsdotter (2004). Ungdomar och dialekt i Alingsås. (Nordistica Gothoburgensia 27.) Göteborg: Acta Universitatis Gothoburgensis.
Grönvall, Camilla (2005). Lättgöteborgska i Kungsbacka. En beskrivning av några gymnasisters språk 1997. Göteborg. Unpublished manuscript.
Norrby, Catrin & Karolina Wirdenäs (1998). The Language and Music Worlds of High School Students. In: Pedersen, Inge Lise & Jann Scheuer (eds.), Sprog, Køn og Kommunikation. Rapport fra 3. Nordiske Konference om Sprog og Køn, København, 11.–13. Oktober 1997. København: C.A. Reitzels Forlag. S. 155–164.
Sandøy, Helge (1993). Talemål. Oslo: Novus.
Thelander, Mats (1979). Språkliga variationsmodeller tillämpade på nutida Burträsktal. Del 1 & 2. (Studia Philologiae Scandinavicae Upsaliensia 14:1 & 14:2.) Uppsala: Acta Universitatis Upsaliensis.
Ungdomsbarometern (1999). Livsstil & fritid. Stockholm: Ungdomsbarometern AB.
Westerberg, Anna (2004). Norsjömålet under 150 år. (Acta Academiae Regiae Gustavi Adolphi LXXXVI.) Uppsala: Kungl. Gustav Adolf Akademien för svensk folkkultur.
Notes
1. The Swedish upper secondary school, gymnasium, gives courses of three years' duration for students having completed nine years of
Abstract
Dialects of Swedish vary in the pronunciation of unstressed /e/ in different phonological environments. In this pilot study, Stockholm Swedish is compared with several Finland Swedish dialects. Stockholm and one Åland dialect lower and back /e/ before [n], while Helsinki and most Nyland dialects lower and back /e/ before [r]. The data provide evidence for the sociolinguistic relevance of unstressed vowel pronunciation.
Introduction
Stressed short [e] and [ɛ] are in complementary distribution in most Swedish dialects: the allophone [ɛ] occurs before [r], and [e] occurs in all other environments. In Finland Swedish, transcription conventions (e.g. in Harling-Kranck 1998) and informal reports by native speakers suggest that the same distribution may hold in unstressed syllables as well. Since it is not clear how widespread this phenomenon is, a pilot study was conducted to investigate the phonetics of unstressed /e/ across several dialects. The following questions were addressed: 1) How is unstressed /e/ pronounced in Stockholm Swedish? 2) Are unstressed [e] and [ɛ] in fact in complementary distribution in standard Helsinki Swedish? 3) Do rural Finland Swedish dialects pattern with Helsinki, or with Stockholm, or do they show their own patterns? Finally, 4) Can the regional differences be explained?
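The distribution under investigation can be stated as a one-line rule. The function below merely encodes the standard pattern for stressed short /e/ as a point of reference; the segment labels are illustrative.

```python
def expected_allophone(following_segment):
    """Standard complementary distribution of short /e/:
    [ɛ] before [r], [e] in all other environments."""
    return "ɛ" if following_segment == "r" else "e"

# the pilot study asks whether unstressed /e/ obeys the same rule
print(expected_allophone("r"), expected_allophone("n"), expected_allophone("t"))  # → ɛ e e
```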
Results
Stockholm
Unstressed /e/ in Stockholm Swedish was
generally realized as schwa, but a pattern
emerged for both Stockholm speakers that the
schwa had higher F1 and lower F2 when
preceding [n] than in other environments. There
is little overlap between the [n]-environment
tokens and the other tokens in the F2 vs. F1 plots
in Figs. 1 and 2.
[Figures 1 and 2: F2 vs. F1 plots of unstressed /e/ tokens for the two Stockholm speakers. Tokens are labelled E, N and R by following environment; axis values (Hz) omitted.]
Helsinki
The Helsinki newscasters had a very different pattern from Stockholm. Both speakers categorically lowered and backed unstressed /e/ before [r], as in Fig. 3. This result seems to confirm the existence of [e]~[ɛ] allophony in unstressed syllables, at least on a phonetic level.
[Figure 3: F2 vs. F1 plot for the Helsinki speakers; tokens labelled E, N and R. Axis values omitted.]
Nyland
In most rural villages of Nyland (the province where Helsinki is located), the Helsinki pattern obtains: before [r], unstressed /e/ approaches an [ɛ]-like pronunciation.
[Further F2 vs. F1 plots for the Nyland dialects; tokens labelled E, N and R. Axis values omitted.]
Acknowledgements
I would like to thank Leanne Hinton for valuable
advice and discussion. This research has been
supported by a Fulbright Grant and a Jacob K.
Javits Graduate Fellowship.
References
Bergroth H. (1917) Finlandssvenska: handledning till undvikande av provinsialismer i tal och skrift. Helsinki: Holger Schildts.
Harling-Kranck G. (1998) Från Pyttis till Nedervetil: tjugonio dialektprov från Nyland, Åboland, Åland och Österbotten. Helsinki: Svenska Litteratursällskapet.
Kuronen M. and Leinonen K. (2000) Fonetiska skillnader mellan finlandssvenska och rikssvenska. Svenskans beskrivning 24, 125–138. Linköpings universitet.
Labov W. (1994) Principles of Linguistic Change: Internal Factors. Oxford: Blackwell.
Background
According to the Swedish tonal typology (Gårding, 1973, with Lindblad 1975; Bruce & Gårding, 1978), the timing/alignment of the word accent gesture is essential to the Swedish word accent distinction.
The traditionally described word accent pattern of the West Swedish prosodic dialect type (see Bruce & Gårding, 1978) involves low pitch on the stressed vowel for accent 1 words and a peak on the stressed vowel for accent 2 words in focal position. Bruce (1998) has suggested that Gothenburg Swedish is characterized by two-peaked pitch contours for both word accents, with an earlier timing in accent 1. A previous production study (Segerup, 2004) confirms that Gothenburg Swedish accent 1 deviates from the generally accepted West Swedish accent 1 pattern through having a fall on the stressed vowel. Furthermore, the fall of accent 2 is only marginally later than that of accent 1, meaning that the expected timing difference between accent 1 and accent 2 is smaller than in other dialect types. Consequently, the overall shapes of the pitch contours are strikingly similar, and yet they are perceptually distinct (Segerup, 2004; Segerup & Nolan, forthc.).
Abstract
According to the conventional wisdom, the word accent distinction in Swedish (dialects) is maintained chiefly by a difference in the timing of the word accent gesture (Gårding, 1973). Gothenburg Swedish, however, does not conform to the norm, since both pitch height and timing contribute to the word accent distinction in this dialect (Segerup, 2004). In Gothenburg Swedish both word accents have a fall on the stressed vowel, which makes the pitch contours strikingly similar (Segerup, 2004).
Up until now, the material investigated has consisted of contrastive words in which the stressed vowel is phonologically long. In the present production study we proceed, for comparison, with word-pairs in which the stressed vowel is phonologically short. The acoustic analysis involves measurements of fundamental frequency (F0) and segment durations in five speakers' productions of seven word-pairs altogether.
The results show a significant difference in the duration of the short stressed vowel between accent 1 and accent 2, and thus that word accent has an effect on vowel duration.
INTRODUCTION
The present paper investigates the interaction
between word accent and quantity in Gothenburg
Swedish. Minimal pairs with accent 1 and accent 2
and with either long or short stressed vowel are
examined. How are pitch height and timing affected
when the voiced portion of the syllable is minimized
by having a short vowel followed by a voiceless
consonant as opposed to a more sonorant
environment, i.e. a long vowel or sonorant consonant?
This is related to the general question of truncation
or compression of the f0 contour in an intonationally
unfavourable environment (see e.g. Bannert &
Bredvad-Jensen, 1975).
INVESTIGATION
Materials, subjects, method
The speech materials comprise seven contrastive disyllabic word-pairs, all of which are listed pair-wise in Table 1 below. (Since the present investigation is part of a large-scale study, the word-pairs included here are not completely symmetric.) The target words, in phrase-final focal position, were extracted from various sets of sentences (statements) spoken in two different speaking styles: normal and clear voice.
RESULTS
The results of the acoustic analysis are exemplified in Figures 1–4. Figures 1–3 show the average f0 curves for five speakers' productions of malen/malen, pollen/pållen and tecken/täcken in clear style, respectively. The duration of the stressed vowel is indicated by a bar (at the top for accent 2 and at the bottom for accent 1). The pitch contours are aligned at the start of the stressed vowel, and earlier points are shown as having negative times relative to the alignment point. In words with a long vowel, the duration of the stressed vowel and the overall timing of the two word accents are very similar, meaning that a direct comparison of the timing of pitch events is possible, which is generally not the case in words with a short vowel.
Figure 4 compares, for accent 1 and accent 2, the average duration of the stressed vowel for the word-pairs malen/malen, pollen/pållen and tecken/täcken.
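The alignment convention described in the Results (contours lined up at the start of the stressed vowel, earlier points given negative times) can be sketched as follows; the frame times are invented.

```python
def align_to_vowel_onset(frame_times_ms, vowel_onset_ms):
    """Shift an f0 track's time axis so the stressed vowel starts at 0 ms;
    frames before the vowel onset get negative times."""
    return [t - vowel_onset_ms for t in frame_times_ms]

track = [0, 50, 100, 150, 200]            # raw frame times of one utterance
print(align_to_vowel_onset(track, 100))   # → [-100, -50, 0, 50, 100]
```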
Table 1. The seven contrastive word-pairs.

Accent 1 V:  Polen (Poland)     Judith                    malen (the moths)  buren (the cage)
Accent 2 V:  pålen (the stake)  ljudit (to have sounded)  malen (ground)     buren (carried)
Accent 1 V   pollen (pollen)    tecken (signs)            tjecker (Czechs)
Accent 2 V   pållen (horsey)    täcken (quilts)           checker (cheques)
[Figures 1–3: average f0 contours (Hz, 50–275) against time (ms, −300 to 700) for malen/malen, pollen/pållen and tecken/täcken; accent 1 vs. accent 2, averages over five speakers.]
[Figure 4: average duration (ms, 50–250) of the stressed vowel in malen/malen, pollen/pållen and tecken/täcken for accent 1 and accent 2.]
DISCUSSION
In Gothenburg Swedish words with a short vowel, accent 2 seems to demonstrate truncation of the pitch fall, while accent 1 seems to demonstrate compression of the fall and also some lengthening of the stressed vowel. It appears that the Gothenburg Swedish speakers' strategy is to preserve the fall in accent 1, while the fall seems to be of less importance for accent 2.
One interpretation of this is that the falling f0 contour in the stressed vowel of accent 1, and the height from which the fall takes place in accent 2, are enough of a cue to maintain the distinction between the word accents in words with a short stressed vowel. House (1990) has worked with a model of tonal feature perception which may be applied to these findings.
In order to fully understand the interaction of these cues, a perceptual experiment with synthetic stimuli is in preparation, which will manipulate pitch height and slope in order to discover the relative importance of these factors.
Visual Acoustic vs. Aural Perceptual Speaker Identification in a Closed Set of Disguised Voices
Jonas Lindh
Department of Linguistics
Göteborg University
the effects of low quality recordings. Generally, one can say that aural identification has primarily been the leading method when it comes to casework. Many studies have been carried out to see which parameters are most stable, or where the effects of low quality can be calculated, for example the telephone effect (Künzel, 2001).
Generally, LTAS becomes rather stable after 30–40 seconds of speech (Boves, 1984; Fritzell et al., 1974; Keller, 2004). LTAS reflects, on average, the energy highs and lows generated by the vocal tract filter, which means that it should be more difficult to alter than, for example, F0 or specific phones; this is why the measure is often chosen to visually represent the general energy distribution over frequency for the speech signal. Several studies have examined energy ratios and level differences for LTAS (Löfqvist, 1986; Löfqvist & Mandersson, 1987; Gauffin & Sundberg, 1977; Kitzing, 1986). Kitzing (1986) recommended that patients should read at the same degree of vocal loudness to avoid the differences that occurred especially in higher frequencies. Kitzing & Åkerlund (1993) pointed out the need for an investigation of the effect of vocal loudness on LTAS curves. Nordenberg & Sundberg (2003) performed such a test and showed that vocal loudness and varied f0 gave variations in Long Time Average Spectra. However, even though an expected variation has been shown, pattern matching on the graphs still seems possible. It has been observed that slight differences in identification results between subjects depend on whether they consider distance more important than shape/pattern.
Hollien & Majewski (1977) tested long-term spectra as a means of speaker identification under three different speaking conditions: normal, during stress, and disguised speech. LTS for fifty American and fifty Polish male speakers were used under fullband as well as passband conditions. The results demonstrated high levels of correct identification (especially under fullband conditions) for normal speech, with degrading results for stress and disguise.
Abstract
Many studies of automatic speaker recognition have investigated which parameters perform best. This paper presents an experiment in which graphic representations of LTAS (Long Time Average Spectrum) were used to identify speakers from a closed set of disguised voices, in order to determine how well the graphic method performed compared to an aural approach.
Nine different speakers were recorded uttering a fake threat. The speakers used different disguises such as dialect, accent, whisper, falsetto etc., as well as the verbatim threat in a normal voice.
Using high quality recordings, visual comparison of the Praat graphs of LTAS outperformed the aural approach in identifying the disguised voices. Performing speaker identification aurally does not mean analyzing a different sample than the one being analyzed acoustically. Studies of aural perception show a hypothesizing, top-down, active process, which raises interesting questions regarding aural speaker identification with bad quality recordings, noisy backgrounds, etc. However, more tests on telephone quality recordings, authentic material and additional types of acoustic measurements must be performed before anything can be said about LTAS with implications for forensic purposes.
Method
The sixteen disguised voices and suspects (references) were recorded by six females and three males. The recordings were made with a high quality microphone in front of a personal computer; the subjects recorded one normal voice and as many disguised voices as they wanted, repeating the same fake threat in Swedish. All recordings were between four and six seconds long and sampled at 16 kHz. Forced choice was applied in both the aural and visual tests.
The graphic representations of LTAS were created from an LTAS object using 100 Hz frequency bins (Boersma & Weenink, 2005).
Visual test: N of disguised voices = 16, N of subjects = 10, alpha = 0.91.

[Figure: percentage of correct visual identifications per disguised voice sample (1–16).]

Aural test: N of disguised voices = 16, N of subjects = 7, alpha = 0.83.

The reliability score is lower in this test compared to the visual test. However, the correlation is high enough to be interpreted as a rather high correlation between subjects.

Figure 2 shows how many correct identifications were made per disguised voice sample. Some graphs were obviously very difficult to identify. Why that is so, or how those graphs differ, has not yet been investigated.

[Figures: percentage of correct identifications per disguised voice sample (1–16) and per subject, for both tests.]
References
Boersma P. & Weenink D. (2005) Praat: doing phonetics by computer (Version 4.3.01) [Computer program]. Retrieved from <http://www.praat.org/>
Boves L. (1984) The phonetic basis of perceptual ratings of running speech. Dordrecht: Foris Publications.
Gauffin J. & Sundberg J. (1977) Clinical application of acoustic voice analysis. Part II: Acoustic analysis, results. 1977/2-3: 39-43.
Grosjean F. (1980) Spoken word recognition processes and the gating paradigm. Perception and Psychophysics, 28, 267-283.
Hollien H. & Majewski W. (1977) Speaker identification by long-term spectra under normal and distorted speech conditions. Journal of the Acoustical Society of America 62: 975-980.
Keller E. (2004) The analysis of voice quality in speech processing. In Lecture notes in computer science. Berlin: Springer Verlag.
Kersta L. G. (1962) Voiceprint identification. Nature 196: 1253-1257.
Kitzing P. (1986) LTAS criteria pertinent to the measurement of voice quality. Journal of Phonetics, 14: 477-482.
Künzel H. J. (2001) Beware of the 'telephone effect': The influence of telephone transmission on the measurement of formant frequencies. Forensic Linguistics 8: 80-99.
Löfqvist A. (1986) The long-time-average spectrum as a tool in voice research. Journal of Phonetics, 14: 471-475.
Löfqvist A. & Mandersson B. (1987) Long-time average spectrum of speech and voice analysis. Folia Phoniatrica, 39: 221-229.
Nordenberg M. & Sundberg J. (2003) Effect on LTAS of vocal loudness variation. TMH-QPSR, KTH, 45: 93-100.
Rose P. (2002) Forensic Speaker Identification. New York: Taylor & Francis.
Stevens K. N. (1993) Lexical access from features. In Speech communication group working papers (Vol. VIII, p. 119-144). Research Laboratory of Electronics, Massachusetts Institute of Technology.
Conclusions
General advantages of graphic representations are:
- Intra-subjectively applicable (depending on the amount of data).
- Relatively simple fundamentals for calculation.
- Rather easy to visualize.
General disadvantages are:
- Difficult to quantify and substantiate comparisons.
- The visualization depends on F0 and vocal loudness variations.
- An average always ignores specific events in the speech signal.
Considering the categorical, top-down, active human speech perception process (Grosjean, 1980), it is interesting to find visual acoustic information complementary to aural methods in forensic speaker identification. When two voice samples are compared, the same input is judged, whether aurally or acoustically; the question is how it is analyzed, and how the acoustic visual and the aural perceptual information are processed. If a better understanding of the relation between the two is reached, objective methods can be used to judge similarities, and objective acoustic methods, like subjective aural ones, can be excluded on well-grounded arguments. This could also lead to better statistical data in forensic speaker identification, if computer-based methods can be used under more confident supervision. It is clear that aural mistakes are made, especially for disguised voices.
The graphic representations used in this experiment are not claimed to be complete images reflecting the voice of a speaker. They are but examples showing that in some cases visual acoustic input is better at discriminating between speakers than are ears alone.
Abstract
The most successful methods for inducing emotions in state-of-the-art unit selection speech synthesis switch speech databases depending on the desired emotion. These methods require a substantial increase in memory compared to a single database and are computationally slow. The model-based approach is an attempt to reshape a neutrally recorded utterance (comparable to the desired output from a modern unit selection system) so that it simulates a recorded model of a desired emotion.
Factors for the manipulation of duration, amplitude and formant shift ratio are calculated by comparing the recorded neutral utterance with three recorded basic emotional models, in accordance with discrete emotion theory: sadness, happiness and anger. F0 (regarded as the intonation) is copied from the model and imposed on the neutrally recorded utterance.
The evaluation of the experiment shows that subjects easily categorize the discrete emotions in a forced choice. For the male voice, they also grade the emotional quality of the resynthesis from the neutrally recorded utterance almost as high as that of the naturally recorded models. The female voice created more difficulties and contained more synthetic artifacts, i.e. it was judged to have a lower quality than the recorded models.
Method
Two speakers, one female and one male, were recorded uttering the same sentence (Jag har inte varit där idag) in four different expressive styles: natural, sad, happy and angry. The recordings were made in a studio environment using a high-quality microphone. The speakers were first told to consider how to express the emotions in speech with respect to duration, amplitude and intonation. They were then told to express the emotions as clearly as possible while being recorded, even though the semantic content did not suggest a specific emotion.

Each recorded emotion was then used both as a model for inducing the specific emotion in the neutrally recorded utterance and as a reference against which the resynthesized speech could be compared. If one uses the same speaker, and the differences calculated from the same utterance produced with different emotions, one should be able to resynthesize at least the specific parameters correctly. Six subjects finally evaluated the results by categorizing and grading the neutral recording, the recorded models and the three resynthesized versions for each of the two speakers, i.e. fourteen utterances of the same sentence.
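The global factors this method relies on can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the definition of the factors as a simple length ratio and RMS-energy ratio are assumptions.

```python
import numpy as np

# Hypothetical sketch: derive global duration and amplitude factors
# from a neutral utterance and an emotional model utterance.
# `neutral` and `model` are mono waveforms at the same sampling rate.

def manipulation_factors(neutral, model):
    dur_factor = len(model) / len(neutral)      # duration ratio
    def rms(x):
        return np.sqrt(np.mean(np.square(x)))
    amp_factor = rms(model) / rms(neutral)      # amplitude (energy) ratio
    return dur_factor, amp_factor

sr = 16000
t_neutral = np.arange(sr) / sr                  # 1.0 s
t_model = np.arange(int(1.5 * sr)) / sr         # 1.5 s
neutral = np.sin(2 * np.pi * 220 * t_neutral)
model = 0.5 * np.sin(2 * np.pi * 220 * t_model)

d, a = manipulation_factors(neutral, model)
print(round(d, 2), round(a, 2))  # 1.5 0.5
```

A resynthesizer would then stretch the neutral utterance by the duration factor and scale it by the amplitude factor; formant shifting and F0 transplantation require a dedicated tool such as Praat.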
Acoustic measurements for the seven utterances of each speaker (one block per speaker):

F0 std   Ampl (dB)   F1 mean   F2 mean   F3 mean
24       68          519       1482      2644
18       69          405       1300      2592
16       69          512       1508      2517
52       75          528       1464      2602
46       72          522       1451      2629
8        70          517       1367      2672
5        68          507       1452      2629

F0 mean   F0 std   Ampl (dB)   F1 mean   F2 mean   F3 mean
172       17       70          573       1670      2687
328       73       67          587       1535      2783
311       25       68          610       1651      2681
358       119      77          707       1661      2709
349       107      73          608       1734      2767
250       53       77          638       1658      2649
236       52       74          614       1689      2686
[Table: naturalness grades for the male speaker's natural, sad, happy and angry utterances (recorded models and resynthesized versions).]
The average naturalness score for the four resynthesized male samples is 3.37, while the overall average for the recorded models is 4.29. Whether this decrease in naturalness is acceptable has not been investigated. Categorizing the male samples created no problems, with one uncertain exception (0.7 happy-neutral). This means that there is a trade-off between naturalness and a computationally cheap method.
[Table: naturalness grades for the female speaker's natural, sad, happy and angry utterances (recorded models and resynthesized versions).]
References
Boersma P. & Weenink D. (2005) Praat: doing
phonetics by computer (Version 4.3.07)
Abstract
Background
Introduction
Speech Data
The speech data used for pronunciation variation modelling has not been recorded specifically for this project, but has been collected
from various sources. The speech corpus includes data recorded or made available for research within the fields of phonetics, phonology and speech technology in different earlier
research projects. The speech data has been selected to be dialectally homogeneous, to avoid
dialectal pronunciation variation. The language
variety used is central standard Swedish.
The speech data has been recorded in different situations, and speaking-style-related variables are defined from the speaking situation. The data collected for the project includes radio news broadcasts and interviews, spontaneous dialogues, elicited monologues, acted readings of children's books, neutral readings of non-fiction and recordings of dialogue system interaction.

Methods and software for annotation have been developed using mainly the VAKOS corpus (Bannert and Czigler, 1999) as the target to be annotated. This corpus was originally recorded and annotated for the study of variation in consonant clusters in central standard Swedish. It consists of ~103 minutes of monologue from ten native speakers of central standard Swedish.
Method
Automatic methods are used (with some minor exceptions) for annotating spoken language data where annotation is not supplied with the corpora used. The word level annotation is the base for all other annotation. The automatically obtained word boundaries and orthographic transcripts are manually corrected. In this way, relatively little manual work can give a large gain in annotation performance for most types of annotation.
The annotation system is built as a serialised
set of modules, producing output at different
levels. The output can be manually edited and
used as input to modules later in the chain.
Annotation Structure
All annotation is connected to some duration-based unit at one of six hierarchically ordered tiers. The tiers correspond to 1) the discourse,
2) the utterance, 3) the phrase, 4) the word, 5)
the syllable and 6) the phoneme. Each tier is
strictly sequentially segmented into its respective type of units. Some non-word units can be
introduced in the word tier annotation to ensure
that parts of the signal that are not speech can
be annotated, e.g. pauses and inhalations.
A boundary on a higher tier is always also a
boundary on a lower tier. An utterance boundary is thus also always a phrase boundary, a
word boundary, a syllable boundary and a phoneme boundary. Thus, information can be unambiguously inherited from units on higher
tiers to units on the tiers below.
Having the information stored at different
tiers enables easy access to the sequential context information, i.e., properties of the units adjacent to the current unit at the respective tiers.
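The boundary constraint described above can be made concrete with a small sketch (the times are invented; this illustrates the constraint, not the project's actual data format):

```python
# Each tier is a list of (start, end) units in seconds; a boundary on
# a higher tier must also be a boundary on every tier below it.
discourse = [(0.0, 5.0)]
utterance = [(0.0, 2.5), (2.5, 5.0)]
word = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0), (4.0, 5.0)]

def boundaries(tier):
    """Collect all unit edges of a tier as a set of time points."""
    return {t for unit in tier for t in unit}

# Higher-tier boundaries are a subset of lower-tier boundaries:
assert boundaries(discourse) <= boundaries(utterance) <= boundaries(word)
print("tier hierarchy is consistent")
```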
Segmentation
Each annotation tier is segmented into its corresponding units, beginning at the word tier.
Based on the word tier segmentation and information derived from the word units, the tiers
above and below the word tier are segmented.
The phoneme tier is segmented word-by-word
using the orthographic annotation, a canonical
pronunciation lexicon and an HMM phoneme
aligner, NALIGN (Sjlander, 2003). The phonemes are clustered into syllables with forced
syllable boundaries at word boundaries and the
syllable tier is segmented using this clustering
and the durational boundaries from the phoneme segmentation. Utterance boundaries are
Model Performance
The annotation has been used for decision tree
model induction (initial results are reported in
Jande, 2004). The decision tree pronunciation
variation model works with phonemes in a canonical phonemic pronunciation representation
as its central units. A vector containing all
available context information is connected to
each canonical phoneme. For each canonical
phoneme, the model makes a decision about the
appropriate phone realisation given the context
associated with the canonical phoneme.
Decision tree models trained on annotation
from elicited monologue showed a PER of
9.91% when evaluated against the same type of
data as they were trained on, in a tenfold cross-validation setting. This meant a 55.25% error
reduction compared to using the canonical pronunciation representation for estimating the
phonetic realisation.
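As a sanity check on these figures (assuming the error reduction is expressed relative to the baseline error), the phone error rate of the canonical baseline implied by the reported numbers can be recovered:

```python
# If PER_model = PER_baseline * (1 - reduction), then a model PER of
# 9.91% with a 55.25% reduction implies a baseline PER of about 22.1%.
model_per = 9.91      # percent
reduction = 55.25     # percent
baseline_per = model_per / (1 - reduction / 100)
print(round(baseline_per, 2))  # 22.15
```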
The decision tree models were pruned to
make them more general (less specific to the
particular training data from which they were
induced). Thus, not all variables were used in
the final models. None of the discourse or utterance tier attributes were used in any of the
pruned models, probably due to the fact that
only one type of speaking style was used. From
the phrase, word, syllable and phoneme tiers,
many different types of attributes were used. As
could be expected, the identity of the canonical
phoneme was the primary phone level realisation predictor.
Conclusions
A system for annotation of speech data with
variables hypothesised to be important for the
pronunciation of words in discourse context has
been described. Automatic methods used for
obtaining or estimating variables have been
presented. The annotation has been used for
creating pronunciation variation models in the
form of decision trees. The models show a 55.25% decrease in phone error rate compared to using canonical phonemic word representations from a pronunciation lexicon.
References
Allwood J., Björnberg M., Grönqvist L., Ahlsén E. and Ottesjö C. (2000) The Spoken Language Corpus at the Linguistics Department, Göteborg University. Forum Qualitative Social Research 1.
Abstract
The Pairwise Variability Index (PVI), a measure of how much unit-to-unit variation there is
in speech, has been used as a correlate of
rhythmic impressions such as syllable-timed
and stress-timed. Grabe and Low (2002) included Estonian among a number of languages
for which they calculated the durational PVI
for vowels and for intervocalic intervals, but
other than that Estonian rhythm has not been
studied within recent approaches to rhythm
calculation. The pilot experiment reported in
this paper compares the use of various speech
units for the characterisation of Estonian
speech rhythm. It is concluded that the durational PVI of the syllable and of the foot are
more appropriate for capturing the rhythmic
complexity of Estonian, and might provide a
subtle tool for characterising languages in general.
Introduction
The Pairwise Variability Index (PVI) is a
quantitative measure of acoustic correlates of
speech rhythm which calculates the patterning
of successive vocalic and intervocalic (or consonantal) intervals showing how one linguistic
unit differs from its neighbour. It was first applied, at the suggestion of the second author of
this paper, by Low (1998: 25) in her study of
Singapore English rhythm. Among other measures Low compared successive vowel durations
and showed that Singapore English had a lower
average PVI over utterances than British English. This fits in with the impressionistic observation that Singapore English is more syllable
timed than British English.
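The index comes in a raw and a normalised variant; both can be sketched briefly (the durations below are invented, and which variant a given study uses varies, so both are shown under that caveat):

```python
def raw_pvi(durations):
    """Mean absolute difference between successive durations."""
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)

def npvi(durations):
    """Normalised PVI: each pairwise difference is divided by the
    pair mean, making the index independent of speech rate."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

vowel_durations_ms = [80, 120, 70, 140, 90]  # made-up vowel durations
print(raw_pvi(vowel_durations_ms))           # 52.5
print(round(npvi(vowel_durations_ms), 1))    # 50.7
```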
Syllable timing (Abercrombie, 1967: 97)
carries the implication that speakers make syllables the same length, and is opposed to stress
timing, the tendency to compress syllables
where necessary to yield isochronous feet (i.e.
inter-stress intervals). Attempts to find
isochrony of either kind have produced disappointing results, even for languages canonically perceived as syllable-timed (e.g. French)
or stress-timed (e.g. British English). The PVI,
however, shifted the emphasis from absolute
Method
Materials
The materials used were the first four sentences
of a read passage recorded for intonation analysis in Asu (2004). The four sentences comprised 62 syllables and (depending on the
speaker) four to seven intonation phrases. This
is less than half the 160 or so syllables on
which Grabe and Low's (2002) Estonian results are based, but in compensation the present analysis uses data from five speakers compared to only one in theirs. The speakers are all female speakers of Standard Estonian who were asked to read the passage at a normal tempo.
Three subjects were recorded in a quiet environment in Tartu, Estonia, and two in the
sound-treated booth of the phonetics laboratory
of Cambridge University.
Results
Discussion
Conclusion
This paper has presented a preliminary investigation of Estonian rhythm, comparing a number
of measures, each of which expressed the fluctuation in duration between successive phonological units. It has been argued that the common practice of characterising languages in
terms of pairwise variability of vowels and
intervocalic intervals may be less appropriate
than using variability measures of phonological
syllables and of stress feet. This is particularly
so when the results are to be related to impressionistic characterisations in terms of syllable-timing and stress-timing. However, the point is made that these terms are not opposites ranged on a single continuum, but two independent parameters along which languages can vary.
References
Abercrombie D. (1967) Elements of General
Phonetics. Edinburgh: Edinburgh University
Press.
Abstract
Recordings of Finnish casual dialogue and
careful reading were analyzed auditorily and
on spectrograms. Syllables on the phonological
level were compared to syllable-sized units
(CVs) on the phonetic level. Comparisons
with existing Swedish and Spanish data revealed several differences: Finnish had much
less temporal equalization of syllable-sized
units in casual speech than Swedish, and even
slightly less than Spanish. Instead, there was a
greater decrease in the number of CVs. In all
three languages, the duration of a CV was
partly dependent on its size. In Finnish, however, (in contrast to Swedish and, to a lesser
degree, Spanish) CV duration was only marginally affected by lexical stress. Finnish, like
Spanish, had rhythmic patterns typical of syllable-timed languages in both speaking styles,
while Swedish changed from a more stress-timed pattern in careful reading to a more syllable-timed pattern in casual speech.
Methods
The Finnish speech material consisted of a
lively dialogue between native speakers PJ and
EK, and text reading (PJ). The dialogue was
recorded in the early 1990s and was used
for segment duration analyses (Engstrand and
Krull, 1994). The text reading was recorded in
2005. The Swedish and Spanish materials cited
for comparisons come from Engstrand and
Krull 2001, 2002. All recordings were made in
sound-treated recording studios using high
quality professional equipment.
The digitized Finnish material was segmented into syllable-sized units and labeled using the Soundswell Signal Workstation
(http://www.hitech.se/development/products/so
undswell.htm). Since casual speech is characterized by numerous coarticulation and reduction phenomena, a conventional
morphophonetically based syllabification was not possible. Reliable identification and segmentation
of units posed problems; e.g. the Swedish word
behandla could be pronounced as [bela].
Therefore, contoid-vocoid(-contoid) sequences
mirroring opening-closing movements were
chosen as units (see Engstrand and Krull,
2002). For simplicity, they will be referred to as CV units, where C may be a single contoid or a cluster and V a single vocoid or a diphthong. The term unit is used in a strictly phonetic sense; a unit may sometimes contain traces of underlying segments.
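The grouping into CV units can be sketched with a toy orthographic example. The real segmentation was auditory and spectrographic; the vowel inventory and the rule that word-final contoids close the last unit are simplifying assumptions made for this illustration.

```python
def cv_units(phones, vowels="aeiouyäöå"):
    """Group a phone string into contoid-vocoid(-contoid) units:
    a new unit starts at each contoid that follows a vocoid."""
    units, current = [], ""
    for ch in phones:
        if current and current[-1] in vowels and ch not in vowels:
            units.append(current)
            current = ch
        else:
            current += ch
    if current:
        if units and all(c not in vowels for c in current):
            units[-1] += current  # word-final contoids close the last unit
        else:
            units.append(current)
    return units

print("-".join(cv_units("myöskielellisen")))  # myö-skie-le-lli-sen
```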
The segmentation was carried out auditorily
and visually on spectrograms in the same manner as with the Swedish and Spanish material
(Engstrand and Krull 2001, 2002). Onsets consisted where possible of a single contoid or a
contoid cluster, and a unit was considered well-formed if it agreed with Jespersen's sonority hierarchy (Jespersen 1926). No consideration was given to the phonotactics of a given language or to word and morpheme boundaries. For example, the Finnish words myös kielellisen would be segmented as myö-skie-le-lli-sen re-
Introduction
The complex Swedish phonotactics has been shown to be considerably simplified in casual
speech (Engstrand and Krull, 2001). Syllables
containing heavy consonant clusters on the
phonological level were often represented by
alternating simple contoid-vocoid sequences in
casual conversation. As a consequence, the durations of syllable-sized units tended to be
equalized and the rhythmic pattern of Swedish
came closer to that of a syllable timed language
such as Spanish (Engstrand and Krull, 2002).
The present paper addresses the question:
How would the durations of syllable-sized units
in different Finnish speaking styles compare
with Swedish and Spanish? On the one hand,
Finnish resembles Spanish in the simplicity of its phonotactics, which would lead us to expect a similarity to the Spanish pattern. On the other hand,
there is a segmental short-long contrast as in
Swedish, although not limited to stressed syllables. Phonetically, the difference between short
and long segments is larger in Finnish and there
Results
Table 1 shows the incidence of phonetic syllable-sized units vs. phonologically determined
syllables in Finnish. Swedish data (Engstrand
and Krull 2001) were added for comparison.
(The duration of units in prepausal positions
was not included.) It can be seen that in both
languages, the total number of phonetic units
was lower than the corresponding number of
phonological syllables. The decrease was larger
for the Finnish speakers: 12% (PJ) and 15%
(EK) in casual speech, while the corresponding figures for the Swedish speakers were 9% (RL) and 10% (JS). In the reading condition, the decrease was 10% for Finnish and only 2% for Swedish.
In addition, Table 1 shows that the share of
open units (i.e. sequences ending in a vowel or
vocoid) was larger in the phonetic representation of both languages. The increase was much
larger for the Swedish speakers: from 27% to
73% (RL), 40% to 73% (JS), and 31% to 62%
(GT, read text). For Finnish, the corresponding
increase was from 58% to 80% (PJ), 61% to
79% (EK), and 57% to 77% (PJ, read).
Table 1. Phonetic syllable-sized units and phonological syllables in casual and read Finnish. Swedish data (Engstrand and Krull 2001) are added for comparison.

Language, speaker,   Analysis   No. of   % open   % CV    % CCV
condition                       syll.    syll.    and V
Finn. PJ casual      Phonet.     960      80       73      7
                     Phonol.    1097      58       58      0
Finn. EK casual      Phonet.     584      79       67     11
                     Phonol.     647      61       61      0
Finn. PJ read        Phonet.     876      77       71      7
                     Phonol.     954      57       57      0
Swed. RL casual      Phonet.     822      73       61     12
                     Phonol.     900      27       25      2
Swed. JS casual      Phonet.     543      81       67     13
                     Phonol.     491      40       37      2
Swed. GT read        Phonet.     977      62       53      9
                     Phonol.     997      31       29      2

Mean durations (ms) of CV units, with standard deviations and number of units:

Language, speaker   Condition   Mean   Std     N
Finnish PJ          Read         156    48   675
Finnish PJ          Casual       154    42   761
Finnish EK          Casual       166    48   429
Swedish             Read         200    86   350
Swedish             Casual       178    62   306
Spanish             Read         156    59   167
Spanish             Casual       155    51   287
Another difference between the two languages was the share of simple syllables
(mainly CV, in a few cases V).

Figure 1. Distributions of CV-unit durations (ms) in Finnish, Swedish and Spanish in two speaking conditions: upper row read text, lower row casual speech. Data affected by prepausal lengthening are removed. (The Swedish and Spanish data are from Engstrand and Krull, 2002.)

In Finnish, such
structures made up more than half of the syllables on the phonological level, while in Swedish the corresponding amount was less than a
third. Compared to Swedish, therefore, Finnish allows fewer possibilities for syllable simplification.

It appears that instead of simplifying syllables, the Finnish speakers dropped them. There was a relatively large decrease in the number of syllables between the phonological representation and its phonetic counterpart in Finnish, both in casual speech and in text reading. Another difference between Finnish and Swedish was found in the distribution of syllable durations. Although Finnish has a comparatively large difference in duration between short and long segments (see Engstrand and Krull, 1994 for data from speakers PJ and EK), the durations of (open) syllables tended to cluster within a narrow range around a peak (Figure 1). There was not much change in distribution breadth between reading and casual speech. Part of the explanation for this phenomenon may lie in the near-equality of the durations of stressed and unstressed syllables.
Figure 2. Mean durations (ms) as a function of CV unit size in read and unscripted Swedish and Finnish. Upper graphs show casual speech; lower graphs show text reading. Filled circles represent stressed units and triangles unstressed units.
References
Engstrand, O. and Krull, D. (1994). Durational correlates of quantity in Swedish, Finnish and Estonian: cross-language evidence for a theory of adaptive dispersion. Phonetica 51, No. 1-3.
Engstrand, O. and Krull, D. (2001). Simplification of phonotactic structures in unscripted
Swedish. J.I.P.A. 31, 41-50.
Engstrand, O. and Krull, D. (2002). Duration of
syllable-sized units in casual and elaborated
speech: cross-language observations on
Swedish and Spanish. In: Fonetik 2002,
TMH-QPSR Vol. 44.
Eriksson, A. (1991). Aspects of Swedish
speech rhythm. Gothenburg Monographs in
Linguistics 9. Department of Linguistics,
University of Göteborg.
Jespersen, O. (1926). Lehrbuch der Phonetik. 4.
Aufl., 190-91, Leipzig and Berlin: Teubner.
Abstract
In the present paper, recordings of Swedish on
multilingual ground from three different cities
in Sweden are compared and discussed.
Introduction
In Sweden, an increasing number of adolescents speak Swedish in new, foreign-sounding
ways. These new ways of speaking Swedish are
primarily found in the cities. The overarching
purpose of the research project Language and
language use among young people in multilingual urban settings is to describe and analyze
these new Swedish varieties (hereafter referred
to as Swedish on multilingual ground, SMG)
in Malmö, Gothenburg and Stockholm.
Most SMG varieties are known by names
that reveal where they are spoken: Rinkeby Swedish in Rinkeby, Stockholm, Gårdstenska in Gårdsten, Gothenburg, and Rosengård Swedish in Rosengård, Malmö. However, if you discuss Rinkeby Swedish with young people in Malmö, they will instantly associate it with Rosengård Swedish (i.e. with the corresponding Malmö SMG variety); if you play examples of Rosengård Swedish to teenagers in Lund, they will associate them with the Lund SMG variety Fladden (named after Norra Fladen), and so on. In other words, obvious similarities are perceived between different varieties of SMG.
Method
The material comes from the speech database
collected by the research project Language and
language use among young people in multilingual urban settings.
During the academic year 2002-2003, the
project collected a large amount of comparable
data in schools in Malmö, Gothenburg and Stockholm. The speakers are young people (mainly 17-year-olds) who attended the second year of the upper secondary school's educational program in social science during 2002-2003.
The recordings comprise both scripted and spontaneous speech. They include: (01) interviews between a project member and the participating pupils, (02) oral presentations given by the participating pupils, (03) classroom recordings, (04) pupil group discussions, and (05) recordings made by the pupils themselves (at home, during the lunch break, at cafés, etc.).

The recordings were made with portable minidisc recorders (SHARP MD-MT190) and electret condenser microphones (SONY ECM-717), and subsequently digitized.
The naturalness of the speech material has been obtained at the expense of good sound quality. Acoustic analyses using the speech analysis programs WaveSurfer and Praat have been undertaken when possible; other parts of the material have primarily been examined using auditory analysis.
Purpose
In the present paper, a first comparison between SMG materials recorded in Malmö, Gothenburg and Stockholm is undertaken with the object of searching for differences and similarities in the varieties' phonology and phonetics.
Previous research
Descriptions in the literature of so-called ethnic accents or (multi)ethnolects of different languages reveal some similarities. One example
of such a similarity is a staccato-like rhythm or
syllable-timing. A staccato-like rhythm has
been observed in e.g. Rinkeby Swedish (Kotsinas 1990), in the so-called Nuuk Danish spoken
by monolingual Danish-speaking adolescents in
Results
In the following, we will restrict ourselves to
describing a small set of SMG features that
demonstrate interesting differences and similarities between the cities.
Prosody
Word accents
It is a well known fact that L2 learners of
Swedish find it difficult to perceive and produce the word accent distinction. Given the
close relation between foreign accent and
SMG, one possible common feature of the
SMG varieties is a lack of word accent distinction.
Phonetically, the difference between accent
I and II is one of F0 peak timing. The F0 peak
of accent I has an earlier alignment with the
stressed syllable than accent II. In the Malmö dialect, the F0 peak is found at the beginning of the stressed syllable in accent I words, and at the end in accent II words. The same pattern can be seen in examples of Rosengård Swedish, see Figure 1.
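The alignment difference can be quantified as the relative position of the F0 peak within the stressed syllable; here is a sketch with an invented F0 track (values near 0 indicate an accent-I-like early peak, values near 1 an accent-II-like late peak):

```python
import numpy as np

def peak_alignment(times, f0, syll_start, syll_end):
    """Relative position of the F0 maximum within the stressed
    syllable: 0 = syllable onset, 1 = syllable offset."""
    mask = (times >= syll_start) & (times <= syll_end)
    peak_time = times[mask][np.argmax(f0[mask])]
    return (peak_time - syll_start) / (syll_end - syll_start)

times = np.linspace(0.0, 0.4, 401)                    # 0.4 s, 1 ms steps
# Toy F0 track: a peak placed early in the 0.1-0.3 s syllable.
f0 = 150 + 80 * np.exp(-((times - 0.12) ** 2) / (2 * 0.03 ** 2))
print(round(peak_alignment(times, f0, 0.1, 0.3), 2))  # 0.1 (early peak)
```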
Segmentals
and t
When we ran a perception experiment in Malmö with the object of investigating which of our informants spoke Rosengård Swedish (Hansson & Svensson 2004), we noted that one of the stimuli contained something typical of SMG at the very beginning of the recording. Instead of listening to the entire 30-second stimulus, the listeners (adolescents from Malmö) marked their answer after having heard only the first two prosodic phrases (approximately 6.5 seconds). The two phrases in question are given in (1).
(1) ja(g) ska gå plugga lite nu asså hon checkar språket å sånt ('I'm gonna go study a little now, like, she checks the language and stuff')
[Figure 1. F0 contour (100-400 Hz) over 'bra ti(ll) 'hem,sidor men.]
R sounds
If you ask a Scanian to imitate Rosengård Swedish, he or she will most likely use front r sounds. Indeed, among the SMG speakers in Malmö, the pronunciation of the r sound varies greatly. Out of the ten stimuli perceived as Rosengård Swedish, front r sounds can be heard in five. Among them, there are both fricative and approximant r sounds and the more perceptually salient trilled r sounds.
Also in the Stockholm SMG material, the r sounds differ from the regional dialect in that trilled r sounds appear to be used more frequently.
[Two F0 contour figures: one over de(t är) ''skit,go(d) 'mat, one over de(t är) tju(go)sju p(ro)cent som vill ha kvar kungen.]
The word accents in the Gothenburg SMG material still remain to be investigated.
In summary, the SMG varieties have both shared features and regional features.
Intonation
An expanded F0 range can be observed in utterances recorded in all three cities. The pattern
is found mainly in exclamations and rhetorical
questions, see Figures 3, 4 and 5.
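An expanded range is conveniently quantified in semitones, which makes it independent of the speaker's register; the values below are invented for illustration:

```python
import math

def f0_range_semitones(f0_min, f0_max):
    """F0 range expressed in semitones (12 semitones per doubling)."""
    return 12 * math.log2(f0_max / f0_min)

print(round(f0_range_semitones(100, 400)))  # 24: two octaves
print(round(f0_range_semitones(100, 200)))  # 12: one octave
```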
Discussion
How come there are similarities?
How come the different SMG varieties share the above-mentioned features? The relation to learner language and foreign accent is of course obvious in Malmö, Gothenburg and Stockholm alike, but a foreign accent can sound in a multitude of different ways.
One possible explanation is, of course, that all SMG varieties are influenced by the same language or language family. On the other hand, SMG does not sound like one particular foreign accent. Another possible explanation is that the varieties are characterized by features that are typologically unmarked and frequent in the world's languages. This is either related to the fact that many of those features exist in the teenagers' first languages, or to the fact that simplification and the use of unmarked features are generally favored in language contact situations (regardless of what the languages in contact are). A third explanation is that features of the Swedish language itself give rise to the varieties' similar sound, e.g. the difficulties encountered when learning Swedish.
All three alternatives probably have some explanatory power, although none of them completely accounts for why the varieties sound like they do. Word accents are unusual in the speakers' first languages, tend to disappear in language contact situations (as in Finland Swedish), and are difficult for second language learners to learn. A word accent distinction is, nevertheless, maintained in SMG.
[Two F0 contour figures (100-600 Hz) over the phrases ve vem smart nt and ja(g) ba hungri(g).]
In the present paper, we have presented a number of segmental and prosodic features that are common to all SMG varieties, but also discussed a feature that distinguishes them from each other (the word accent realization). Future research will reveal more similarities and differences and thereby, hopefully, shed some light on the relationship between the different SMG varieties on the one hand (e.g. whether city-hopping has occurred), and on the relationship between SMG and foreign accent on the other.
Acknowledgements
The research reported in this paper has been
financed by the Bank of Sweden Tercentenary
Foundation.
References
Hansson P. & Svensson G. (2004) Listening for
Rosengård Swedish. Proceedings
FONETIK 2004, The Swedish Phonetics
Conference, May 26-28 2004, 24-27.
Holmes J. & Ainsworth H. (1996) Syllable-timing and Maori English. Te Reo 39, 75-84.
Jacobsen B. (2000) Sprog i kontakt. Er der opstået en ny dansk dialekt i Grønland? En pilotundersøgelse. Grønlandsk kultur- og samfundsforskning 98/99, 37-50.
Kotsinas U-B. (1990) Svensk, invandrarsvensk eller invandrare? Om bedömning av främmande drag i ungdomsspråk. Andra symposiet om svenska som andraspråk i Göteborg 1989, 244-274.
Labov W. (2003) Pursuing the cascade model.
In Britain D. & Cheshire J. (eds) Social
Dialectology: In Honor of Peter Trudgill.
Amsterdam: John Benjamins.
Low, E. & Grabe, E. (1995). Prosodic patterns
in Singapore English. Proceedings of the
International Congress of Phonetic Sciences,
Stockholm 13-19 August 1995, 636-639.
Quist P. (2000) Ny københavnsk multietnolekt. Om sprogbrug blandt unge i sprogligt og kulturelt heterogene miljøer. Danske
Talesprog, 143-212. Copenhagen: C.A.
Reitzels Forlag.
Trudgill P. (1974) Linguistic Change and Diffusion: Description and Explanation in
Sociolinguistic Dialect Geography.
Language in Society 2, 215-246.
Udofot I. (2003) Stress and rhythm in the
Nigerian accent of English: A preliminary
investigation. English World-Wide 24: 2,
201-220.
Differences
Despite the similarities perceived between Rinkeby Swedish and Rosengård Swedish by adolescents in Malmö, many are surprised to hear that the Malmö adolescents perceive the soccer player Zlatan Ibrahimovic as a speaker of Rosengård Swedish (and not simply a speaker of the Malmö dialect). How large is the difference between SMG and the regional dialect? How large is the difference between e.g. Rosengård Swedish and Scanian? Although Rosengård Swedish clearly contains a number of non-regional features, not all speakers of Rosengård Swedish use all of those features, and many features of Rosengård Swedish are not distinct from the regional dialect (e.g. the word accents). Swedish on Multilingual Ground should, therefore, only be seen as an overarching term for a number of related but distinct varieties. SMG in Malmö appears to be Scanian on Multilingual Ground (which incidentally is reflected in the Advance Patrol member's artist name Gonza Blatteskånska).
Abstract
The results of an acoustic analysis and a perceptual evaluation of the role of prosody in spontaneously produced ja and sì in Swedish and Italian are reported and discussed in this paper. The hypothesis is that pitch contour, duration cues and relative intensity can be useful in identifying the different communicative functions of these short expressions taken out of their context. The results of the perceptual tests, run to verify whether the acoustic cues alone can be used to distinguish different functions of the same lexical items, are encouraging only for Italian sì, while for Swedish ja they show some confusion among the different categories.
Introduction
Short expressions such as "mh", "ah" and yes or no are widely produced during spontaneous conversation and seem to carry a variety of communicative functions. For instance, Gardner (2001) reports having recognized eight main types of "mm" in corpora of spoken English, while Cerrato (2003) reports that one of the most common functions that these short expressions carry out is that of feedback. Feedback can have different nuances of meaning; therefore the same expression can be produced in several contexts, to convey different communicative functions. It seems possible that the specific dialogue function of short utterances is reflected in their suprasegmental characteristics.
The focus of this paper is on the role of prosody in signaling specific dialogue functions for "ja" in Swedish and "sì" in Italian (i.e. yes), which are frequently used in natural conversational interaction and which are essential for the smooth progress of the communication process.
"Ja" and "sì" are used by dialogue participants to indicate that the current listener is following and willing to go on listening; that he/she is following but willing to take the turn; or to signal that the listener has understood what has been said so far and is still paying attention to the dialogue, prompting the speaker to go on. Moreover, "ja" and "sì" can be used to answer yes-no questions.
Analyses
A functional categorization of Italian "sì" and Swedish "ja" was first carried out using the labels reported in Table 1. The categorization was carried out by listening to the short expressions and considering the explicit function that they were carrying in the given context. Short expressions were
1 The two Italian dialogues were recorded in a sound-treated room at the University of Naples; they are part of the Italian corpus called CLIPS. More information about the CLIPS corpus is available on the web page: http://www.cirass.unina.it/ingresso.htm
2 The two Swedish dialogues are not part of a larger corpus. They were recorded in a sound-treated room at Stockholm University for the special purpose of analysing pre-aspiration phenomena in Swedish stop consonants. More information on the Swedish map-task dialogues is in Helgason, P., Preaspiration in the Nordic Languages: Synchronic and Diachronic Aspects. Doctoral thesis, Stockholm University, 2002.
Table 1. Functional labels.

  Label  Comment
  FCI    Feedback Continuation: I want to go on
  FCY    Feedback Continuation: you go on
  FA     Feedback Acceptance
  RP     Answer Positive: positive answer to a polar question

Table 2. Main prosodic characteristics per function (Swedish and Italian).

  Function  Characteristics
  FCY       Rising F0, lengthening
  FCI       Flat F0 (lengthening for speaker 1)
  FA        Rising F0 (Swedish); Falling F0 (Italian)
  RP        Rising and long F0 (if in context); Falling and short F0 (if in isolation)
Perceptual test
The test consisted of two sub-tests, one with the Italian stimuli submitted to 8 Italian listeners and one with the Swedish stimuli submitted to 8 Swedish listeners.
The test material consisted of 8 stimuli for each category, organized in two blocks of 34 stimuli for Italian and 34 stimuli for Swedish (the first two stimuli in each block being dummies). No manipulations were performed on the stimuli, in order to preserve their naturalness; however, for the categories RP and FCI there were not enough instances of stimuli per speaker, hence some of them were played twice. Before the experimental session the participants were given written instructions and were involved in a short training session to familiarize themselves with the task. During the experimental session the stimuli were presented individually over headphones in randomized order, and after each presentation the listener chose, on the answering sheet, one of the 4 available labels (reported in Table 1) for the function that they believed the stimulus carried out in the conversation.
The results, in the form of confusion matrices, for the Italian listeners judging the "sì" of the 2 Italian speakers are reported in Tables 4a and 4b. For the Italian stimuli, all the recognition rates appear to be above chance level. In Italian, FA and RP are confused with each other. This might depend on the fact that they have similar acoustic characteristics, in particular similar pitch contour and duration, the only difference being the higher intensity of the RP stimuli (+4 dB). FCY for Italian speaker 1 gets high recognition rates; this may be due to lengthening.
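How a per-category recognition rate is judged against chance level can be sketched from a confusion matrix. The counts below are invented for illustration only; they are not the actual data behind the perception test.

```python
# Per-category recognition rates from a confusion matrix
# (rows = intended function, columns = listeners' responses).
# NOTE: the counts are hypothetical, not the actual test data.
LABELS = ["FA", "FCI", "FCY", "RP"]

confusions = [
    [10, 1, 1, 4],   # responses to FA stimuli
    [2, 11, 2, 1],   # responses to FCI stimuli
    [1, 2, 12, 1],   # responses to FCY stimuli
    [5, 1, 1, 9],    # responses to RP stimuli
]

chance = 1 / len(LABELS)  # 0.25 with four response alternatives

# Recognition rate = diagonal count divided by the row total.
rates = {label: row[i] / sum(row)
         for i, (label, row) in enumerate(zip(LABELS, confusions))}
above_chance = {label: rate > chance for label, rate in rates.items()}
```

With these invented counts, all four categories come out above the 0.25 chance level; in the actual Swedish data the RP category did not.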
Table 3 shows the average duration in milliseconds of Italian "sì" for the 4 functions.
The results for the Swedish listeners judging the "ja" of the 2 Swedish speakers are reported in Tables 5a and 5b. For the Swedish stimuli, not all the recognition rates appear to be above chance level; RP is in fact not distinguished. This might depend on the fact that in Swedish positive answers
[Bar chart: average durations of the four functions (FA, FCI, FCY, RP) for speakers 1 and 2; vertical scale 0-250 ms.]
References
Cerrato L. & D'Imperio M. (2003) Duration and tonal characteristics of short expressions in Italian. In Proceedings of ICPhS 2003, Barcelona.
Cerrato L. (2003) A comparative study of verbal feedback in Italian and Swedish map-task dialogues. In Proceedings of the Nordic Symposium on the comparison of spoken languages, 2003, 99-126.
Cerrato L. (2004) A coding scheme for the annotation of feedback phenomena in conversational speech. LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces, Lisboa, 25-28.
Gardner R. (2001) When Listeners Talk. John Benjamins Publishing Company.
House D. (2005) Phrase-final rises as a prosodic feature in wh-questions in Swedish human-machine dialogues. Accepted in Speech Communication.
Köhler K.J. (2004) Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. In Fant G. et al. (eds) From traditional phonology to modern speech processing, 205-214. Beijing: Foreign Language Teaching and Research Press.
Ohala J.J. (1983) Cross-language use of pitch: an ethological view. Phonetica 40, 1-18.
Sjölander K. & Beskow J. (2000) WaveSurfer - an Open Source Speech Tool. Proceedings of ICSLP 2000, Beijing, China.
Acknowledgements
Special thanks to my supervisor David House for inspiring discussions about the results of this study.
Conclusions
The aim of this study was to investigate the acoustic and perceptual characteristics of spontaneously produced "ja" and "sì" in Swedish and Italian, in order to find out whether acoustic cues can be used to render and to recognize different communicative functions.
Abstract
This paper reports on a comparison of prosodic
variables from oral presentations in a first and
second language. Five Swedish natives who
speak English at the advanced-intermediate
level were recorded as they made the same
presentation twice, once in English and once in
Swedish. Though it was expected that speakers
would use more pitch variation when they
spoke Swedish, three of the five speakers
showed no significant difference between the
two languages. All speakers spoke more quickly
in Swedish, the mean being 20% faster.
Introduction
Two earlier contributions to the Annual Swedish Phonetics Conference have outlined ideas
for a feedback mechanism for public speaking.
Briefly, Hincks (2003) proposed that speech
technology be used to support the practice of
oral presentations. Speech recognition could
give feedback on repeated segmental errors
produced by non-natives as well as provide a
transcript of the presentation, which could then
be processed for lexical and syntactic appropriateness. Speech analysis could give feedback
on the speaker's prosodic variability and speaking rate. Hincks (2004) presented an analysis of
pitch variation in a corpus of second language
student presentation speech. Pitch variation was
measured as the standard deviation of F0 for
10-second long segments of speech, normalized
by dividing by the mean F0 for that segment.
This value was termed PVQ, for pitch variation
quotient. Hincks (forthcoming) reports on the
results of a perception test of speaker liveliness,
where a strong correlation (r = .83, n = 18, p <
.01) was found between speaker PVQ and perceptions of liveliness from a panel of eight
judges.
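The PVQ described above is a simple computation; a minimal sketch, assuming F0 values (in Hz) have already been extracted elsewhere for one 10-second segment:

```python
import statistics

def pvq(f0_values):
    """Pitch variation quotient: standard deviation of F0
    normalized by the mean F0 of the same segment.

    f0_values: F0 estimates (Hz) for voiced frames within one
    10-second segment; F0 extraction itself is assumed to be
    done by an external tool.
    """
    mean_f0 = statistics.mean(f0_values)
    return statistics.stdev(f0_values) / mean_f0

# A flat contour yields a low PVQ, a varied one a higher PVQ.
flat = pvq([118, 120, 122, 121, 119])
lively = pvq([90, 140, 200, 120, 160])
```

Because the quotient is normalized by the segment mean, it is comparable across speakers with different F0 registers.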
Though automatic feedback on the prosody
of public speaking could be useful for both first
and second language users, the abovementioned studies have been done on a corpus
of L2 English, where native Swedish students
of Technical English were recorded as they
made oral presentations as part of their course
Method
The goal for the data collection used for this
paper was to have a corpus where the same
speaker used both English and Swedish to
make the same presentation, with the same visual material. Because class time could not be
wasted with having students hear the same
presentation twice, the Swedish recordings
needed to be made outside the classroom. All
students studying English at KTH in the fall of
2004 (nearly 100 students) were contacted
and asked whether they would like to participate. They were told that they would first be
recorded in the classroom as they made their
presentations in English, and that they would
then meet in groups and make the same presentations in Swedish to each other. They were offered 150 SEK as compensation for the extra
time it would take. Unfortunately, only five
students were able to participate. These five,
three males and two females, were all intermediate students. They were first recorded in their
[Figure 1. Mean PVQ per speaker (M1, M2, M3, F1, F2), English vs. Swedish; vertical scale 0.06-0.26.]
Temporal measures
The male speakers spoke for a shorter length of
time when making the presentation in Swedish
than when using English, as shown in Figure 2.
[Figure 2. Presentation duration in seconds per speaker (M1-F2), English vs. Swedish; vertical scale 0-700 s.]
Speaking rate
Part of the reason the speakers could make their
presentations in a shorter period of time is that
they spoke on average 20% more quickly. Figure 3 shows the speaking rate per speaker in
syllables per second. The mean speaking rate in
English was 2.97 sps, and for Swedish was 3.58
sps. M3, the only student to use a lot more
pitch variation in Swedish than in English, also
spoke much more quickly in Swedish. Note
also that the two females are more stable in
their speaking rates, and that the fastest and
slowest speakers in one language maintain their
ranking in the other language.
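The rate comparison above is straightforward arithmetic; a small sketch using the mean rates reported in the text (the syllable count in the usage line is illustrative):

```python
def speaking_rate(n_syllables, duration_s):
    """Speaking rate in syllables per second (sps)."""
    return n_syllables / duration_s

# Mean rates reported above: 2.97 sps (English), 3.58 sps (Swedish).
english_sps = 2.97
swedish_sps = 3.58

# Relative increase when switching to the first language:
relative_increase = (swedish_sps - english_sps) / english_sps  # about 20%
```

For example, `speaking_rate(300, 100)` gives 3.0 sps, and the reported means yield a relative increase of roughly 0.205, i.e. the ~20% figure quoted in the text.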
Results
Pitch variation quotients
The mean PVQs per speaker for the two presentations are shown in Figure 1. For three of
the five speakers, there was very little difference in the PVQs when using English and when
using Swedish. Only one speaker, M3, had significantly lower PVQ speaking English, but another, F1, had lower PVQ when speaking
Swedish. Though there are only five speakers,
the mean values reflect the same range as that
found in the all-English corpus, with a low of
about 0.11 and a high of about 0.24.
[Figure 3. Speaking rate in syllables per second per speaker (M1-F2), English vs. Swedish.]
Language or performance?
The result that three of five speakers showed no
significant difference in PVQ depending on the
language they were using would seem to indicate that PVQ measures are more speaker-dependent than language-dependent, at least for
native speakers of Swedish. The hypothesis that
the speakers would use less pitch variation
when speaking English was not at all borne out
by the study. It seems that the PVQ depends
mostly on speaking style, and perhaps the energy one puts into performing in a certain
situation. The English presentation was a
higher-stakes event, where students were
speaking to more people and, most importantly,
receiving a grade on their work. Speaker F1
performed very well for her first presentation,
and with the high mean length of runs combined with higher-than-average mean PVQ,
probably would have received high liveliness
ratings had her speech been part of the perception test. It is interesting that she was the only
student to have lower PVQ values and the only
student to have lower MLR values in Swedish
than in English. This could indicate that she in
some way put less effort into performance for
the Swedish presentation. Speaker M3, on the
other hand, was either hampered by using English or relatively unprepared when making the
first presentation. He could have benefited by
rehearsing with a feedback mechanism beforehand.
For the purposes of a thesis grounded in
computer-assisted language learning, these results throw a bit of a wrench in the works. The
problems I am proposing to help may not depend on the use of a second language, but on
more basic features of speaking style. On the
other hand, at advanced levels of language
courses, it is difficult to separate the needs of
first and second language users. Furthermore,
many native speakers as well as non-natives
obviously have problems achieving an engaging speaking style, and it has never been my intention to propose a device restricted to non-native use.
[Figure 4. Mean length of runs (MLR) per speaker (M1-F2); vertical scale 0-14.]
Discussion
This study was performed on a small group of
speakers, and any results should be interpreted
with care. The students who participated were
paid volunteers, and in that sense cannot be
considered as representative of the population
to the same extent as the speakers recorded for
Further work
A small study is being planned to test the perception of liveliness in these speakers as they
used the two languages.
The corpus described in this chapter could
be augmented by a small number of speakers
over the period of several terms and could provide a wealth of further opportunities for language study. Comparison of the English and
Swedish transcripts will allow examination of
aspects such as how the speakers use pitch
movement in utterances that are comparable
content-wise. This could provide insight into
transfer of Swedish intonational patterns to
English. It is possible that with more speakers,
statistically significant differences in PVQ
could still be found. The differences in mean
speaking rate should also be further investigated; the 20% difference found in this group
would be interesting to pursue. Does the average Swedish speaker of English manage to say
only 80% of what a native speaker can say during the allotted time at a conference? Documenting such information about first and second language use would give valuable evidence
for those in positions of developing language
policy.
References
Hincks, R. (2003). Tutors, tools and assistants
for the L2 user. Phonum 9: 173-176, Umeå
University Department of Philosophy and
Linguistics.
Hincks, R. (2004). Standard deviation of F0 in
student monologue. Proceedings of Fonetik
2004, Stockholm, Department of Linguistics, Stockholm University.
Hincks, R. (forthcoming). Measures and perceptions of liveliness in student oral presentation speech: a proposal for an automatic
feedback mechanism. Accepted for publication in System.
Kormos, J. and M. Dénes (2004). Exploring
measures and perceptions of fluency in the
speech of second language learners. System
32: 145-164.
Mennen, I. (1998). Can language learners ever
acquire the intonation of a second language?
Proceedings of STiLL 98, Marholmen, Sweden, KTH Department of Speech, Music and
Hearing.
Pickering, L. (2004). The structure and function of intonational paragraphs in native and non-native speaker instructional discourse. English for Specific Purposes 23: 19-43.
Sjlander, K. and J. Beskow (2000). WaveSurfer: An open source speech tool. Proceedings of ICSLP 2000,
http://www.speech.kth.se/snack/.
Acknowledgements
My thanks to David House, the student speakers and especially to teacher Beyza Björkman, whose encouragement was important in
getting five volunteers for this study. This work
was funded by the Unit for Language and
Communication.
Abstract
Translation Studies are a subfield of Applied Linguistics concerned with the scientific study of translation and interpreting in their various media and forms: oral vs. written, simultaneous vs. consecutive, literary vs. technical, human vs. machine, direct vs. relais, remote vs. in situ, etc. While linguistics in general has a long tradition of both theoretical and experimental research into various aspects of translation and interpreting, the phonetics and phonology of this specialized form of intercultural communication have, until very recently, attracted little attention within the scientific community. The purpose of this paper is to summarize some recent findings of this research and to indicate potential directions for further studies into the phonetics and phonology of translation.
Interpreting
Interpreting, both in the simultaneous and in the
consecutive mode, involves linguistic choices
that have to be made by the interpreter at all
levels of language processing at a time when
the source language text, mostly oral speech,
has yet to be completed. Contents and structure, topic and focus, verbal references, phrasal attachments, presuppositions and often even the actual goals and intentions of the speaker may, at the very extreme, be entirely unresolved at the time of the original utterance when the interpreter has to perform.
On the other hand, empirical studies of the time constraints of simultaneous interpreting show that the décalage, i.e. the time delay between the source language input by the original speaker and the target language output by the interpreter, that is acceptable to normal listeners should not exceed two to three seconds.2
This double bind forces the professional interpreter at the very least to develop, in addition to his or her linguistic, mnemonic and anticipatory skills, a high degree of vocal and articulatory expertise in order to be able to continuously adjust to the speech rate properties and vocal demands of the actual situation.
In addition, as shown among others by Goldmann-Eisler (1972), Černov (1978), Shlesinger (1994) and Ahrens (2004), professional inter
Translation
Translation1, in the narrow sense, covers all
forms of the transfer of meaning from a source
language text into one or more target languages.
Both written and oral texts are translated, as
long as they are presented as a whole in a fixed,
finished and thus permanent form.
Clearly, the translation of written texts does not
normally involve any choices at the phonetic or
phonological level. However, as shown among
others by Paz (1971), Lefevere (1975), Kohlmayer (1996) and Weber (1998), expressive
texts including poetry, lyrics and drama, but
also scripted speech, video narrations and advertisements need to be translated in view of
their readability and their potential use in stage
performance. The successful transfer of rhyme,
rhythm, pausing patterns, alliteration, accentuation and word play, based on the segmental
and/or suprasegmental qualities of lexical and
phrasal units, will often be crucial to the usefulness
References
Ahrens B. (2004) Prosodie beim Simultandolmetschen. Frankfurt am Main: Peter Lang, Europäischer Verlag der Wissenschaften.
Černov G.V. (1978) Teorija i praktika sinchronnogo perevoda. Moskva: Meždunarodnye otnošenija.
Fodor I. (1976) Film Dubbing: Phonetic, Semiotic, Esthetic and Psychological Aspects. Hamburg: Buske.
Gile D. (1995) Regards sur la recherche en interprétation de conférence. Lille: Presses Universitaires.
Goldmann-Eisler F. (1972) Segmentation of Input in Simultaneous Translation. Journal of Psycholinguistic Research 1/2, 127-140.
Huber D. (1990) Prosodic transfer in spoken language interpreting. Proceedings of the International Conference on Spoken Language Processing, ICSLP 90 (Kobe, Japan), 509-512.
Kohlmayer R. (1996) Oscar Wilde in Deutschland und Österreich. Untersuchungen zur Rezeption der Komödien und zur Theorie der Bühnenübersetzung. Tübingen: Niemeyer.
Notes
1
Not to complicate matters, I neglect to include a lengthy discussion of the various uses and ambiguities of the term translation in this and other scientific disciplines such as physics, biogenetics, economics, theology, history (cf. translatio imperii) and others. Suffice it to say that even within the restricted scope of translation studies per se, translation as a scientific term is used somewhat incoherently, both in the narrow sense as translation proper (översättning, Übersetzung, traduction) and, in the wide sense, as the generic term to cover the whole field of translation and interpreting (tolkning, Dolmetschen, interpretation).
2
Abstract
Automatic speech recognition measures have been investigated as scores of segmental pronunciation quality. In an experiment, context-independent hidden Markov phone models were trained on native English and Swedish read child speech, respectively. Among the various scores studied, a likelihood ratio between the score of forced alignment using English phoneme models and the score of English or Swedish phoneme recognition had the highest correlations with human judgments. The best measures have the power of evaluating the coarse proficiency level of a child but need to be improved for detailed diagnostics of individual utterances and phonemes.
Theory
An approximation in this work is that pronunciation quality is composed of two components: knowledge and ability. The knowledge component reflects the speaker's knowledge of the correct phonetic transcription of a written text. The ability component reflects the speaker's ability to pronounce the phonemes of the target language correctly.
The knowledge score, Sk, can be formulated as a confidence measure that the speaker has chosen the correct transcription, TrCorrect, in his spoken utterance (U) of the written text. This is modeled by:
Introduction
Automatic evaluation of foreign language pronunciation presents possibilities for computer-assisted language learning as well as for prediction of speech recognition performance in a
non-native language. Although children constitute a very large portion of foreign language
learners, speech technology research in this
domain has previously been mainly focused on
adults. The current work has been produced as
part of the EU project PF-Star, in which one
goal was to assess the current possibilities of
speech technology for children.
Previous work has used the fact that the better you are at pronouncing the new language,
the more the utterances should resemble sounds
from the target language instead of the mother
tongue (Eskenazi, 1996; Matsunaga, Ogawa,
Yamaguchi and Imamura, 2003). However, the
pronunciation quality of read speech does not
depend solely on the ability to produce the phonemes correctly; it also depends on knowledge
of how the words should be pronounced. Spectral quality and time-related scores have shown
high correlation with human judgment
(Neumeyer, Franco, Digalakis and Weintraub,
2000; Cucchiarini, Strik and Boves, 2000).
The foreign language considered in this report is English and the mother tongue is Swedish and also Italian in some cases. The scoring
procedure used context-independent phoneme
  Sk = P(U | TrCorrect, T) / max_i [P(U | Tr_i, T)]
     = P(U | TrCorrect, T) / P(U | TrBest, T)          (1)

  Sa = P(U | TrCorrect, T) / P(U | TrCorrect, M)       (2)
if the correct phonetic transcription of the written text to be pronounced is known. M is the
set of mother language phoneme models. If the
  P(U | TrBest, T) / P(U | TrBest, M)                  (3)

  P(U | TrCorrect, T) / P(U | TrBest, M)               (4)
Implemented Scores
In this report we present results of the following
pronunciation score parameters:
Knowledge:
English forced alignment / English phoneme
recognition (EFA/EPR)
Ability:
English phoneme recognition / Swedish phoneme recognition (EPR/SPR)
Combined:
English forced alignment / Swedish phoneme recognition (EFA/SPR)
Fraction language use (FRAC); see below.
Rate of speech (ROS)
Utterance duration (DUR)
Experiments
Recognition performance tests and pronunciation evaluation experiments have been performed. Word recognition tests used the SVE
and ENG development sets, both containing
children of ages eight and nine only. The language model allowed any word to follow any
other with equal probabilities. The word insertion penalty was experimentally set to minimize
WER. In phoneme recognition tests the penalty
was non-optimized and equal to zero.
The English and Swedish phoneme models
were trained on ENG-tr and SVE1, respectively. The phoneme models consist of three
states. The 39 elements of the acoustic feature
vector are the 13 lowest mel frequency cepstral
coefficients (MFCC) including number 0, and
their first and second order time derivatives.
The mel filterbank is computed with 25 ms
Hamming window at a frame rate of 10 ms. The
output likelihood values of the recognizer are
logarithmic, which turns the implementation of
ratio between scores into subtraction instead of
division.
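The point that log likelihoods turn ratios into subtractions can be sketched directly. The numeric values below are invented for illustration; only the EFA/EPR/SPR score names come from the paper.

```python
def likelihood_ratio_score(log_p_num, log_p_den):
    """Likelihood ratio in the log domain.

    Since the recognizer outputs log likelihoods, the ratio
    between two scores becomes a subtraction of log scores.
    """
    return log_p_num - log_p_den

# Hypothetical log likelihoods for one utterance:
efa = -512.0   # English forced alignment
epr = -498.5   # English phoneme recognition
spr = -520.3   # Swedish phoneme recognition

knowledge = likelihood_ratio_score(efa, epr)  # EFA/EPR
ability = likelihood_ratio_score(epr, spr)    # EPR/SPR
combined = likelihood_ratio_score(efa, spr)   # EFA/SPR
```

A score near zero means numerator and denominator models fit the utterance about equally well; a strongly negative knowledge score means free phoneme recognition beat the forced alignment by a wide margin.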
The scores were measured in different ways,
including or excluding non-speech intervals before and after the utterance and optional pauses
between words. In this report, the presented
In FRAC, both language model sets are active in parallel and the score is the percentage
of English language models selected by the recognizer.
scores include non-speech intervals, which generally performed better. Several other combinations of scores, models and normalization techniques have been studied by Oppelstrup (2005).
The pronunciation scores were correlated with human judgment of the utterances. The SWENG and ITEN speech files were scored by an English teacher with phonetic training. Segmental and prosodic qualities were judged separately. Each utterance was scored on a three-grade scale: 3 for correct pronunciation, 2 for small errors and 1 for erroneous utterances. To get a grade per child, the average of all graded sentences was calculated. At the time of the experiments, the ENG database had no human judgments but was given the score 3, assuming that all English children had a correct pronunciation. Afterwards, judgments have been made also on the English children and some results including these are given in this report.
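The evaluation procedure described here (average the utterance grades per child, then correlate the automatic scores with the human grades) can be sketched as follows. This is a generic Pearson correlation, not the authors' actual implementation.

```python
from statistics import mean, stdev

def grade_per_child(utterance_grades):
    """Average of all graded sentences for one child (3/2/1 scale)."""
    return mean(utterance_grades)

def pearson_r(xs, ys):
    """Pearson correlation between automatic scores and human grades."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))
```

Averaging over a child's utterances before correlating is what raised the correlations in the study, since single-utterance scores are noisy.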
Table 3. Correlations between automatic scores and human judgment of segmental quality per test set (S = Swedish, I = Italian, E = English children).

  Score      S      I      S+I    S+I+E
  EFA/EPR    0.20   0.61   0.65   0.70/0.75
  EPR/SPR    0.18   0.26   0.56   0.80/0.68
  EFA/SPR    0.20   0.37   0.60   0.82/0.72
  FRAC       0.22   0.09   0.48   0.82/0.71
  ROS        0.18  -0.25   0.43   0.57/0.42
  DUR       -0.11  -0.12  -0.54  -0.47/-0.47
Results
Word and phoneme recognition
Results of the word and phoneme recognition experiments are shown in Tables 1 and 2, respectively. The error rates are generally quite high, which is not surprising considering the combined difficulties of child, non-native speech from different databases and a high-perplexity language model.
Table 1. Word error rates for the Swedish and English test sets.

  Test    Voc. size  Training  WER %
  SVE2    1051       SVE1      94
  ENG-te  1097       ENG-tr    54
  SWENG   617        ENG-tr    79
  ITEN    629        ENG-tr    85
Table 2. Phoneme error rates (%) for the different training and test sets.

  Training   SVE2   ENG-te   SWENG   ITEN
  ENG-tr      97      66      103     119
  SVE1        72      92       86      93
Pronunciation scoring
Correlation with human judgment was low for single utterances but increased when averaging the scores of all utterances by a child. Table 3 lists correlations between automatic scores and human judgment of segmental quality.
Figure 1. Automatic vs. human judgment of Swedish, Italian and English children for EFA/EPR (top) and EPR/SPR (bottom).
Discussion
The low accuracy of word and phoneme recognition even for native English children indicates that there is low discrimination between the phoneme models. Child speech recognition is very difficult in itself, and the small size of the training material allowed only context-independent phoneme models to be trained. Another difficulty was the varying recording conditions in the databases. These problems make the pronunciation evaluation uncertain.
Another fact that may lower the correlation with human judgment is that the human listener and the scoring algorithms have different targets for correct pronunciation. Whereas the human listener is likely to compare with neutral British English, the reference models for the system are trained on children with different regional accents.
As was expected from previous studies, correlation increased with the amount of available
data. Correlation for single utterances was
lower than for averages of all utterances from a
speaker.
The correlation within the Swedish children
was quite small. An explanation may be that the
scoring algorithms are not sensitive to the limited pronunciation variation in this group. The
correlation among Italians is larger and the
highest overall correlation is achieved when
including children from all the Italian, Swedish
and English sets. It is interesting to note that the Swedish phoneme models seemed to work equally well as mother-tongue models for Italian children as for Swedish children.
A separate procedure will probably be necessary to reject utterances that match poorly to both numerator and denominator models in the likelihood ratios, since the likelihood ratio of these utterances will be quite random.
Acknowledgement
This work was conducted as a Master of Science thesis at TMH, KTH, Stockholm, within the EU project PF-STAR, Preparing Future Multisensorial Interaction Research. The human pronunciation judgments were performed by Becky Hinks.
References
Blomberg, M. and Elenius, D. Collection and recognition of children's speech in the PF-Star project. Phonum 9, Dept. of Philosophy and Linguistics, Umeå University, 2003.
Cucchiarini, C., Strik, H. and Boves, L. Different aspects of expert pronunciation quality ratings and their relation to scores produced by recognition algorithms. Speech Communication, Vol 31, pp 109-119, 2000.
Eskenazi, M. Detection of foreign speakers' pronunciation errors for second language learning: preliminary results. Proc. of ICSLP 96, vol 3, 1996.
Iskra, D., Grosskopf, B., Marasek, K., van der Heuvel, H., Diehl, F. and Kissling, A. Speecon speech databases for consumer devices: Database specification and validation. Proc. of ICSLP 02, 2002.
Matsunaga, S., Ogawa, A., Yamaguchi, Y. and Imamura, A. Non-native English speech recognition using bilingual English lexicon and acoustic models. Proc. of ICME 03, pp 625-628, 2003.
Neumeyer, L., Franco, H., Digalakis, V. and Weintraub, M. Automatic scoring of pronunciation quality. Speech Communication, Vol 30, pp 83-93, 2000.
Oppelstrup, L. Speech Recognition used for Scoring of Children's Pronunciation of a Foreign Language. M.Sc. Thesis, TMH/KTH, Stockholm, 2005.
Conclusion
The best scoring functions correlate well enough with human judgments to allow coarse grading of a child's pronunciation quality. The context-independent models used are too insensitive, however, to allow scoring on the utterance or phoneme level. The most important improvement would be to use context-dependent phoneme models, trained on a large corpus with recordings of children with correct pronunciation and accent.
Abstract
This is a preliminary report of a study of some linguistic and interactive aspects available in an adult-child dyad where the child is partially hearing impaired, during the ages 8-20 months. The investigation involves a male child born with Hemifacial Microsomia. Audio and video recordings are used to collect data on child vocalization and parent-child interaction. Eye-tracking is used to measure eye movements when the child is presented with audio-visual stimuli. SECDI forms are applied to observe the development of the child's lexical production. Preliminary analyses indicate increased overall parental interactive behaviour. As babbling is somewhat delayed due to physical limitations, sign-supported Swedish is used to facilitate communication and language development. Further collection and analysis of data is in progress in search of valuable information about the child's linguistic development from a pathological perspective of language acquisition.
Introduction
The typical linguistic development during infancy can be regarded as the result of the interaction between biological and environmental
factors that leads the child's language to converge with the surrounding language. According to
the Ecological Theory of Language Acquisition
(Lacerda et al., 2004a), early language acquisition is an emergent consequence of multi-sensorial embodiment of the information available
in ecological adult-infant interaction settings.
In agreement with this theory, the basic linguistic referential function emerges from at least
two of the sensory dimensions available in the
speech interaction scene (Lacerda, 2003;
Lacerda, Gustavsson & Svärd, 2003). If there
are restraining biological conditions or a lack of
adequate interaction with the environment, the
child's linguistic development will generally
deviate from the expected age-dependent communicative competence. Under typical circumstances, a one-year-old child starts to use
adult-like word forms. By two years of age, the
Method
A Swedish mother is recorded monthly while
spontaneously interacting with her child. On
two occasions the father has participated during
the recording substituting the mother.
Subject
The subject is a Swedish male infant, followed from the
age of 8 months to 20 months together with his mother
and father. The child was born with Hemifacial
Microsomia, i.e., he was born without a left outer
and middle ear and without zygomatic or mandible bone
structure on the left side of the face. He also
has a slightly cleft soft palate and a split uvula.
The child was fed by sub-glottal probe until
seven weeks of age and by nasal probe up to 8
months of age. The boy has one older sister.
Recording sessions
Recording sessions take place in a laboratory at
the Department of Linguistics, Stockholm University. The mother receives a selection of toys,
with verbal instructions indicating the significance of using onomatopoetic sounds when appropriate.
Procedure
A digital video camera, Panasonic NV-DS11,
focusing on the boy and his parent was used.
Both parent and child were recorded by a
Fostex CR200 Compact Disc Recorder, with
wireless microphones, Sennheiser Microport
Transmitters, attached to their clothes, connected to a Sennheiser Microport Receiver
EMI1005. Audio-visual perception is studied
by Tobii (www.tobii.com), an eye-tracking system.
References
Anvil: www.dfki.de/~kipp/anvil
Bahrick, L. E. (2004). The development of perception in a multimodal environment. In
G.Bremmer & A. Slater (Eds.), Theories of
Infant Development (1st ed., pp. 90-120).
Oxford: Blackwell.
Brent, M.R. & Siskind, J.M. (2001) The role of
exposure to isolated words in early vocabulary development. Cognition, 81, B33-B44.
Dunn, L.M. & Dunn, L.M. (1981) Peabody Picture Vocabulary Test Revised. American
Guidance Service. Circle Pines, Minnesota.
Eriksson M. & Berglund, E. (1999) Swedish
early communicative development inventories: words and gestures. First Language,
19, 55-90.
Fernald, A., Taeschner, T., Dunn, J., Papousek,
M., de Boysson-Bardies, B. & Fukui, I.
(1989) A cross-language study of prosodic
modifications in mothers' and fathers'
speech to preverbal infants. Journal of Child
Language, 16, 477-501.
Fenson, L., Dale, P.S., Reznick, J.S., Thal, D.,
Bates, E., Hartung, J., Pethick, S. & Reilly,
J. (1993). The MacArthur Communicative
Development Inventories: User's guide and
technical manual. San Diego, CA: Singular.
Lacerda, F. (2003) Phonology: An emergent
consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal, 16, 41-59.
Lacerda, F., Gustavsson, L. & Svärd, N. (2003)
Implicit linguistic structure in connected
speech. PHONUM 9, Umeå, Sweden, 69-72.
Lacerda, F., Klintfors, E., Gustavsson, L.,
Lagerkvist, L. Marklund, E. & Sundberg, U.
(2004a) Ecological Theory of Language
Acquisition. In Berthouze, L., Kozima, H.,
Prince, C., Sandini, G., Stojanov, G., Metta,
G. & Balkenius, C. (Eds.) Proceedings of the
Fourth International Workshop on Epigenetic Robotics (EPIROB 2004), Lund University Cognitive Studies, 117, 147-148.
Lacerda, F., Marklund, E., Lagerkvist, L., Gustavsson, L., Klintfors, E. & Sundberg, U.
(2004b) On the linguistic implications of
context-bound adult-infant interactions. In
Berthouze, L., Kozima, H., Prince, C.,
Sandini, G., Stojanov, G., Metta, G. & Balkenius, C. (Eds.) Proceedings of the Fourth
International Workshop on Epigenetic Robotics (EPIROB 2004), Lund University
Cognitive Studies, 117, 149-150.
Conclusion
As a consequence of the child's congenital
physical handicap, the mother's interactive behavior seemed to increase. The child's verbal
production is impaired but steadily improving.
Passive verbal language seems to be present,
and an active form of verbal language with well-articulated words will probably come in time as
the physical impediments are attended to. Further collection and analysis of data will
hopefully give valuable information on linguistic
development from a pathological perspective of
language acquisition.
Abstract
On the basis of previous, small-scale analyses, it was hypothesized that most Swedish
children develop an adult-like command of
quantity-related durational patterns between
18 and 24 months of age. In this study, VC
structures produced in stressed position by
several Swedish 18- and 24-month-olds were
analyzed for durational correlates of the complementary V:C and VC: quantity patterns.
Durations were typically reminiscent of the
adult norm suggesting that, at 18 months of
age, Swedish children have acquired a basic
command of the durational correlates of the
quantity contrast. In consequence, quantity
development must start well ahead of that
age. It was also found that voicing had a considerable, adult-like effect on the duration of
postvocalic consonants at both ages. This effect was smaller in the American controls,
again indicating the presence of a language-specific phonetic pattern. The effect of voicing
on preconsonantal vowel duration was relatively moderate. The effect was also present
in the American 24-monthers, but less substantially than commonly observed in adults
speech. No significant voicing-induced vowel
lengthening effect was found in the American
18-monthers.
Methods
Subjects were drawn from a larger group of
Swedish and American English children at
the ages of 6, 12, 18, 24 and 30 months. Audio and video recordings were made as described in Engstrand et al. (2003). These recordings were subsequently digitized and
stored on DVD disks. The present study is
based on disyllabic words produced by
11 Swedish 18-monthers
11 Swedish 24-monthers
14 American 18-monthers
13 American 24-monthers.
Results
In this section, durational tendencies are reported; a full statistical treatment will be presented elsewhere.
Mean vowel-to-consonant durational ratios
pertaining to the Swedish 18- and 24-monthers are summarized in table 1 for the
long vowels (the V:C pattern) and in table 2
for the short vowels (the VC: pattern). For
example, table 1 shows that the V:/C average
ratio was 1.86 for the 18-month-olds, and that
this average was based on a total of 43 productions. Eight of the 11 18-month-olds produced measurable instances of the V:/C pattern. One child turned out to be a far outlier
and was left out. The inclusion of 7 out of 11
child averages is marked as 7(11) in the right
column of the table. The mean value for the
24-month-olds was similar, 1.96, but this
mean was based on more (124) productions.
This means that, on average, the long vowels
had almost twice the duration of the following
consonants. For both age groups, in other
words, the durational relationships are not far
from what can be observed in adult speech
(e.g., Elert 1964).
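As a minimal sketch of the ratio computation described above (the duration values below are hypothetical illustrations, not data from the study):

```python
def mean_vc_ratio(tokens):
    """Mean vowel-to-consonant duration ratio over (vowel_ms, consonant_ms) tokens."""
    ratios = [v / c for v, c in tokens]
    return sum(ratios) / len(ratios)

# Hypothetical V:C productions (long vowel, following consonant), durations in ms.
long_vowel_tokens = [(260, 140), (250, 130), (270, 150)]
print(round(mean_vc_ratio(long_vowel_tokens), 2))
```

Each child's mean is computed over that child's measurable tokens; the group values reported in the tables are then averages over children.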
Table 1. Mean vowel-to-consonant ratios for disyllabic words with long vowels produced by
Swedish 18- and 24-month-olds.
Age (mos.)    # children
18            7(11)
24            9(11)
Table 2. Mean vowel-to-consonant ratios for disyllabic words with short vowels produced by
Swedish 18- and 24-month-olds.
Age (mos.)    # children
18            11(11)
24            10(11)
The corresponding data for the 18-month-olds are presented in figure 2. Again, filled
and unfilled circles stand for V:C and VC:
Table 3. Mean vowel-to-consonant ratios for disyllabic words with long or short vowels followed by voiced (Vd) or voiceless (Vl) consonants. Subjects: Swedish 18- and 24-month-olds.

Age (mos.)  Vowel length  Cons. type  Ratio  Std.  N
18          Long          Vd          2.08   0.67  30
18          Long          Vl          1.18   0.54  13
18          Short         Vd          1.28   0.65  75
18          Short         Vl          0.51   0.18  72
24          Long          Vd          2.29   0.62  65
24          Long          Vl          1.53   0.59  50
24          Short         Vd          1.11   0.16  103
24          Short         Vl          0.70   0.31  105
Table 4. Mean vowel-to-consonant ratios for disyllabic words containing voiced or voiceless postvocalic consonants. Subjects: American 18- and 24-month-olds.

Age (mos.)  Cons. type  Ratio  Std.  N
18          Voiced      1.70   0.68  83
18          Voiceless   1.23   0.37  51
24          Voiced      2.01   0.91  128
24          Voiceless   1.31   0.79  77
[Table: mean durations (ms), standard deviations and N for long and short vowels and their following voiced or voiceless consonants (Swedish 18- and 24-month-olds); cell values garbled in the source.]
Acknowledgment
This work was supported by grant 421-2003-1757 from the Swedish Research Council
(VR) to O. Engstrand.
Age (mos.)        Voiced          Voiceless
                  V      C        V      C
18    Mean        153    116      166    152
      Std         35     37       83     51
      N           22              17
24    Mean        199    115      161    143
      Std         61     34       46     49
      N           25              28
Abstract
We report auditory observations on /r/ approximations produced by 11 Swedish 2-yearolds. About half of the 1291 expected instances
of /r/ were either realized as vocoids or just
dropped. Most of the contoid realizations were
approximants or fricatives whereas taps, flaps,
trills, laterals, nasals and stops occurred marginally. This was roughly consistent with the
phonetic norms for the ambient language. The
most frequently occurring places of articulation
were coronals, palatals and, to some extent,
glottals. Some of this place variation could be
explained in terms of number of attempted
word types suggesting that both vocabulary
size and ambient-language-like /r/ productions
constitute different aspects of linguistic maturity in young children.
Introduction
Rhotics (r sounds) are well known for their unusually wide range of variation in terms of
manner and place of articulation (e.g., Lindau
1985, Ladefoged and Maddieson 1996, Demolin 2001). More or less common variants are
voiced or voiceless vocoids, approximants,
fricatives, trills, taps and flaps produced at various places of articulation. Whereas the rhotics
tend to occupy the liquid slot adjacent to the
syllable nucleus and, thus, have a common distribution in terms of the sonority hierarchy
(Jespersen 1904), they clearly lack an invariant
acoustic basis (such as a lowered F3; see, e.g.,
Lindau 1985). To be sure, the rhotics may be
phonetically related in terms of a Wittgensteinian family resemblance such that /r/1 resembles /r/2 that resembles /r/3 and so on up to
/r/n; however, /r/1 and /r/n may not have a single
phonetic property in common (Lindau 1985).
But even though the family resemblance metaphor may characterize relationships between
category members in an interesting way, it does
not serve to delimit the category in the first
place.
The apparent lack of unity among the rhotics is bound to cause trouble on children's path
to spoken language. However, our knowledge
Methods
The subjects used in this study were 11 normally developing Swedish 24-monthers, 6 girls
and 5 boys. The children were drawn from a
larger group of approximately 60 Swedish and
60 American English children at the ages of 6,
12, 18, 24 and 30 months. Subjects and recordings were described in detail in Engstrand
et al. (2003). In summary, audio and video recordings were made in nursery-like, sound-treated rooms in Stockholm and Seattle. All
children were accompanied by a parent (usually
the mother). Both parents were representative
of the regional standard spoken in the Stockholm and Seattle areas, respectively. The audio
and video signals were subsequently digitized
and stored on DVD disks.
Results
Out of the 1291 expected instances of /r/, 613
(47 percent) were realized as contoids. Whether
the remaining /r/s had a vocoid realization or
were just dropped was often hard to determine
reliably. At a rough estimate, however, approximately 10 percent were vocoids and 43
percent were dropped. The distribution of manners and places of articulation for the contoid
realizations is shown in table 1.
Table 1. Distribution of manners and places of articulation across the subject group (percent of all contoid
/r/ realizations, N=613).
                 Approx   Fricative   Tap or flap
Labio-dental      0.0       0.3         0.0
Dental/alveolar  23.2       5.2         4.2
Retroflex         5.7      15.3         0.0
Palatal          25.8       0.8         0.0
Velar/uvular      0.2       1.3         0.0
Glottal           0.0       9.3         0.0
Total            54.8      32.3         4.2

[The Lat. approx, Stop, Nasal, Trill and row-total columns could not be reliably recovered from the source.]
Table 2. Mean percentages, number of subjects and ranges of variation for the respective manner/place
combinations. Phonetically unlikely or impossible sound types are marked with an asterisk.
                      Approx    Fricative   Tap/flap   Lat. appr  Stop      Nasal     Trill
Labio-     Mean       0.0       0.4         0.0        *0.0       0.0       0.0       *0.0
dental     # subj.    0         1           0          0          0         0         0
           Range      0         0           0          0          0         0         0
Alveolar/  Mean       12.6      5.4         2.0        3.7        1.4       6.8       0.1
dental     # subj.    8         5           3          7          4         3         2
           Range      2.3-4.2   2.4-40.9    2.4-16.7   1.6-7.9    1.1-6.7   3.3-66.7  0.5-0.8
Retroflex  Mean       2.1       5.5         0.0        0.0        0.5       0.0       0.1
           # subj.    4         2           0          0          1         0         1
           Range      1.7-17.6  28.3-32.1   0          0          0         0         0
Palatal    Mean       34.8      1.0         0.0        0.0        0.0       0.0       *0.0
           # subj.    9         2           0          0          0         0         0
           Range      0.5-71.1  1.7-9.8     0          0          0         0         0
Velar/     Mean       0.2       1.5         0.0        0.0        0.0       0.0       0.0
uvular     # subj.    1         2           0          0          0         0         0
           Range      0         7.3-9.1     0          0          0         0         0
Glottal    Mean       *0.0      16.3        *0.0       *0.0       5.4       *0.0      *0.0
           # subj.    0         10          0          0          7         0         0
           Range      0         0.8-83.3    0          0          0.8-33.3  0         0

*Unlikely or impossible sound types.
lars and retroflexes), glottals and dorsals (palatals, velars and uvulars), respectively. The
straight lines, which are linear approximations to the data points for the respective
types, suggest an increase in the number of
coronal /r/ realizations as a function of the
number of produced word types (r = 0.70). In
contrast, the glottal realizations exhibit the
opposite, negative trend (r = -0.63). For the
dorsals, the effect is negligible (r = -0.18). In
this sense, then, children who displayed a larger /r/ vocabulary seemed to conform phonetically to the ambient language to a higher
degree than did children with a smaller /r/ vocabulary.
Acknowledgments
This work was supported by grant 2003-8460-14311-29 from the Swedish Research
Council (VR) to O. Engstrand.
Notes
1. Names in alphabetical order.
References
Demolin D. (2001) Some phonetic and phonological observations concerning /R/ in
Belgian French. In H. Van de Velde and
R. van Hout (eds.), r-atics. Sociolinguistic, phonetic and phonological characteristics of /r/. Etudes & Travaux, Institut des
langues vivantes et de phonétique, Université Libre de Bruxelles, 63-73.
Engstrand O., Williams K. and Lacerda F.
(2003) Does babbling sound native? Listener responses to vocalizations produced
by Swedish and American 12- and 18-month-olds. Phonetica 60, 17-44.
Fox A.V. and Dodd B.J. (1999) Der Erwerb
des phonologischen Systems in der
deutschen Sprache. Sprache-Stimme-Gehör 23, 183-191.
Jespersen O. (1904) Phonetische Grundfragen. Leipzig and Berlin: Teubner.
Ladefoged P. and Maddieson I. (1996) The
Sounds of the World's Languages. Oxford: Blackwell.
Lindau M. (1985) The story of /r/. In Fromkin
V.A. (ed) Phonetic Linguistics: Essays in
Honor of Peter Ladefoged, 157-168. Orlando: Academic Press.
Muminovic D. and Engstrand O. (2001) /r/ in
some Swedish dialects: preliminary observations. Working Papers (Dept. of
Linguistics, Lund University) 49, 120-123.
Vihman M.M. (1993) Variable paths to early
word production. Journal of Phonetics 21,
61-82.
Conclusions
Auditory observations on 11 Swedish 2-year-olds have shown a high degree of variation in
the phonetic realization of /r/. On the whole,
however, approximants and fricatives constituted the dominating manners of articulation,
whereas taps, flaps, trills, laterals, nasals and
stops occurred marginally. This is roughly in
accordance with the phonetic norms for the
ambient language (cf. Muminovic and Engstrand 1991 for similar patterns in a number
of Swedish dialects). The most frequent
places were coronal, palatal and, to some extent, glottal. The glottal articulations were
counter to expectations since they are foreign
to central Swedish. So are [j]-like /r/ realizations, but these were nevertheless expected
from informal observations. Some of the
place variation could be explained in terms of
vocabulary size in the sense of number of
attempted /r/ words. It may be that vocabulary
Abstract
F0 measurements were made of disyllabic
words produced by several Swedish and American English 18- and 24-month-olds. The Swedish 24- and 18-monthers produced accent contours that were similar in shape and timing to
those found in adult speech. The Swedish 18-monthers, however, produced very few words
with the acute accent. It is concluded that most
Swedish children have acquired a productive
command of the word accent contrast by 24
months of age and that, at 18 months, most children display clear tonal ambient-language effects. The influence of the ambient language is
evident in view of the F0 contours produced by
the American English children whose timing of
F0 events tended to be intermediate between the
Swedish grave and acute contours. The relative
consistency with which grave accent contours
were produced by the Swedish 18-monthers
suggests that some children are influenced by
the ambient language well before that age.
Methods
Subjects were drawn from a larger group of approximately 60 Swedish and 60 American English children at the ages of 6, 12, 18, 24 and 30
months. Audio and video recordings were made
as described in Engstrand et al. (2003). The
present study is based on recordings of
11 Swedish 24-month-olds (6 girls, 5 boys)
13 American 24-month-olds (6 girls, 7
boys)
11 Swedish 18-month-olds (6 girls, 5 boys)
16 American 18-month-olds (9 girls, 7
boys),
i.e., a total of 51 children, including the 24-monthers used in Engstrand and Kadin (2004).
All disyllabic words with stress on the first syllable were analyzed according to criteria described in Engstrand and Kadin (2004). F0 was
measured at five points in time: at 1) the acoustic onset of V1, 2) the F0 turning-point in V1 (if
the F0 contour was monotonic throughout the
vowel, the turning-point was assigned the value
of the onset), 3) the acoustic offset of V1, 4) the
acoustic onset of V2, and 5) maximum F0 in V2
(if F0 declined throughout the vowel, maximum
F0 was assigned the value of the onset). A Fall
parameter was defined as the F0 difference between V1 turning-point and offset, and a Rise
parameter was defined as the F0 difference between V2 maximum and V1 offset. All measurements were made using the Wavesurfer program package.
Introduction
Swedish has a contrast between a grave and
an acute tonal word accent. The acute accent
is associated with a simple, one-peaked F0 contour. The grave accent typically has a two-peaked F0 contour with a fall on the primary
stress syllable and a subsequent rise towards a
later syllable in the word (Bruce 1977, Engstrand 1995, 1997).
A preliminary report on Swedish children's
acquisition of the word accents was presented
in Engstrand and Kadin (2004). The results,
which were based on 6 Swedish and 6 American English 24-month-olds, suggested that, at
that age, Swedish children are well on the way
to establishing a productive command of the
accent contrast. The present study was carried
out to test this preliminary conclusion using
additional subjects. In addition, previous studies (Engstrand et al. 1991, Engstrand et al.
2003) have suggested that most 17-18-month-olds display a much less consistent use of the
Results
Auditory judgments first suggested that a majority of the words produced by the Swedish
children (both 18- and 24-month-olds) had a
grave-like tonal contour and that, in general,
Measurement results are summarized in tables 1-5 (a full statistical treatment will be reported elsewhere). The tables present means
and standard deviations for the Fall and Rise
parameters. The bottom line of each table
shows grand means and standard deviations. In
the left column, SW and AM represent Swedish
and American English, respectively, 18 and 24
indicate the respective ages in months, and F
and M stand for sex (female or male). The last
figure is a reference number that identifies the
individual child. Thus, for example, SW24F1
stands for a Swedish 24 months old girl with
the reference number 1.
GRAVE ACCENT (Swedish 18-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
SW18F1       69       62     102      121     21
SW18F2       60       29     69       40      22
SW18F3       62       45     70       66      19
SW18F4       113      88     93       129     26
SW18F5       72       41     99       124     23
SW18F6       77       38     127      125     12
SW18M1       102      141    392      286     2
SW18M2       59       42     70       84      23
SW18M3       101      122    106      136     23
SW18M4       70       59     42       78      21
SW18M5       60       35     50       44      23
Grand mean   76       64     111      112     215

GRAVE ACCENT (Swedish 24-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
SW24F1       41       44     139      180     2
SW24F2       59       41     94       67      14
SW24F3       30       28     1.0      65      4
SW24F4       64       17     33       24      4
SW24F5       114      107    52       219     7
SW24F6       27       18     70       98      2
SW24M1       43       46     58       85      17
SW24M2       23       38     45       44      4
SW24M3       39       33     38       134     6
SW24M4       35       29     83       34      13
SW24M5       80       93     -3.0     27      2
Grand mean   50       45     55       89      75
Swedish grave words produced by the 24-month-olds consistently displayed positive values for both the Fall and the Rise parameters
(table 1). This means that 1) F0 declined from a
turning-point in the primary stress vowel reaching a relatively low value at the end of that
vowel, and 2) rose to resume a relatively high
position in the second vowel resulting in a
two-peaked F0 contour.
Acute productions by the Swedish 18-monthers were too few to provide a basis for
reliable generalizations. However, parameter
values tended to differ from those for the grave
productions and to resemble those pertaining to
the 24-monthers.
ACUTE ACCENT (Swedish 24-month-olds)
Child        N
SW24F1       0
SW24F2       6
SW24F3       7
SW24F4       3
SW24F5       9
SW24F6       0
SW24M1       0
SW24M2       7
SW24M3       0
SW24M4       5
SW24M5       0
Total        37

[Fall (Hz) and Rise (Hz) means and standard deviations for the acute productions garbled in the source.]

AMERICAN ENGLISH (18-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
AM18F1       41       68     -47      67      5
AM18F2       19       27     84       171     6
AM18F3       0               -11              1
AM18F4       39       49     -33      190     10
AM18F5       30              52               1
AM18F6       27       24     -48      102     12
AM18F7       28       22     -27      18      4
AM18M1       11              159              1
AM18M2       36       36     24       73      8
AM18M3       36       25     25       83      10
AM18M4       44       78     4.8      79      5
Grand mean   28       41     17       98      63
The above tables have shown differences between accent types and ambient languages in
terms of the Fall and Rise parameters. Up to
this point, timing has been disregarded. However,
timing of F0 events in relation to segmental
structure is crucial as illustrated in figure 1. The
figure shows mean data for the Swedish and
American English children in both age groups
(symbols are explained in the figure legend).
Grave and acute productions are shown for the
Swedish 24-monthers. F0 values are time-aligned to the first measurement point, and the
data points are connected by smoothed lines
that bear a certain resemblance to authentic F0
contours. The measurement points correspond
to acoustic events as described above.
AMERICAN ENGLISH (24-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
AM24F1       69       82     -53      175     12
AM24F2       58       41     -4.5     45      16
AM24F3       33       23     -2.8     57      13
AM24F4       55       71     -80      81      15
AM24F5       32       31     -28      57      13
AM24F6       51       24     -2.9     76      7
AM24M1       20       27     -6.4     58      16
AM24M2       50       75     5.6      27      16
AM24M3       29       36     -40      82      16
AM24M4       93       149    -0.71    90      7
AM24M5       26       21     31       45      10
AM24M6       27       26     2.6      47      9
AM24M7       34              17               1
Grand mean   45       51     -10      70      151
months, many Swedish children begin to produce grave-like F0 contours and to mark the appropriate words with these contours. Engstrand
et al. (2003) reached a similar conclusion on
the basis of listening tests. Based on those studies as well as preliminary analyses of the present material, Engstrand and Kadin (2004) hypothesized that acquisition of the Swedish tonal
word accents typically takes place in the 18-24
months age interval. However, the relative consistency with which grave accent contours were
produced by the present 18-monthers would
suggest that some children are influenced by
the ambient language well before this age. This
is in agreement with results of listening tests
suggesting occasional grave-like tone contours
as early as at 12 months of age (Engstrand et al.
2003).
Acknowledgment
This work was supported by grant 2003-8460-14311-29 from the Swedish Research Council
(VR) to O. Engstrand.
References
Bruce G. (1977) Swedish Word Accents in
Sentence Perspective. Lund: Gleerup.
Engstrand O. (1995) Phonetic interpretation of
the word accent contrast in Swedish. Phonetica 52, 171-179.
Engstrand O. (1997) Phonetic interpretation of
the word accent contrast in Swedish: Evidence from spontaneous speech. Phonetica
54, 61-75.
Engstrand O., Williams K. and Strömqvist S.
(1991) Acquisition of the tonal word accent
contrast. Actes du XIIème Congrès International des Sciences Phonétiques, Aix-en-Provence, vol. 1, pp. 324-327.
Engstrand O., Williams K. and Lacerda F.
(2003) Does babbling sound native? Listener
responses to vocalizations produced by
Swedish and American 12- and 18-month-olds. Phonetica 60, 17-44.
Engstrand O. and Kadin G. (2004) F0 contours
produced by Swedish and American 24-month-olds: implications for the acquisition
of tonal word accents. Proceedings of the
Swedish Phonetics Conference held in Stockholm 26-28 May 2004, pp. 68-71.
Abstract
Previous investigations have proposed that
nasality in consonants is more perceptually stable than place of articulation in
constrained conditions. This paper investigates the progression of initial consonant
clusters from a reduced to an adult-like
form in terms of manner and place of articulation in the speech of children between the ages of 1;6 and 3;5. The results
show an earlier onset of stable production
of manner compared to place, both in
full clusters and in the reduced form. The
results are interpreted as evidence of the
importance of perceptual salience of segmental properties in the acquisition of initial
consonant clusters.
Introduction
Initial sC clusters occur frequently in
spontaneous Swedish (Bannert & Czigler
1999) and are therefore a predominant
feature of the ambient language of children learning Swedish. Previous reports
concerning children's productions of
word-initial sC clusters have
shown that early output forms of the second consonant of the cluster may have a
deviating phonetic quality compared to
the adult model form (see McLeod, van
Doorn and Reed (2001) for a summary of
discovered trends in consonant cluster acquisition).
In clusters with a plosive as the second
element, the reduced form may involve a
change in place of articulation caused by
the application of a hypothesized fronting
rule. Non-plosive consonants may, in addition, be substituted by a consonant with
a different manner of articulation compared to the target consonant, e.g.
through application of a stopping process.
For adult speakers, the articulatory features of place and manner of articulation
have been shown in the literature to be
correlated regarding their perceptual stability. In a study the perceptual response
Based on the tabulated progressions of
each target onset, the productions investigated in this study were extracted according to two criteria: 1) that the initial output forms produced by the child were not
produced in an adult-like manner in
terms of the feature set of the consonant,
and 2) that a progression should be observed in the data in terms of place or
manner of articulation. As a result of
these criteria, productions made by seven
subjects, three female and four male, were
selected for further analysis. The age
ranges of the investigated subjects at the
time of recording are presented in table 1.
Age of acquisition of stable and adult-like production was determined separately for the articulatory features place and
manner as well as for the complexity of
the syllable onset. Furthermore, the age of
adult-like production of the second element of the target cluster (stop or nasal
consonant) was established for place and
manner of articulation. For all investigated features, onset of a stable production
was determined to be the session at which
four out of the following five productions
were produced with the same value in the
investigated features.
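The four-out-of-five criterion can be sketched as a scan over the session-ordered productions (a hypothetical illustration; the function name and data are ours, not from the paper):

```python
def stable_onset(values, target, window=5, required=4):
    """Index of the first production from which at least `required` of the
    next `window` productions (inclusive) carry the target feature value."""
    for i in range(len(values) - window + 1):
        if sum(v == target for v in values[i:i + window]) >= required:
            return i
    return None

# Hypothetical place-of-articulation labels for successive productions.
places = ["dental", "velar", "dental", "dental", "velar",
          "dental", "dental", "dental", "velar", "dental"]
print(stable_onset(places, "dental"))  # prints: 2
```

The returned index marks the production at which the stability criterion is first met; `None` means the feature never stabilized within the sampled sessions.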
Method
Speech material
The data set investigated in the present
paper was extracted from a corpus consisting of 5311 productions collected in
order to investigate the development in
output forms of word-initial consonant
clusters in Swedish children between the
ages of 1;6 and 3;6. In this corpus, recordings were conducted on a monthly basis
in a sound-treated recording studio. The
target words were elicited by the accompanying adult using black-and-white picture prompts.
Procedure
A narrow transcription of the productions was subsequently produced by the
author. The transcribed segment labels
were then substituted by a feature vector
describing the segment in terms of articulatory features, including place and manner of articulation. Consonant segments
in the onset of the first syllable of the production were marked by their position in
the onset, and the progression of the target
words stå ([stʰo:]), snor ([sno:r]) and skal
([skA:l]) was tabulated according to target
word and subject's age. 159 productions
of skal, 132 productions of snor and 198
productions of stå were included in the material.
Table 1. First and last recording session for each subject.

Speaker    First session    Last session
F1         105              151
F2         109              158
F3         77               128
M1         79               130
M2         124              178
M3         90               142
M4         84               129
Results
The ages at which a stable production of
place and manner in the C consonant, as
well as in the full sC cluster, was reached
are presented in figure 1. For subjects F1, F2 and M3,
place and manner of articulation were stable in singleton consonants from the onset of sampling. The bottom circles of F1
[Two tables: sessions to stable production of manner and place for the clusters of skal, stå and snor, per speaker (F1-F3, M1-M4); cell values garbled in the source.]
Acknowledgements
The author would like to thank the children who participated in this study and
the children's parents for bringing the
children to the recording studio and for
participating in the elicitation of the target
words.
References
Bannert, R. and Czigler, P. E. (1999) Variations in consonant clusters in Standard
Swedish. PHONUM 7.
McLeod, S., van Doorn, J. and Reed, V. A.
(2001) Consonant Cluster Development in Two-Year-Olds: General
Trends and Individual Differences.
Journal of Speech, Language and Hearing Research 44, 1144-1171.
Miller, J. L. and Eimas, P. D. (1977) Studies
on the perception of place and manner
of articulation: A comparison of the
labial-alveolar and nasal-stop distinctions. Journal of the Acoustical Society
of America 61(3), 835-845.
Singh, S. (1971) Perceptual similarities and
minimal phonemic differences. Journal
of Speech and Hearing Research 14,
113-124.
Singh, S. and Black, J. W. (1966) Study of
twenty-six intervocalic consonants as
spoken and recognized by four language groups. Journal of the Acoustical
Society of America 39, 372-387.
Abstract
Aim
The aim of the project is to study the pronunciation problems of the students. The following
questions, among others, will be answered:
(1) What role does the first language (L1) play
in the learning of the target language's pronunciation? Special concern will be given to
each learner's dialect or regional variety of the
standard languages.
(2) What role does the pronunciation of the
second or third language play?
(3) Which phonological and phonetic targets
are easy and which are difficult, given the
structural similarities between the two languages?
(4) Which interplay between the various difficulties is to be discovered? Which implications
can be observed?
The answers to these questions will constitute the scientific basis on which new and well-adapted learning material can later be developed by didactic and pedagogical experts for both language groups.
Introduction
In research on adult second language learning it is agreed that, in the area of phonology and phonetics, a clear negative transfer (interference) can be observed, due to the influence of the first language (L1). However, this is not the only reason for foreign accent; interlanguage, too, plays an important role. Only recently has research paid attention to the special case of learning in a multiple language setting.
A few years ago, beginner courses in German were introduced at the academic level in
Sweden. In the German-speaking countries, due
to a long tradition, beginner courses in Swedish
attract many students. For both groups, the
teaching of pronunciation is allotted only a
small amount of time. As a consequence of this,
the learners' target language is characterized by
a strong accent which in most cases becomes
fossilised. In order to prevent this, a pronunciation programme should be constructed that is
tailored just for the special preconditions of the
learners, namely their first language (L1) and
the multiple contexts: all learners have already
learned at least one foreign language. An extraordinary challenge lies in the fact that German
and Swedish are linguistically very close to
each other. Therefore it should be rather easy to
help the learners to a good pronunciation right
from the start.
Research situation
Research on second language learning has centred on the question of whether the first language affects the target language. Experience tells us that transfer and interference do occur where pronunciation, i.e. the learning of phonology and phonetics, is concerned. While the literature abounds with studies of syntax and morphology, the learning of pronunciation has not been studied to the same extent. Hammarberg has studied certain aspects of Swedish as a second language (1985, 1988, 1997). He (1993) made a study of third language acquisition, investigating his co-author. Since the middle of the seventies, Bannert has done research on several aspects of learning Swedish pronunciation (1979a, b; 1980, 1984, 1990) and on the German sound system and prosody (1983, 1985, 1999).
A large and extensive survey "Optimizing
Swedish Pronunciation" (Bannert 2004) was
carried out in the late seventies in Lund. Swedish being the target language, the pronunciation
difficulties of 70 adult learners representing 25
first languages were studied. The survey also included German, represented by three learners from different regions: Northern Germany, Bavaria and Switzerland.
Consonants
Swedish: voiceless fricatives [ɕ, ɧ] tjugo (twenty), sju (seven); retroflexes [ʈ, ɖ, ʂ, ɳ, ɭ] fort (fast), bord (table), mars (March), barn (child), Karl (Charles).
German: voiceless fricatives [ʃ, ç] Schuh (shoe), mich (me); glottal stop [ʔ] Theater (theatre).
Due to a grant from Vetenskapsrådet, it was possible to conduct several initial pilot studies for the project. Recordings of students in Umeå and Freiburg were made and analysed. Students were interviewed about their introspection of their pronunciation difficulties, and think-aloud protocols were written. A demo version of the database (www.ling.umu.se/FIST) was programmed, showing the labels to be used. Socio- and psycholinguistic background variables were collected. Thus the project rests on secure ground.
Prosody
Swedish: two word accents (acute, grave): 'buren (the cage) - `buren (borne); focus accent manifested separately; complementary length of stressed vowel and consonant; stress pattern (speech rhythm).
German: short consonants, word accentuation, stress pattern (speech rhythm).
Phonological processes
Swedish: retroflexation of [r] + [t, d, s, n, l] across morpheme and word boundaries: mer tid (more time), har du (do you have), när som (whenever), har nu (have now).
German: final devoicing of [b, d, g] to [p, t, k]: Sieb (strainer), Rad (wheel), Steig (path); initial [s] to [z]: See (lake); [s] to [ʃ] in initial consonant clusters: Stein (stone), springen (jump); voicing of medial [s] to [z]: lesen (read); deletion of unstressed [e] in the endings -el, -en: Himmel (heaven), Zeiten (times); vocalisation of postvocalic [r] to [ɐ]: Wasser (water); assimilation of the place of articulation of [n] to [m, ŋ] after deletion of unstressed [e]: Lippen (lips), Banken (banks).
Theoretical approach
From long experience we know that phonological transfer is typical of the language learning
processes, especially of adult learners. This
characteristic phenomenon of foreign accent is
caused by the phonological system, including
orthography, of L1. However, with our student
groups, interferences from L2 and L3 must also
be responsible for the deviating pronunciation.
Furthermore, contributions of the learners' interlanguage (Selinker 1972) are to be expected.
Therefore each deviating feature in the performance of the students will be coded. Each
deviating sample in the speech signal will be
labelled according to these codes. Thus it is
easy to cross-search the whole material and carry out a thorough inspection and statistical analysis of the observations. This will allow us to make quantified statements about the learning processes.
Grapheme-Phoneme Relationships
Swedish: <o> signifies [u] and [o]: skola (school), sova (sleep); <å> always signifies [o]: måla (paint), åtta (eight); <g, k, sk> are palatalised to [j, ɕ, ɧ]: gift (poison), källa (well), skinka (ham); the first consonant letter is not pronounced in <dj-, gj-, hj-, lj->: djup (deep), gjuta (pour), hjul (wheel), ljuga (lie).
German: <o> always signifies [o]: Sohn (son), Sonne (sun); <z> signifies [ts]: Zahn (tooth); final <b, d, g> become [p, t, k] (final devoicing): Sieb (strainer), Rad (wheel), Steig (path).
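The German final-devoicing rule just listed can be illustrated as a toy string rule. This is a sketch covering only word-final <b, d, g>, not a full grapheme-to-phoneme converter, and the function name is ours:

```python
# Toy sketch of German final devoicing: word-final <b, d, g> are
# pronounced [p, t, k]. Covers only this single rule.
DEVOICE = {"b": "p", "d": "t", "g": "k"}

def final_devoice(word: str) -> str:
    """Return the word with its final obstruent devoiced, if applicable."""
    if word and word[-1].lower() in DEVOICE:
        return word[:-1] + DEVOICE[word[-1].lower()]
    return word

# Examples from the text: Sieb -> Siep, Rad -> Rat, Steig -> Steik
```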
Contrastive aspects
The phoneme systems of vowels and consonants, as well as the phonological processes, are rather similar in the two languages; however, prosody and the grapheme-phoneme relationships show some differences. Only the salient differences will be pointed out.
Vowels
Swedish: long [ɑː] gata (street) and [ʉː] duk (cloth); short [ɵ] hundra (hundred).
German: lax and short [ɪ, ʏ, ʊ] Mitte (middle), Hütte (hut), Mutter (mother); long [aː] Vater (father); diphthong [aʊ] Haus (house).
Pronunciation norms
The impression of foreign accent is, to the greatest extent, caused by segmental and prosodic deviations from the pronunciation norm of the target language. This is spoken with parts of the
Preliminary results
A representative choice of deviations for each group is shown in the following tables. Group results are presented according to their frequencies of appearance. Together with the code number and the frequency of appearance of each deviation, the target symbol and its replacement (the deviation) are given.
Swedish
[Table: for each Swedish deviation, the code number, frequency of appearance, and target symbol with its replacement; the multi-column layout was not preserved.]
Coding system
Each pronunciation deviating from the norms defined above, whatever its cause, is labelled by a special mark, a code number, separately for each language. Code numbers are listed for different areas of interest: vowels, single consonants, consonant clusters, phonological processes, prosody, grapheme-to-phoneme relationships and use of first language forms. Although a number of deviations are identical in both languages, language users show a great variety of different labels. Most of the observed code numbers and their labels are presented below (results). The coding system allows different statistical treatments of the data, especially the quantification of deviations. Thus it is possible to calculate each learner's profile of pronunciation difficulties, as well as profiles for each kind of material (read-aloud texts, descriptions of pictures and a narrative), for the male and female learners as subgroups, and finally for all the learners together. This will be an important aspect and the basis for the construction of a stand-alone programme for the learning of pronunciation.
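The profiling described above reduces to counting labelled deviations per learner and material type. A minimal sketch follows; the code numbers use the paper's S-prefix scheme, but the observations and their pairings with learners are invented for illustration:

```python
from collections import Counter

# Hypothetical labelled observations: (learner, material, code number).
# Code numbers follow the paper's S-prefix scheme; the data are invented.
observations = [
    ("A", "read", "S114"), ("A", "read", "S114"), ("A", "picture", "S308"),
    ("B", "read", "S114"), ("B", "narrative", "S501"),
]

def profile(learner, material=None):
    """Frequency profile of deviation codes, optionally per material type."""
    return Counter(code for who, mat, code in observations
                   if who == learner and (material is None or mat == material))

print(profile("A"))           # Counter({'S114': 2, 'S308': 1})
print(profile("A", "read"))   # Counter({'S114': 2})
```

The same tally, run over all learners or over one kind of material, yields the group profiles the text describes.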
German
[Table: for each German deviation, the code number, frequency of appearance, and target symbol with its replacement; the multi-column layout was not preserved.]
References
Bannert Robert. 1979a. Ordstruktur och prosodi. In: Svenska i invandrarperspektiv, pp. 132-173. Hyltenstam Kenneth (ed.). Lund.
Bannert Robert. 1980. Phonological strategies in the second language learning of Swedish prosody. PHONOLOGICA 1980, pp. 29-33. Dressler W.U., Pfeiffer O.E. and Rennison J.R. (eds). Innsbruck.
Bannert Robert. 1984. Prosody and intelligibility of Swedish spoken with a foreign accent. Acta Universitatis Umensis 59, pp. 8-18. Elert Claes-Christian (ed.).
Bannert Robert (with Johannes Schwitalla). 1999. Äußerungssegmentierung in der deutschen und schwedischen gesprochenen Sprache. Deutsche Sprache. Zeitschrift für Theorie und Praxis 4, pp. 314-335.
Bannert Robert. 2004. På väg mot svenskt uttal (including CD-ROM). Lund: Studentlitteratur.
Duden. 2001. Aussprachewörterbuch. Mannheim: Dudenverlag.
Hedelin Per. 1997. Norstedts svenska uttalslexikon. Stockholm: Norstedts.
Selinker Larry. 1972. Interlanguage. International Review of Applied Linguistics 10, 209-231.
Abstract
A comparative analysis of new dialect data on word accents in Dalarna and accent contours published by Meyer (1937, 1954) revealed differences indicating a change, primarily in the realization of the grave accent. The change, a delayed grave accent peak, is tentatively seen as a result of a north-westward spread of word accent patterns formerly characterizing dialects of south-eastern Dalarna.
Background
A pilot study
Method
New recordings were made of speakers between 20 and 50 years of age, all having lived in Leksand and Rättvik for most of their lives. Data were collected from both female and male speakers, in total 11 from Leksand and 13 from Rättvik.
The speakers were recorded in their homes (or at work or school). The material consisted of two words produced in isolation, Polen /'po:len/ (Poland, acute) and pålen /`po:len/ (the pole, grave). They were elicited in random order (together with other words, not reported on here) by means of cards with the respective words written on them. Each speaker produced at least five repetitions of each word, but some produced as many as eight or even more.
Digitized versions of the material were
analyzed and the location of the f0-peak was
measured (in msec) relative to the VC
boundary. A position of the peak before and
after the boundary resulted in negative and
positive values, respectively. In addition to
absolute durations, percentages were calculated
(the distance (in msec) of the peak from the VC
boundary relative to the duration of the
segment before or after the boundary) to
neutralize speaking rate variation. Peak
positions were sometimes difficult to identify;
many contours had plateaus rather than peaks,
and laryngealizations and other voice quality
features added to the problems. Dubious cases
were therefore eliminated and the reported data
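The timing measure described here is straightforward to compute. In the sketch below (variable names are ours, not the authors'), a peak before the VC boundary gives a negative value, and the percentage normalizes by the duration of the segment in which the peak falls:

```python
def peak_timing(peak_ms, boundary_ms, vowel_dur_ms, cons_dur_ms):
    """Peak position relative to the VC boundary, in ms and in per cent.

    Negative values: peak before the boundary (within the vowel);
    positive values: peak after the boundary (within the consonant).
    """
    dist = peak_ms - boundary_ms
    seg = vowel_dur_ms if dist < 0 else cons_dur_ms
    return dist, 100.0 * dist / seg

# An acute-like token: peak 60 ms before a boundary at 200 ms,
# with a 200 ms vowel and a 150 ms consonant
print(peak_timing(140.0, 200.0, 200.0, 150.0))  # (-60.0, -30.0)
```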
Results
Table 2 shows the individual results (absolute mean durations and standard deviations) for the two target words produced by the Leksand and Rättvik speakers. (The number of tokens of each word analyzed for each speaker varied between 5 and 14.) Apart from individual durational differences, the pattern is the same for all but one of the speakers: the acute word has its peak located before, and the grave word after, the VC boundary. (Although measurements were made of peak positions both in terms of absolute durations and percentages relative to the VC boundary, only absolute durations are reported here, as very similar patterns resulted from the two types of measurement.)
Table 2. Timing of grave and acute accent peaks (means and standard deviations) for 5 Leksand (L1-L5) and 6 Rättvik (R1-R6) speakers. Negative values represent peaks before, and positive values peaks after, the VC boundary.
[Table 2 data: per-speaker mean peak timings and standard deviations for speakers L1-L5 and R1-R6; the column layout was not preserved.]
Conclusions
A comparative analysis of present-day dialect data on word accents in Dalarna and accent contours published by Meyer (1937, 1954) has revealed differences indicating a change in the realization of the grave accent. This change, a delayed grave-accent peak, is tentatively seen as a result of a north-westward spread of accent patterns formerly characterizing dialects of south-eastern Dalarna. Clearly, however, this assumption has to be confirmed by extending the material for analysis.
Acknowledgements
We are grateful to Olle Engstrand and Gunnar Nyström for allowing us to include figure 2 in
this study. This work has been supported by a
grant from the Bank of Sweden Tercentenary
Foundation, 1997-5066.
Notes
1. This volume was published posthumously.
References
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody, 219-228. Department of Linguistics, Lund University.
Engstrand O. and Nyström G. (2002) Meyer's accent contours revisited. TMH-QPSR 44, 17-20.
Fransson L. (2004) Fyra daladialekters ordaccenter i tidsperspektiv: Leksand, Rättvik, Malung och Grangärde. Thesis work in phonetics, Umeå University.
Gårding E. (1977) The Scandinavian word accents. Malmö: CWK Gleerup.
Gårding E. and Lindblad P. (1973) Constancy and variation in Swedish word accent patterns. Working Papers 7. Phonetics Laboratory, Lund University.
Meyer E. A. (1937) Die Intonation im Schwedischen I. Stockholm: Fritzes förlag.
Meyer E. A. (1954) Die Intonation im Schwedischen II. Uppsala: Almqvist & Wiksell.
Abstract
The paper addresses the issue of extraction of
implicit information conveyed by systematic
audio-visual contingencies. A group of adult
subjects was tested on a simple inference task
provided by short film sequences. The video
materials were encoded and submitted to processing by two neural networks (NN) that simulated the results of the adult subjects. The results indicated that the adult subjects were extremely efficient at picking up the underlying
information structure and that the NN could
also perform acceptably on both classification
and generalization tasks.
Introduction
Language acquisition can be described as a process through which infants derive the underlying linguistic structure of their ambient language. In spite of the complexity and variability of the language input, it is an undeniable fact that within about two years of life typical infants are able to pick up the linguistic regularities of the ambient language. Making sense of linguistic information that is implicitly conveyed in a diversity of speech communication situations appears to be such an insurmountable task that researchers are prone to consider that some sort of initial guidance is necessary to home in on the ambient language's underlying principles (Chomsky, 1968; Pinker, 1994). The present paper attempts to challenge this established view by sketching a scenario in which linguistic information may be derived in the absence of pre-knowledge or dedicated linguistic biases. Indeed, language can be seen as an emergent consequence of the interplay between the infant and its environment, where the richness and structure of the sensory flow may contain enough information to trigger language development (Jusczyk, 1985; Elman, Bates, Karmiloff-Smith, Parisi, & Plunkett, 1997). More explicitly, the language acquisition hypothesis to be tested in this paper relies on the assumption that linguistic structure is implicit in the
Figure 2. Reference data provided by the adult subjects: percentage of correct discoveries of the meaning of the non-words representing the colours and the shapes of the objects. Only in two cases were errors made by the subjects.
Figure 3. Schematic architecture of the auto-association NN aimed at simulating a priori knowledge. The task of the NN was to reproduce its input at the output level. The number of units in each layer is indicated by the figures to the left of the rectangles.
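The auto-association task in Figure 3 (reproduce the input at the output level) can be sketched with a minimal one-hidden-layer autoencoder. The layer sizes, learning rate and random data below are illustrative stand-ins, not the authors' actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down auto-associative net: input -> tanh hidden -> linear output,
# trained by gradient descent to reproduce its own input.
n_in, n_hid = 16, 8
W1 = rng.normal(0.0, 0.1, (n_in, n_hid))
W2 = rng.normal(0.0, 0.1, (n_hid, n_in))
X = rng.integers(0, 2, (50, n_in)).astype(float)   # random binary patterns

def forward(X):
    H = np.tanh(X @ W1)
    return H, H @ W2

def recon_error(X):
    return float(np.mean((forward(X)[1] - X) ** 2))

before = recon_error(X)
lr = 0.5
for _ in range(500):
    H, Y = forward(X)
    d_out = 2.0 * (Y - X) / X.size           # gradient of the mean squared error
    d_hid = (d_out @ W2.T) * (1.0 - H ** 2)  # backpropagated through tanh
    W2 -= lr * H.T @ d_out
    W1 -= lr * X.T @ d_hid
print(before > recon_error(X))  # True: training reduces reconstruction error
```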
Generalization test items: Nela Duma (Red Cone), Lame Guma (Yellow Cylinder), Neme Dule (Blue Cube), Gale Bima (Green Circle).
Discussion
The ultimate goal of this study is to investigate how human infants might be able to extract implicit information.
To be sure, the stimuli used in this first experiment are likely to be too simple to fully demonstrate relevant language development relying
on general-purpose associative mechanisms.
Therefore our current experiments with infants
are being conducted using audio-visual contingencies that attempt to replicate ecologically
relevant communication settings.
Acknowledgements
This work was supported by grants from the
Swedish Research Council, the Bank of Sweden Tercentenary Foundation and Birgit & Gad
Rausings Foundation.
References
Chomsky N. (1968) Language and mind. New York: Harcourt Brace Jovanovich.
Elman J., Bates E., Karmiloff-Smith A., Parisi D., & Plunkett K. (1997) Rethinking innateness. Cambridge, Massachusetts: MIT Press.
Jusczyk P. (1985) On characterizing the development of speech perception. In Mehler J. & Fox R. (eds), Neonate cognition: Beyond the blooming, buzzing confusion. Hillsdale, New Jersey: Lawrence Erlbaum, 199-299.
Lacerda F. (2003) Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal 16, 41-59.
Lacerda F., Klintfors E., Gustavsson L., Lagerkvist L., Marklund E., & Sundberg U. (2004a) Ecological Theory of Language Acquisition. Proceedings of Epirob 2004 (Genova), 147-148.
Lacerda F. and Lindblom B. (1997) Modelling the early stages of language acquisition. In Olofsson Å. and Strömqvist S. (eds), Cross-linguistic studies of dyslexia and early language development. Office for Official Publications of the European Communities, 14-33.
Lacerda F., Marklund E., Lagerkvist L., Gustavsson L., Klintfors E., & Sundberg U. (2004b) On the linguistic implications of context-bound adult-infant interactions. Proceedings of Epirob 2004 (Genova), 149-150.
Pinker S. (1994) The Language Instinct: How the Mind Creates Language. New York: William Morrow and Company.
Figure 6. Results of the NN performance: percentage of correct generalizations of the meaning of the non-words representing the colours and the shapes of the objects. The results indicate that generalization of shapes was slightly more robust than generalization of colours.
Abstract
Our ability to estimate speaker age was investigated with respect to stimulus duration and type as well as speaker gender in four listening tests with the same 24 speakers but four different types of stimuli (ten and three seconds of spontaneous speech, one isolated word, and six concatenated isolated words). Results showed that the listeners' judgements were about twice as accurate as chance, and that stimulus duration and type affected the judgements. Moreover, stimulus duration affected the listeners' judgements of female speakers somewhat more, while stimulus type affected the judgements of male speakers more, indicating that listeners may use different strategies when judging female and male speaker age.
Introduction
Most of us are able to make fairly accurate estimates of an unknown speaker's chronological age from hearing a speech sample (Shipp and Hollien, 1969; Linville, 2001). This paper addresses the question of how much and what kind of speech information we need in order to make as good estimates of speaker age as possible.
Background and previous studies
In age estimation, the accuracy depends, among
other things, on the precision required and on
the duration and type of the speech sample
(prolonged vowel, read speech etc.). The less
acoustic information present in a speech sample, the more difficult the task, but even with
very little information, listeners are still not reduced to random guessing. Speaker and listener
characteristics, including gender, age group, the
speaker's physiological and psychological state,
and the listener's experience or familiarity with
similar speakers (dialect etc.) may also influence the accuracy (Ramig and Ringel, 1983;
Linville, 2001). Consequently, some speakers
may be more difficult to judge than others.
A considerable amount of research has been
devoted to the issue of age recognition from
speech (Ptacek and Sander, 1966; Huntley et
al., 1987; Braun and Cerrato, 1999; Linville,
The sum, mean and median values of the errors for all speakers in the four tests as well as
for the baseline are shown in Table 2. In all
tests, the listeners' judgements of women were
more accurate than those of men. The highest
accuracy was obtained for the female 10 second
stimuli (6.5), while the male 6 word stimuli received the lowest accuracy (15.3). Moreover,
the listeners tended to overestimate the younger
speakers, and to underestimate the older speakers.
Table 1. Test number, stimuli set, number of listeners (N), and gender and age distribution of the listener groups in the four tests. [Only the N and F columns were preserved.]

Test (stimuli)   N    F
1 (10 sec.)      31   18
2 (3 sec.)       33   22
3 (6 words)      37   33
4 (1 word)       37   33

Table 2. Sum, mean and median error values for all speakers in the four tests and for the baseline (BL). [Only the row for Test 1 was preserved.]

Test      sum     mean   median
1 (10s)   196.5   8.2    7.2
Results
Accuracy
Figure 1 displays the mean absolute error, i.e.
the average of the absolute difference between
perceived age (PA) and chronological age (CA)
Discussion
Despite the limited number of stimulus durations and types investigated, a few interesting
results were found. These are discussed below,
along with a few suggestions for future work.
Accuracy
The listeners performed significantly better
than the baseline estimator (about twice as well) in three of the tests, which is in line with
previous work. However, it remains unclear
what accuracy levels can be expected from listeners' judgements of age. Differences in
speakers' CA have to be taken into account as
well. A mean absolute error of 10 years could be considered less accurate for a 30-year-old speaker (a PA of 20 could be regarded as 20/30 = 66.7% correct) than for an 80-year-old speaker (a PA of 70 could be regarded as 70/80 = 87.5% correct). There is a need for a better measure of accuracy for age estimation tasks.
The fact that three different listener groups participated in the tests may also have influenced
the accuracy.
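The two notions of accuracy contrasted above are easy to make explicit. The sketch below (function names are ours) computes the mean absolute error alongside the age-relative score from the paragraph's examples; treating overestimates symmetrically is our assumption, beyond what the text states:

```python
def mean_abs_error(perceived, chronological):
    """Mean absolute difference between perceived and chronological ages."""
    return sum(abs(pa - ca)
               for pa, ca in zip(perceived, chronological)) / len(perceived)

def relative_score(pa, ca):
    """Age-relative accuracy, e.g. PA 20 vs. CA 30 -> 20/30.
    Overestimates are scored symmetrically (our assumption)."""
    return min(pa, ca) / max(pa, ca)

# The same 10-year error scores differently for different speaker ages:
print(round(100 * relative_score(20, 30), 1))  # 66.7
print(round(100 * relative_score(70, 80), 1))  # 87.5
print(mean_abs_error([20, 70], [30, 80]))      # 10.0
```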
In all of the four tests, the listeners' estimations of women were more accurate than those
of men, perhaps because the listeners were
mainly women. However, the influence of listener gender on performance in age estimation
tasks is still unclear. Although most researchers
have not reported any difference in performance between male and female listeners, some
studies have found females to perform better
than males, while others still have found male
listeners to perform somewhat better (Braun
and Cerrato, 1999). Another explanation could
be that the male speaker group contained a
larger number of atypical speakers, who consequently would be more difficult to judge, than
the female speakers. Shipp and Hollien (1969)
found that speakers who were difficult to age
estimate had standard deviations of nine years
and over. Perhaps such a measure can be used
to decide whether speakers are typical representatives of their CAs or not.
Stimulus type
Stimulus type also influenced the age estimations significantly (F(1,68)=61.143, p<.05).
The listeners' judgments of the male speakers
were more accurate for the spontaneous stimuli
than for the word stimuli. Lower mean absolute
errors were obtained for the two sets of spontaneous stimuli (9.9 and 11.6) compared to the
two sets of word stimuli (15.3 and 15.1). This
effect was not observed for the female speakers. Here, the mean absolute error for the 6
word stimuli (7.9) was lower than for the 3 second spontaneous stimuli (9.7), but higher than
the longer spontaneous stimuli (6.5). The interaction of speaker gender and stimulus type was
significant (F(1,68)=39.296, p<.05).
Listener cues
Most of the listeners named several cues, which
they believed had influenced their age judgements. Dialect, pitch and voice quality affected
the listeners' estimates in all four tests, while
semantic content influenced the judgements in
the tests with spontaneous stimuli. A common
listener remark in the tests with spontaneous
stimuli concerned speakers talking about the
past. They were often judged as being old, regardless of other cues. Additional listener cues
included speech rate, choice of words or
phrases and experience or familiarity with similar speakers (age group, dialect etc.).
Stimulus effects
In this study, longer durations for the most part
yielded higher accuracy for the listeners' age
estimates. This raises the question of optimal
durations for age estimation tasks. When does a
further increase in duration for a specific
speech or stimulus type no longer result in a
References
Braun A. and Cerrato L. (1999) Estimating speaker age across languages. Proceedings of ICPhS 99 (San Francisco), 1369-1372.
Brückl M. and Sendlmeier W. (2003) Aging female voices: An acoustic and perceptive analysis. Proceedings of VOQUAL 03 (Geneva), 163-168.
Cerrato L., Falcone M. and Paoloni A. (1998) Age estimation of telephonic voices. Proceedings of the RLA2C conference (Avignon), 20-24.
Higgins M. B. and Saxman J. H. (1991) A comparison of selected phonatory behaviours of healthy aged and young adults. Journal of Speech and Hearing Research 13, 1000-1010.
Huntley R., Hollien H. and Shipp T. (1987) Influences of listener characteristics on perceived age estimations. Journal of Voice 1, 49-52.
Linville S. E. (2001) Vocal Aging. San Diego: Singular Thomson Learning.
Müller C., Wittig F. and Baus J. (2003) Exploiting speech for recognizing elderly users to respond to their special needs. Proceedings of Eurospeech 2003 (Geneva), 1305-1308.
Murry T. and Singh S. (1980) Multidimensional analysis of male and female voices. JASA 68 (5), 1294-1300.
Ptacek P. H. and Sander E. K. (1966) Age recognition from voice. Journal of Speech and Hearing Research 9, 273-277.
Ramig L. A. and Ringel R. L. (1983) Effects of physiological aging on selected acoustic features. Journal of Speech and Hearing Research 26, 22-30.
Schötz S. (2005) Prosodic cues in human and machine estimation of female and male speaker age. In G. Bruce & M. Horne (eds) Nordic Prosody: Proceedings of the IXth Conference, Lund 2004. Frankfurt am Main: P. Lang, 215-223.
Shipp T. and Hollien H. (1969) Perception of the aging male voice. Journal of Speech and Hearing Research 12, 703-710.
Bruce G., Elert C.-C., Engstrand O. and Eriksson A. (1999) Phonetics and phonology of the Swedish dialects - a project presentation and a database demonstrator. Proceedings of ICPhS 99 (San Francisco), 321-324.
Abstract
This study is concerned with effects of age of onset (AO) of acquisition on the production of Voice Onset Time (VOT) among near-native L2 speakers. 41 L1 Spanish early and late learners of L2 Swedish, who had been carefully screened for their nativelike L2 proficiency, participated in the study. 8 native speakers of Swedish served as a control group. Spectral analyses of VOT were carried out on the subjects' production of the Swedish voiceless stops /p t k/. The preliminary results show an overall age effect on VOT in the nativelike L2 speakers' production of all three stops (answer to Research Question 1). Among the late learners, only a small minority exhibits actual nativelike L2 behavior (answer to Research Question 2). Finally, far from all early L2 speakers pass as native speakers of their L2 as regards the production of voiceless stops (answer to Research Question 3).
Introduction
Several studies on infant perception have
shown that first language (L1) phonetic categories are already established during the first year
of life (e.g. Eimas et al. 1971, Werker et al.
1984). Further evidence that very early exposure is of importance in L1 development comes
from children who were deprived of verbal
input due to inflammation of the middle ear
during their first year of life. Ruben (1997) reports that these children showed significantly
less capacity for phonetic discrimination compared to children with normal hearing during
infancy when they were tested at the age of
nine years. From these findings it has been concluded that there may exist a critical period for
phonetic/phonological acquisition and that this
critical period may already be over at the age of
one year (Ruben 1997).
One classical issue in the field of language
acquisition concerns whether this theory of the
existence of a critical period can be applied to
second language (L2) acquisition. More precisely, the question is whether L2 learners typically
fail to acquire phonetic detail because of lack
(1) Is there a general age effect on VOT production among L2 learners who are perceived by native listeners as native speakers of Swedish?
(2) Are there late L2 learners who produce
voiceless stops with an average VOT
within the range of native-speaker VOT?
(3) Do all (or most) early L2 learners produce
voiceless stops with an average VOT
within the range of native-speaker VOT?
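Questions (2) and (3) amount to a range check of each learner's average VOT against the native speakers' values. A minimal sketch follows; all numbers are invented for illustration, not the study's data:

```python
def within_native_range(learner_vots, native_vots):
    """True if the learner's mean VOT lies within the native min-max range."""
    mean = sum(learner_vots) / len(learner_vots)
    return min(native_vots) <= mean <= max(native_vots)

# Invented VOT values in ms, for illustration only
native_means = [55, 60, 68, 72]
print(within_native_range([58, 62, 61], native_means))  # True
print(within_native_range([20, 25, 24], native_means))  # False
```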
Method
Subjects
A total of 41 native speakers of Spanish (age 21-52 years), who had been selected on the criterion that native listeners perceive them as mother-tongue speakers of Swedish, participated in this study. The nativelike subjects' mean length of residence (LOR) in Sweden was 24 years (range 12-44 years) and their age of onset (AO) of L2 acquisition varied between 1 and 19 years. Furthermore, the subjects had an educational level of no less than senior high school, and they had all acquired the variety of Swedish spoken in the greater Stockholm area.
A control group was added consisting of 8 native speakers of Swedish who had been matched with the experimental group regarding present age, educational level and variety of Swedish.
Results
Since it is a well-known fact that VOT varies
with place of articulation (see, e.g. Lisker &
Abramson, 1964) results for the three voiceless
stops are presented separately.
Figures 1-3 show the subjects' average VOT values (in ms) plotted against their age of onset (AO).
The present study has revealed that among subjects who have been selected on the criterion that native listeners perceive them as mother-tongue speakers of Swedish, there exists a weak but statistically significant correlation between AO and VOT production. In other words, these findings confirm that there is a general age effect on voiceless stop production even among apparently nativelike L2 speakers (Research Question 1).
The analysis of the post-puberty group has shown that only two (AO = 14 years) of the 10 late L2 learners show average VOTs within the range of native-speaker VOT for all three stops. Furthermore, eight late learners do not produce VOTs within the native-speaker range for all three stops, and three of these subjects …
References
Abrahamsson, N., Stölten, K. & Hyltenstam, K.
(in press), Effects of age on voice onset
time: The production of voiceless stops by
near-native L2 speakers. To appear in: S.
Haberzettl (ed.), Processes and Outcomes:
Explaining Achievement in Language
Learning. Berlin: Mouton de Gruyter.
Eimas P. D., Siqueland E. R., Jusczyk, P., and
Vigorito J. (1971) Speech perception in infants. Science 171, 303-306.
Flege J. E. (1991) Age of learning affects the
authenticity of voice-onset time (VOT) in
stop consonants produced in a second language. Journal of the Acoustical Society of
America 89:1, 395-411.
Kessinger R. H. and Blumstein S. E. (1997) Effects of speaking rate on voice-onset time in Thai, French, and English. Journal of Phonetics 25, 143-168.
Lisker L. and Abramson A. (1964) A cross-language study of voicing in initial stops: Acoustical measurements. Word 20, 384-422.
Miller J. L., Green K. P., and Reeves A. (1986)
Speaking rate and segments: A look at the
relation between speech production and
speech perception for the voicing contrast.
Phonetica 43, 106-115.
Ruben R. J. (1997) A time frame of critical/sensitive periods of language development. Acta Otolaryngologica 117, 202-205.
Werker J.F. and Tees, R.C. (1984) Crosslanguage speech perception: Evidence for
perceptual reorganization during the first
year of life. Infant Behaviour and Development 7, 49-63.
Zampini M. L. and Green K. P. (2001) The
voicing contrast in English and Spanish:
The relationship between perception and
production. In: Nicol J. L. (ed) One Mind,
Two Languages. Oxford: Blackwell.
Acknowledgements
This study was in part supported by The Bank
of Sweden Tercentenary Foundation, grant no.
1999-0383:01.
Notes
1. A more detailed description and discussion
of this study will be given in Abrahamsson, Stölten & Hyltenstam (in press).
Abstract
This is an experimental study of tonal correlates of prosodic phrasing and focus production in Greek. The results indicate: (1) the tonal correlates of phrasing are a rising tonal command at phrase boundaries and a deaccentuation of the preboundary lexical stress; (2) the tonal correlates of focus are a local tonal range expansion aligned with the stressed syllable of the last lexical unit in focus, and a global tonal range compression, which is most evident in the speech material after the focus; (3) phrasing and focus have significant interactions, according to which the phrasing tonal command is suppressed as a function of focus production in the same linguistic domain.
Introduction
This study falls within a multifactor research context in linguistic structuring. We examine the relation between sound and meaning as a function of linguistic distinctions and linguistic structures in an integrated experimental framework, in the spirit of the ISCA Workshop on Experimental Linguistics (see Botinis, Charalabakis, Fourakis and Gawronska, 2005).
Phrasing and focus are abstract linguistic categories with distinctive functions in linguistic structuring. The basic functions of phrasing and focus are, respectively, the segmentation of continuous speech into a variety of meaningful linguistic units, and the marking of some linguistic units as more important than others. We have basic knowledge of both phrasing and focus from earlier research (e.g. Botinis, 1989; Fourakis, Botinis and Katsaiti, 1999; Botinis, Bannert and Tatham, 2000; Botinis, Ganetsou and Griva, 2004), but we do not have any knowledge of phrasing and focus interactions in the same linguistic domains.
In this study we present production data, while perception research on phrasing and focus interactions is being carried out. In the remainder of the paper, the experimental methodology is presented next, followed by the results and concluded by a discussion.
Experimental methodology
One experiment was designed in order to investigate distinctive phrasing and focus structures. The speech material consists of two compound test sentences with a phrasing distinction as well as four focus distinctions. The phrasing distinction involves the attachment of a surface subject to either the subordinate or the main clause. The focus distinctions involve one neutral production as well as three productions with focus on different constituents of the test sentences. The neutral production of the test sentences had no contextual information, whereas the focus productions were preceded by a question which elicited focus on different constituents of the test sentences.
The two test sentences were /ótan épeze bála, i maría ðjávaze arxéa/ ('When (he) was playing football, Maria was studying Ancient (Greek)') and /ótan épeze bála i maría, ðjávaze arxéa/ ('When Maria was playing football, (he) was studying Ancient (Greek)'). Thus, the noun Maria is the subject of the subordinate and the main clause in pre-comma and post-comma position respectively. With different elicitation questions, focus was assigned to the test sentences in three different ways, i.e. on the subordinate clause, on the main clause, and on the subject Maria.
Two female students of the Linguistics Department at Athens University produced the speech material in five repetitions at normal speech tempo. The speech material was recorded directly to computer disk and analysed with the WaveSurfer software package.
Three tonal measurements were taken in each syllable, i.e. at the beginning, middle and end, regardless of the segmental structure of the syllable. This methodology normalizes the tonal measurements with reference to the temporal and tonal alignments of the produced utterances.
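The three-measurements-per-syllable procedure can be sketched as follows; this is a minimal illustration with invented numbers, not the authors' actual analysis scripts:

```python
# Sketch: sample an F0 track at the beginning, middle and end of each syllable.
# The track values and syllable boundaries below are invented for illustration.

def sample_syllable_f0(times, f0, syllables):
    """For each (start, end) syllable interval (in s), linearly interpolate
    the F0 track at the start, midpoint and end of the interval."""
    def f0_at(t):
        for (t0, v0), (t1, v1) in zip(zip(times, f0), zip(times[1:], f0[1:])):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        return f0[-1]  # fallback for times past the end of the track
    return [(f0_at(s), f0_at((s + e) / 2), f0_at(e)) for s, e in syllables]

times = [0.00, 0.10, 0.20, 0.30, 0.40]  # s
f0    = [180, 200, 220, 210, 190]       # Hz
triples = sample_syllable_f0(times, f0, [(0.0, 0.2), (0.2, 0.4)])
print(triples)  # three F0 values per syllable, regardless of syllable duration
```

Because each syllable contributes exactly three values, contours of syllables with different durations become directly comparable.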
Results
The results of this study, in accordance with the experimental methodology described in the previous section, are presented as average values of the tonal measurements in Figure 1.
[Figure 1, panels 1a-2d: F0 tracks (Hz) over the syllable sequence of the two test sentences under the four focus conditions; the plotted data are not recoverable from the text extraction.]
Figure 1. Average values of tonal measurements as a function of prosodic phrasing (1-2), indicated
by solidus (/), and focus productions (a-d), indicated by capital letters (see text).
References
Botinis A. (1989) Stress and Prosodic Structure
in Greek. Lund University Press.
Botinis A., Bannert R., and Tatham M. (2000)
Contrastive tonal analysis of focus perception in Greek and Swedish. In Botinis A.
(ed.), Intonation: Analysis, Modelling and
Technology, 97-116. Dordrecht: Kluwer
Academic Publishers.
Botinis A., Ganetsou S., Griva M., and Bizani
H. (2004) Prosodic phrasing and syntactic
structure in Greek. The XVIIth Swedish
Phonetics Conference, 96-99, Stockholm,
Sweden.
Fourakis M., Botinis A., and Katsaiti M. (1999)
Acoustic characteristics of Greek vowels.
Phonetica 56, 28-43.
Abstract
This is an experimental study of syntactic and tonal correlates of focus in Greek and Russian. Three experiments were carried out, the results of which indicate: first, the dominant word order is SVO in both Greek and Russian; second, focus distinctions have inverse word order effects, according to which syntactic elements of focus elicitations are dislocated to sentence beginning and sentence end in Greek and Russian respectively; third, focus has a local tonal range expansion and a global tonal range compression in both Greek and Russian.
Introduction
This study is in the spirit of the forthcoming ISCA Workshop on Experimental Linguistics, to be held in Athens, Greece, in 2006 (see Botinis et al. 2005, this volume). Three experiments were carried out, the main questions of which are: (1) which is the unmarked word order? (2) which are the word order correlates of focus production? (3) which are the tonal correlates of focus production? These questions are also related to contrastive linguistics and language typology with reference to sentence structure production in Greek and Russian.
Experimental methodology
The basic language material of the three experiments in this study consists of controlled speech situations, in which experimenters from Athens and Saint Petersburg, for Greek and Russian respectively, were asked to produce utterances with reference to pictures on a computer screen showing apparent agent-action-goal semantic relations. The language material was recorded directly to computer disk and the tonal analysis was carried out with WaveSurfer.
The main objective of the first experiment was to investigate the unmarked word order of written sentence production. Lexical words corresponding to the syntactic categories subject (S), verb (V) and object (O) were copied from the basic language material and were written in …
Results
The results of the three experiments described in the previous section are shown in Figures 1 and 2, with reference to the syntactic and tonal correlates of focus distinctions respectively.
As shown in Figures 1a and 1b, SVO is the dominant word order in unmarked written production in both Greek (1a) and Russian (1b), with marginal word order variability across speakers' age and gender.
As shown in Figures 1c and 1d, the neutral elicitation of spoken productions has a dominant SVO structure in Russian (1d) but not in Greek (1c).
[Figure 1, panels a-f: distributions of word order types (SVO, VSO, OVS, VOS, SOV, OSV) by language, speaker group and focus condition; the plotted data are not recoverable from the text extraction.]
Figure 1. Greek (left) and Russian (right) word order of basic syntactic categories as a function of speakers' age and gender: written production (a-b), focus elicitations of spoken production (c-d) and focus elicitations of written production (e-f).
Figure 2. Tonal structures of variable word order and distinctive focus productions of the sentences /o erɣátis ftiáxni ti lába/ ('The worker repairs the lamp') and /máljtik njisjót glóbus/ ('The boy carries the globe') in Greek (left) and Russian (right) respectively (capital letters indicate focus).
References
Botinis A. (1989) Stress and Prosodic Structure
in Greek. Lund University Press.
Botinis A., Bannert R., and Tatham M. (1998)
Contrastive tonal analysis of focus perception
in Greek and Swedish. In Botinis A. (ed)
Intonation: Analysis, Modelling and
Technology. Dordrecht: Kluwer Academic
Publishers.
Botinis A., Charalabakis Ch., Fourakis M., and
Gawronska B. (2005) Athens 2006 ISCA
Workshop on Experimental linguistics (this
volume).
Hirst D., and Di Cristo A. (eds) (1998) Intonation
Systems. Cambridge University Press.
Svetozarova N. (1998) Intonation in Russian. In
Hirst D. and Di Cristo A. (eds) Intonation
Systems, 261-274. Cambridge University
Press.
Yoo H.-Y. (2003) Ordre des mots et prosodie. PhD dissertation, University of Paris 7.
Abstract
Attitudinally-varied back channel utterances were simulated by six professional voice actors for Japanese. Contrary to the general assumption that a pitch-accent language like Japanese cannot vary the tonal configuration for attitudinal variation as a stress/intonation language can, all the speakers differentiated two kinds of tonal configurations. Further variation was achieved by phrasing utterances differently on the pitch and timing dimensions, and by adding a rising or non-rising terminal contour.
Introduction
In many stress-accent languages that have traditionally been classified as intonational languages, attitudinal meaning is expressed by means of finely defined tonal contours. In contrast, pitch-accent languages such as Japanese are assumed not to be able to choose contour types for attitudinal meaning or emotion, because these languages use lexically fixed accent shapes (Mozziconacci 2000). Thus, apart from variation in the terminal contour, the dimensions in which intonation can vary are pitch range and phrasing (Beckman and Pierrehumbert 1986).
… Kitamura 2000; Katagiri, Sugito, and Nagano-Madsen 1999, 2001). This is because these studies deal with recordings of real-life communication; the samples were therefore not systematically varied or distributed, nor were they easily analysable, due to overlap of utterances.
In order to overcome the difficulties mentioned above and to balance the phonetic data on back channels in Japanese, we present a study of another kind: well-controlled simulated utterances recorded in a good acoustic environment. The back channels presented in this study are of the unrepeatable type, following Nagano-Madsen and Sugito's classification based on phonological form. Unrepeatable back channels look more like a proper utterance, whereas repeatable back channels are of the /so:so:/, /haihai/ 'yes, yes' type.
The first back channel dealt with in the present study is /a-soo-desu-ka/ 'Is that so? I see.', which was the second most common back channel after /N:/ 'yes' (Nagano-Madsen and Sugito 1999). Phonologically, it contains the H*L accent on /soo/. The second type, /yamada-san-desu-ka/ 'Is it Mr Yamada? / I see, it is Mr Yamada', is classified as an echo back channel, in which a keyword in the previous utterance (in this case 'Mr Yamada') is repeated as a back channel. This type of back channel shows a deeper concern from the listener and is frequently used where a stream of conversation becomes lively, with quick turn-taking (Sugito et al.). Phonologically, it contains the unaccented word /yamada/. In addition, /are-desu-ka/ 'Is it that? / It is that', which is similar to /yamada-san-desu-ka/ but shorter, is also included.
Contrary to traditional belief and assumptions, one of the findings of the present paper is the systematic variation in utterance-internal tonal configuration to express attitudinal meaning in back channels.
Material
Our speech material consists of high-quality recordings of six professional voice actors (three male and three female) who were in their 30s or 40s at the time of recording. Each of them produced three back channel utterances with neutral …
Tonal configurations by speaker (U-Z) and attitude:

       U    V    W    X    Y    Z
NEU    H    H    H    H    H    H(L)
Q      H    H    H    H    H    H
JOY    LH   LH   LH   H    H    LH
SUS    LH   LH   LH   LH   LH   LH
DIS    LH   LH   H    H    LH   H
Results
Auditory and acoustic analyses revealed that speakers modified several parameters in order to produce attitudinally-varied back channels in Japanese. These included variations in tempo, tonal configuration, pitch range, vowel quality, voice quality, and clarity of articulation. Of these, the most notable systematicity was observed for tonal configuration and for phrasing in the pitch and time dimensions. The rest of the paper will focus on these aspects.
Tonal configuration
Contrary to the general assumption that tonal
configuration cannot vary in a pitch accent language like Japanese, all six speakers were found
to use two tonal configurations. These contours are further differentiated in phrasing in
the pitch and time dimensions when expressing
various attitudes.
For the utterance /asoodesuka/ 'Is that so? / I see', where /soo/ is associated with the lexical H*L accent, it is interesting to note that the expected F0 fall was largely missing. Only on one occasion (speaker Z, NEU) was there a very slight F0 fall; all other cases were produced with either a level H or a rising LH contour. Maekawa (2004), whose data included basically the same utterance /soodesuka/ (without the initial interjection /a/) in his study of paralinguistic information in Japanese, noted this change from the H*L to the LH pattern in some of the utterances.
Phrasing
There was fairly good agreement among the
speakers as to how the two tonal configurations were phrased in the pitch and time dimensions. With one exception, the attitude
JOY always had the highest F0 peak, while the
lowest peak was typically found for DIS. SUS
and DIS were spoken more slowly than other
types of attitudes and therefore had a longer
duration. Figure 3 shows a typical example of
phrasing for four attitudes by speaker Y.
Figure 2a,b. F0 contours for JOY, Q, and NEU
(above) and SUS and DIS (below) for speaker U.
[Figure: peak F0 value (Hz) by attitude (NEU, JOY, DIS, SUS) for the six speakers (U-Z); the plotted data are not recoverable from the text extraction.]
[Figure: utterance duration (ms) by attitude (NEU, JOY, DIS, SUS) for the six speakers (U-Z); the plotted data are not recoverable from the text extraction.]
Discussion
Some of the findings reported here agree with those in Maekawa's (2002) study of paralinguistic information in Japanese. The present study revealed more systematic details in the way tonal configuration was varied in conveying attitudinally varied utterances.
Although the choice of tonal configuration is limited to two basic types, Japanese speakers were found to vary phrase-internal F0 contours systematically in order to express attitudinally-varied back channels. The test material contained both accented (H*L) and unaccented words. In the case of the accented word /soo/ in /asoodesuka/, the lexical accent was altered to either H or LH, the former being typically used for NEU. In unaccented words, variation in tonal configuration was achieved by modifying the rate of the initial F0 rise. What is consistent in both cases is that in the contours used for the unmarked attitude NEU, F0 reaches its peak earlier than in marked attitudes such as SUS. In X-JToBI (Maekawa et al. 2004), the Japanese ToBI labelling scheme, H- is introduced to indicate the onset of an F0 plateau. Both accented and unaccented words can thus be analysed as expressing marked attitudes by a delayed H-. Exactly on which mora the initial and final F0 maxima are placed varies between speakers. It should also be noted that the general assumption that the phrasal H- is on the second mora was not attested in the present data, even for utterances of the NEU and Q types. This phenomenon needs further investigation.
Apart from the phrase-internal tonal variations described above, there are differences in phrasing and terminal contours. These prosodic characteristics are further modified by vowel quality, voice quality, and clarity of articulation. Identifi…
References
Ayusawa T. (editorial supervision) (2001) Accent and intonation in Tokyo Japanese. CALL sub-teaching material series, Japanese prosody Vol. 1. National Institute of Multimedia Education.
Beckman M. E. and Pierrehumbert J. B. (1986) Japanese prosodic phrasing and intonation synthesis. Proceedings of the 24th Meeting of the Association for Computational Linguistics.
Katagiri Y., Sugito M., and Nagano-Madsen Y. (1999) The forms and prosody of back channels in Tokyo and Osaka Japanese. Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco (CD).
Katagiri Y., Sugito M., and Nagano-Madsen Y. (2001) An analysis of forms and prosodic characteristics of Japanese 'aiduti' in dialogue (in Japanese). In Bunpoo to Onsei (= Speech and Grammar) III, 263-274. Tokyo: Kuroshio Publication.
Maekawa K. (2002) Production and perception of paralinguistic information. Proceedings of International Conference Speech Prosody 2004, Nara, 367-374.
Maekawa K., Igarashi Y., Kikuchi E., and Yoneyama S. (2004) Intonation labelling for the Corpus of Spontaneous Japanese, version 1.0 (in Japanese). Electronic document for the Corpus of Spontaneous Japanese.
Mozziconacci S. (2000) Prosody and emotions: A conceptual framework for research. In Proceedings Online: ISCA Workshop on Speech and Emotion.
Nagano-Madsen Y. and Sugito M. (1999) Analysis of back channel items in Tokyo and Osaka Japanese (in Japanese). In Japanese Linguistics 5, 26-45. Tokyo: National Language Research Institute.
Sugito M., Nagano-Madsen Y., and Kitamura M. (1999) The pattern and timing of repeat-back channels in natural dialogue in Japanese (in Japanese). In Bunpoo to Onsei (= Speech and Grammar) II, 3-18. Tokyo: Kuroshio Publication.
Abstract
We present an experiment where subjects were
asked to listen to Swedish human-computer dialogue fragments where a synthetic voice makes
an elliptical clarification after a user turn. The
prosodic features of the synthetic voice were
systematically varied, and subjects were asked
to judge the computer's actual intention. The
results show that an early low F0 peak signals
acceptance, that a late high peak is perceived
as a request for clarification of what was said,
and that a mid high peak is perceived as a request for clarification of the meaning of what
was said. The study can be seen as the beginnings of a tentative model for intonation of
clarification ellipses in Swedish, which can be
implemented and tested in spoken dialogue systems.
Introduction
Detection of and recovery from errors is important for spoken dialogue systems. To this
effect, system hypotheses are often verified explicitly or implicitly: the system makes a clarification request or repeats what it has heard.
These error handling techniques are often perceived as tedious, and one of the reasons for
this is that they are often constructed as full
propositions, verifying the complete user utterance. In contrast, humans often use short elliptical constructions for clarification: Purver et al. (2001) show that 45% of the clarification requests in the British National Corpus (BNC) are
elliptical. A dialogue system using word level
confidence scores could use elliptical clarifications to focus on problematic fragments, making the dialogue more efficient (Gabsdil, 2003).
However, the interpretation of ellipses is often
dependent both on context and on prosody, and
the prosody of clarification requests has not
been greatly studied.
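The word-level strategy mentioned above can be sketched as follows; this is a hypothetical fragment chooser for illustration, not the HIGGINS implementation, and the threshold and input format are invented:

```python
# Sketch: pick the lowest-confidence word of a recognition hypothesis and
# turn it into an elliptical clarification request (a "reprise fragment").
# The ASR output format and confidence threshold are invented for illustration.

def clarification_fragment(hypothesis, threshold=0.7):
    """hypothesis: list of (word, confidence) pairs from the recognizer."""
    word, conf = min(hypothesis, key=lambda wc: wc[1])
    if conf < threshold:
        return word.capitalize() + "?"  # e.g. user: "a red building" -> "Red?"
    return None                         # confident enough: no clarification needed

print(clarification_fragment([("a", 0.9), ("red", 0.4), ("building", 0.8)]))  # → Red?
print(clarification_fragment([("a", 0.9), ("red", 0.9), ("building", 0.8)]))  # → None
```

Clarifying only the problematic fragment, rather than the whole proposition, is what makes the dialogue more efficient.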
We present an experiment in which subjects
were asked to listen to Swedish dialogue fragments where the computer makes elliptical
clarifications after user turns, and to judge what
was actually intended by the computer. The
study is connected to the HIGGINS spoken dialogue system (Edlund et al., 2004). The primary
Clarification
Clarification is part of a process called grounding (Clark, 1996) or interactive communication
management (Allwood et al., 1992). In this
process, speakers give positive and negative
evidence or feedback of their understanding of
what the interlocutor says. A clarification may
often give both positive and negative evidence
showing what has been understood as well as
what is needed for complete understanding.
Clarification requests may have both different forms and different readings (i.e. functions).
In a study of the BNC, Purver et al. (2001)
studied the form and function of clarification
requests. According to their scheme, the form
of clarification ellipses studied in this paper, as
exemplified in Table 2, is called reprise fragments.
We will use a distinction made by both
Clark (1996) and Allwood et al. (1992) in order
to classify possible readings of reprise fragments. They suggest four levels of action that
take place when speaker S is trying to say
something to hearer H:
Acceptance: H accepts what S says.
Understanding: H understands what S
means.
Perception: H hears what S says.
Contact: H hears that S speaks.
Method
Three test words comprising the three colors blue, red and yellow (blå, röd, gul) were synthesized using an experimental version of the LUKAS diphone Swedish male MBROLA voice (Filipsson & Bruce, 1997), implemented as a plug-in to the WaveSurfer speech tool (Sjölander & Beskow, 2000).
For each of the three test words the following prosodic parameters were manipulated: 1) peak POSITION, 2) peak HEIGHT, and 3) vowel DURATION. Three peak positions (early, mid and late) were obtained by time-shifting the focal accent peak in intervals of 100 ms. A low-peak and a high-peak set of stimuli were obtained by setting the accent peak at 130 Hz and 160 Hz respectively. Two sets of stimulus durations (normal and long) were obtained by lengthening the default vowel length by 100 ms. All combinations of the three test words and the three parameters gave a total of 36 different stimuli. Six additional stimuli, making a total of 42, were created by using both the early and late peaks in the long-duration stimuli, which created double-peaked stimuli. A possible …
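The stimulus inventory just described (3 test words × 3 peak positions × 2 peak heights × 2 durations = 36, plus 6 double-peak stimuli) can be enumerated as a quick sanity check:

```python
# Sketch: enumerate the stimulus combinations described in the text.
from itertools import product

words     = ["blå", "röd", "gul"]
positions = ["early", "mid", "late"]
heights   = [130, 160]          # accent peak in Hz (low, high)
durations = ["normal", "long"]  # long = default vowel length + 100 ms

stimuli = list(product(words, positions, heights, durations))
double_peak = [(w, "early+late", h, "long") for w, h in product(words, heights)]

print(len(stimuli), len(stimuli) + len(double_peak))  # → 36 42
```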
Table 3: Interpretations that were significantly overrepresented, given the values of the parameters POSITION and HEIGHT and their interactions, with the standardized residuals from the χ2 test.

POSITION          Interpretation           Std. resid.
Early             ACCEPT                   3.1
Mid               CLARIFY-UNDERSTANDING    4.6
Late              CLARIFY-PERCEPTION       3.6

HEIGHT            Interpretation           Std. resid.
High              CLARIFY-UNDERSTANDING    3.2
Low               ACCEPT                   4.0

POSITION*HEIGHT   Interpretation           Std. resid.
Early*Low         ACCEPT                   3.4
Mid*Low           ACCEPT                   3.4
Mid*High          CLARIFY-UNDERSTANDING    5.6
Late*High         CLARIFY-PERCEPTION       4.4
[Figure 2: number of votes for ACCEPT, CLARIFY-UNDERSTANDING and CLARIFY-PERCEPTION as a function of peak position (early, mid, late) for HIGH and LOW peaks; the plotted data are not recoverable from the text extraction.]
Results
There were no significant differences in the distribution of votes between the different colors (red, blue, and yellow) (χ2 = 3.65, df = 4, p > 0.05), nor were there any significant differences for any of the eight subjects (χ2 = 19.00, df = 14, p > 0.05). Nor did the DURATION parameter have any significant effect on the distribution of votes (χ2 = 5.72, df = 2, p > 0.05). Both POSITION and HEIGHT had significant effects on the distribution of votes, as shown in Table 3 (χ2 = 70.22, df = 4, p < 0.001 and χ2 = 59.40, df = 2, p < 0.001, respectively). The interaction of POSITION and HEIGHT also gave rise to significant effects (χ2 = 121.12, df = 10, p < 0.001), as shown at the bottom of Table 3. Figure 2 shows the distribution of votes for the three interpretations as a function of position for both high and low HEIGHT. Results from the double-peak stimuli were generally more complex and are not presented here.
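χ2 analyses with standardized residuals of the kind reported in Table 3 can be reproduced along these lines. The vote counts below are invented for illustration, and the residuals are computed as Pearson residuals (O − E)/√E; the paper does not specify its exact residual formula:

```python
# Sketch: chi-square statistic and standardized (Pearson) residuals for a
# votes table (rows: peak position; columns: interpretation). Counts invented.

table = {
    "early": {"ACCEPT": 30, "CLARIFY-UNDERSTANDING":  8, "CLARIFY-PERCEPTION":  6},
    "mid":   {"ACCEPT": 10, "CLARIFY-UNDERSTANDING": 25, "CLARIFY-PERCEPTION":  9},
    "late":  {"ACCEPT":  8, "CLARIFY-UNDERSTANDING": 10, "CLARIFY-PERCEPTION": 26},
}

rows = list(table)
cols = list(next(iter(table.values())))
n = sum(sum(r.values()) for r in table.values())
row_tot = {r: sum(table[r].values()) for r in rows}
col_tot = {c: sum(table[r][c] for r in rows) for c in cols}

chi2 = 0.0
resid = {}
for r in rows:
    for c in cols:
        expected = row_tot[r] * col_tot[c] / n
        resid[r, c] = (table[r][c] - expected) / expected ** 0.5  # standardized residual
        chi2 += resid[r, c] ** 2

print(f"chi2 = {chi2:.2f}")
print(f"early/ACCEPT residual = {resid['early', 'ACCEPT']:.1f}")
```

Large positive residuals mark the cells (e.g. early peak with ACCEPT) that are overrepresented relative to independence.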
Discussion
The most interesting result in this experiment
from both a spoken dialogue system perspective
and a prosody modeling framework concerns
the strong relationship between intonational
form and meaning. For these single-word utterances …
Acknowledgements
This research was carried out at the Centre for
Speech Technology, a competence centre at
KTH, supported by VINNOVA (The Swedish
Agency for Innovation Systems), KTH and participating Swedish companies and organizations, and was also supported by the EU project
CHIL (IP506909).
References
Allwood, J., Nivre, J., & Ahlsén, E. (1992). On
the semantics and pragmatics of linguistic
feedback. Journal of Semantics, 9, 1-26.
Bolinger, D. (1989). Intonation and its uses:
Melody in grammar and discourse. London:
Edward Arnold.
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Edlund, J., Skantze, G., & Carlson, R. (2004).
Higgins - a spoken dialogue system for investigating error handling techniques. In
Proceedings of ICSLP, 229-231.
Filipsson, M. & Bruce, G. (1997). LUKAS - a
preliminary report on a new Swedish speech
synthesis. Working Papers 46, Department
of Linguistics and Phonetics, Lund University.
Gabsdil, M. (2003). Clarification in spoken dialogue systems. In Proceedings of the AAAI
spring symposium on natural language generation in spoken and written dialogue.
Gårding, E. (1998). Intonation in Swedish. In
D. Hirst and A. Di Cristo (eds.) Intonation
Systems. Cambridge: Cambridge University
Press, 112-130.
Ginzburg, J. & Cooper, R. (2001). Resolving
ellipsis in clarification. In Proceedings of
the 39th meeting of the ACL.
House, D. (2003). Perceiving question intonation: the role of pre-focal pause and delayed
focal peak. In Proc 15th ICPhS, Barcelona,
755-758
Ladd, D. R. (1996). Intonational phonology.
Cambridge: Cambridge University Press.
Purver, M., Ginzburg, J., & Healey, P. (2001).
On the means for clarification in dialogue.
In Proceedings of SIGDial.
Rodriguez, K. J. & Schlangen, D. (2004). Form,
intonation and function of clarification requests in German task oriented spoken dialogues. In Proceedings of Catalog '04 (The
8th Workshop on the Semantics and Pragmatics of Dialogue, SemDial04), Barcelona,
Spain.
Schlangen, D. (2004). Causes and strategies for
requesting clarification in dialogue. In Proceedings of SIGDial.
Sjölander, K. & Beskow, J. (2000). WaveSurfer - a public domain speech tool. In Proceedings of ICSLP 2000, 4, 464-467, Beijing,
China.
Abstract
Three different scales which have been used to
measure perceived prominence are evaluated in
a perceptual experiment. Average scores of
raters using a multi-level (31-point) scale, a simple binary (2-point) scale and an intermediate
4-point scale are almost identical. The potentially finer gradation possible with the multi-level scale(s) is compensated for by having multiple listeners, which is also a requirement for
obtaining reliable data. In other words, a high
number of levels is neither a sufficient nor a necessary requirement. Overall the best results were
obtained using the 4-point scale, and there seems
to be little justification for using a 31-point scale.
Introduction
The purpose of this paper is to evaluate the use of different scales for measuring the perceived prominence of syllables and words. In this investigation only word-level prominence is considered.
Prominence, as perceived by groups of raters, has been measured on different types of scale. Some use a 31-point scale from 0 to 30, first described in Fant & Kruckenberg (1989). The strength of this scale is that it allows for very fine gradation of the perceived prominence, even for a single rater, but this also makes the task quite difficult. Others, e.g. Wightman (1993), have proposed to use instead a simple binary (2-point) scale (0 or 1) and to use the cumulative (or average) score of each word as an expression of its level of prominence, which results in a much simpler task for the raters. The disadvantage of this simple scale is that it may force raters to conflate items which they perceive as different, but within the same category, which could lead to a reduced or lost ability to distinguish variations in perceived prominence at either end of the prominence continuum, for example between accented words with or without special emphasis. In addition, the level of gradation achievable with this scale is directly proportional to the number of raters: to get the same gradation as is (potentially) possible with the scale from 0 to 30 you need 30 raters. As a possible compromise between these two scales one could use a 4-point scale (e.g. from 0 to 3). While this scale is much simpler …
Method
The speech material chosen to evaluate the scales was two short monologues from the Danish DanPASS project (http://www.cphling.dk/pers/ng/danpass.htm), both recordings of a map task activity. The two monologues, by two different male speakers, included a total of 123 words. The monologues were divided into shorter phrases which were presented via a web page (one phrase per page). The raters could hear the phrase as many times as they wanted by pressing a play button, and indicated their judgment by clicking the appropriate scale point. Time consumption and a count of sound file playbacks were recorded for each phrase.
A large group of raters participated in the experiment and were randomly assigned to a specific scale. Equally sized groups of 19 raters (the size of the smallest group) were selected for the analyses. The instructions to the raters were presented from the web page and were identical for all three groups, except for the details about the specific scale. The concept of prominence was explained and exemplified, and raters were advised that prominence might be a question of more or less. 0 represented no prominence, but no other scale points were defined. Prominent words could be assigned values up to the scale maximum. Raters using the 2-point scale were informed that they could not grade their ratings but were given a forced choice.
Results
Reliability
Note: the phrase "the 2/4/31-point scale" is used in the following as shorthand for the prominence ratings obtained from the group of listeners using the 2-, 4- or 31-point scale.

The reliability of the data was tested by calculating Cronbach's α coefficient, which expresses the extent to which the scores of the individual raters covary. The coefficients for all three groups are high (from 0.94 to 0.96) and the difference between them is nonsignificant (M = 1.02, p > 0.05).
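As a sketch of how this reliability coefficient is computed (the formula below is the standard Cronbach's α; the toy scores are invented for illustration, not the study's data):

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (items x raters) matrix of scores.

    alpha = (k / (k - 1)) * (1 - sum(rater variances) / variance of totals),
    where k is the number of raters and the totals are the per-item sums
    of all raters' scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                       # number of raters
    item_totals = ratings.sum(axis=1)          # summed score per word
    rater_vars = ratings.var(axis=0, ddof=1)   # variance of each rater's scores
    total_var = item_totals.var(ddof=1)
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Toy data: six words rated by three raters on a 0-3 scale (hypothetical).
scores = [[0, 0, 1],
          [3, 2, 3],
          [1, 0, 1],
          [2, 2, 3],
          [0, 1, 0],
          [3, 3, 2]]
print(round(cronbach_alpha(scores), 2))  # high agreement -> alpha near 1
```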
Comparison of prominence ratings
The first question to be addressed is whether the prominence ratings on the three scales express the same relations between words. In order to make direct comparisons, all scores were normalised by dividing each value by the scale maximum (1, 3 or 30, respectively), which fits all data to a normalised scale of 0 to 1 without affecting the relations between scores. These values were then plotted on a line chart for simple
visual inspection. An example diagram of one
phrase is shown in Fig. 1.
[Figure 1: Normalised perceived prominence on the 2-, 4- and 31-point scales for the phrase "til du kommer til det næste kryds" ('till you come to the next intersection'); graphic not reproduced.]
The diagrams showed a high level of agreement across the three scales, which was further tested in a correlation analysis (Spearman's ρ). The result can be seen in Table 1.
Table 1: Correlation coefficients (Spearman's ρ) across all three scales.

Correlation   4-pt    31-pt
2-pt          0.933   0.926
4-pt                  0.964
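The rank-correlation computation can be sketched with SciPy (the values below are invented for illustration; Spearman's ρ depends only on rank order, not absolute values):

```python
from scipy.stats import spearmanr

# Hypothetical normalised mean prominence for six words on two scales.
two_pt = [0.95, 0.60, 0.30, 0.85, 0.10, 0.45]
four_pt = [0.90, 0.55, 0.35, 0.80, 0.05, 0.40]

rho, p = spearmanr(two_pt, four_pt)
print(rho)  # identical rank order despite different absolute values
```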
All correlations are high; the strongest is between the 4-point and the 31-point scale. The preliminary conclusion is clear: raters arrive at approximately the same rank order of perceived prominence regardless of the scale used.
It appears from Fig. 1 that the 2-point scale displays somewhat larger variation in values between the scale minimum and maximum than the 4-point scale and especially the 31-point scale. This was in fact a general trend, demonstrating a certain compression of values on the 31-point scale (and to a lesser degree the 4-point scale), while the 2-point scale has more mean values near the scale extremes. Analyses of the distribution of scores (inter-quartile range for each rater and visual inspection of x-y plots) showed that many raters on the 31-point scale assigned most ratings to a restricted, sometimes very restricted, range of the scale, either at the lower, the middle or the higher end of the scale. There are therefore no mean values at the scale extremes, although there were many individual scores near the minimum and maximum values.
Obtaining significant differences
One very important aspect of choosing a scale is
whether it will affect the ability to obtain statistically significant differences between test items.
The hypothesis might be that scales with too few
points (most notably the 2-point scale) would
mask subtle perceptual differences which could
be brought out with more scale points.
The suitability of the three scales for quantitative analysis was tested by examining the association between perceived prominence and three linguistic phenomena: part-of-speech membership, information structure and a specific acoustic correlate, namely F0. The purpose was to see whether data obtained using the three different scales lead to different conclusions about linguistic structure.
Comment on the statistical procedures
Since it is not possible to compare results directly across scale types, we simply decided to use the statistical procedures felt to be most appropriate for each individual scale. This resembles the choice researchers themselves face when deciding on a scale type.
For all scales we have decided to use nonparametric methods. For significance testing on
the 2-point scale we use the Fisher exact test or
a chi-square test with corrections for continuity
(when n > 40), and for the other two scales we
use the Wilcoxon-Mann-Whitney test with correction for ties (WMW).
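This testing scheme can be sketched with SciPy (counts and scores below are invented; the n > 40 rule for choosing the chi-square test follows the text):

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency, mannwhitneyu

# Binary (2-point) scale: compare 2x2 counts of prominent vs. non-prominent
# votes for two word classes (counts invented for illustration).
table = np.array([[30, 8],     # class A: prominent, not prominent
                  [12, 26]])   # class B
if table.sum() <= 40:
    _, p_binary = fisher_exact(table)
else:
    # Chi-square test with Yates continuity correction, as in the text.
    _, p_binary, _, _ = chi2_contingency(table, correction=True)

# Graded (4- and 31-point) scales: compare scores directly with the
# Wilcoxon-Mann-Whitney test; SciPy corrects for ties automatically.
class_a = [0.9, 0.8, 0.7, 0.9, 0.6]   # invented normalised scores
class_b = [0.3, 0.2, 0.4, 0.3, 0.1]
_, p_graded = mannwhitneyu(class_a, class_b, alternative="two-sided")
print(p_binary < 0.05, p_graded < 0.05)
```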
Table 2: Mean normalised prominence (x̄) of nine parts of speech, ranked on each scale. Nonsignificant differences between adjacent classes: 2-point: int-adv, v-pron, conj-art, art-prep; 4-point: v-pron, conj-art; 31-point: n-int, pron-v, conj-art.

Rank  2-point        4-point        31-point
1     adj   0.92     adj   0.73     adj   0.67
2     n     0.78     n     0.66     n     0.63
3     int   0.60     int   0.50     int   0.58
4     adv   0.58     adv   0.38     adv   0.40
5     v     0.34     v     0.30     pron  0.35
6     pron  0.33     pron  0.30     v     0.35
7     conj  0.17     prep  0.21     prep  0.28
8     art   0.13     conj  0.13     conj  0.24
9     prep  0.10     art   0.12     art   0.22

Number of words per class (n): adjectives 9, nouns 28, interjections 3, adverbs 12, verbs 13, pronouns 16, conjunctions 10, articles 2, prepositions 30.

Parts of speech

The mean prominence ratings of nine parts of speech are listed in Table 2, ordered according to their ranking on each scale. These rankings are very similar for all three scales. The only difference which can be detected is the relegation of prepositions to ninth place on the 2-point scale, instead of the seventh place it holds on the other two scales. (The different ranking of pronouns and verbs on the 31-point scale is irrelevant.) Most of the differences between the classes are significant: except for two cases on the 31-point scale (see the table caption) all differences between classes which are not adjacent in the rankings are significant, and of the differences between adjacent classes four are nonsignificant on the 2-point scale, two are nonsignificant on the 4-point scale, and three are nonsignificant on the 31-point scale (giving a total of five differences which are not significant for this scale). These figures are quite similar, with a small bias in favour of the 4-point scale, where the highest number of significant differences was found.

Information structure

Chafe (1994) states that new information is more prominent than non-new information. To test the validity of this statement we compared the prominence ratings of all words carrying new information with the most prominent word carrying non-new information in the same phrase (20 cases), thus testing the hypothesis that new information is more prominent than other information (H1). H0 states that the perceived prominence of the new information is less than or equal to that of the given/accessible information.

In four cases (three on the 31-point scale) the new information is not more prominent than the non-new information, in which case H0 cannot be rejected.
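One simple way to test H1 against H0 here (not necessarily the authors' exact procedure) is a sign test on the number of phrases going in the predicted direction; the 16-of-20 count below follows the four exceptions reported in the text:

```python
from scipy.stats import binomtest

# Sign test sketch for H1 (new information more prominent): 16 of the 20
# phrase comparisons go in the predicted direction. The choice of a sign
# test here is illustrative, not taken from the paper.
result = binomtest(16, n=20, p=0.5, alternative="greater")
print(result.pvalue < 0.05)  # True: H0 rejected at the 5% level
```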
Scale   2-pt    4-pt    31-pt
        0.593   0.626   0.606
Two main questions were asked about the influence of scale type on ratings of perceived prominence: 1) do we get the same prominence relations in utterances, as expressed in mean values and rankings, and 2) does scale type affect our ability to make observations about statistically significant differences between words. The overall conclusion must be that the perceived prominence relations in the utterances are very similar whether expressed on a 2-point scale, a 4-point scale or a 31-point scale. The differences are small and are mostly caused by a tendency for some raters to prefer a restricted range within a multi-level scale. The differences are also relatively small when it comes to statistical testing of observations, but it does seem that raising the number of scale points from two to four yields slightly better results: there are more significant differences between the part-of-speech classes.

References

Fant, G. and Kruckenberg, A., Preliminaries to the study of Swedish prose reading and reading style, STL-QPSR 2/1989, 1-83, 1989.

Wightman, C., Perception of multiple levels of prominence in spontaneous speech, ASA 126th Meeting, Denver, 1993 (abstract).

Chafe, W., Discourse, Consciousness, and Time: The Flow and Displacement of Conscious Experience in Speaking and Writing, The University of Chicago Press, 1994.
Abstract
Is the duration of the post-vocalic consonant in stressed syllables an important property when teaching Swedish as an L2? Is it a cue to the discrimination of /V:C/ and /VC:/ words, a buffer for proper syllable duration, or both? Four Swedish words, providing two minimal pairs with respect to phonological quantity and containing the vowel phonemes /ɛ/ and /ʉ/, were gradually changed temporally from /V:C/ to /VC:/ and vice versa. Manipulations of durations were made in two series: one changing vowel duration only, and one changing vowel and consonant duration in combination. 30 native Swedish listeners decided whether they perceived the test words as the original quantity type or not. The results show that the duration of the post-vocalic consonant had substantial influence on how the listeners categorized the test words. The study also includes naturalness judgements of the test words, and here the proper post-vocalic consonant duration had a positive influence on the listeners' judgements of naturalness for /ɛ/ but not for /ʉ/.
Introduction

Teaching and learning the pronunciation of a second language comprises many considerations as to which phonetic features are more or less important in order, on the one hand, to make oneself understood and, on the other, not to disturb the listener. Bannert (1984) states: "… to improve pronunciation when learning a foreign language, linguistic correctness has been the guiding principle. It seems, however, that hardly any consideration has been given to the native listener's problems of understanding foreign accent."
In the past 20-25 years, a simplified description of Swedish prosody for pedagogical use has appeared in a number of teaching media (Kjellin 1978, Fasth & Kannermark 1989, Slagbrand & Thorén 1997). The description is …
Method
Stimuli
The test words in the present study are mäta [ˈmɛːta] 'to measure', mätta [ˈmɛtːa] 'to satisfy', skuta [ˈskʉːta] 'boat' and skutta [ˈskɵtːa] 'to scamper'. These words provide two minimal pairs with respect to phonological quantity. One pair contains the vowel phoneme /ɛ/, and the other pair contains the vowel phoneme /ʉ/. The words were recorded in a fairly damped room in the present author's home, using a RØDE NT3 condenser microphone and a Sony MZ-N710 mini-disc player. The speaker was a Swedish male speaking central standard Swedish (Stockholm variety). The test words were pronounced within a carrier phrase: Det var … jag menade 'It was … that I meant'.
Vowel and consonant durations in the test words were manipulated in Praat (Boersma & Weenink 2001). All stimuli were given stepwise vowel duration changes. Half of the stimuli kept a constant consonant duration, identical to that of the original quantity type, and the other half were given stepwise consonant duration changes based on the original values for the non-original quantity type. The manipulated durations are shown in Table 1.
Listeners
30 native speakers of Swedish listened to the 48 stimulus words, marking whether they perceived them as /V:C/ or /VC:/. The listeners were between 23 and 60 years of age and had different regional varieties of Swedish as their L1. None of them had any hearing deficiencies that affected their perception of normal speech.
Presentation
The 48 stimuli were presented in random order, in the carrier phrase, preceded by the reading of the stimulus number. The test was presented from …
Table 1. Manipulated vowel (V) and consonant (C) durations (ms). In the vowel-only series the consonant is held at the duration of the original quantity type; in the combined series V and C change complementarily.

Vowel-only series:
mäta [ˈmɛːta]:   V 188, 168, 148, 128, 108, 88;   C constant 153
mätta [ˈmɛtːa]:  V 136, 156, 176, 196, 216, 236;  C constant 334
skuta [ˈskʉːta]: V 141, 121, 101, 81, 61, 41;     C constant 166
skutta [ˈskɵtːa]: V 166, 186, 206, 226, 246, 266; C constant 312

Combined series (V, C):
mäta:   (188, 234), (168, 254), (148, 274), (128, 294), (108, 314), (88, 334)
mätta:  (136, 253), (156, 233), (176, 213), (196, 193), (216, 173), (236, 153)
skuta:  (141, 232), (121, 252), (101, 272), (81, 292), (61, 312), (41, 332)
skutta: (166, 246), (186, 226), (206, 206), (226, 186), (246, 166), (266, 146)
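The stepwise series in Table 1 can be generated programmatically; the sketch below is illustrative (the function is not the authors' tooling, and the word and values follow the table above):

```python
def duration_continuum(v_start, v_end, c_start, c_end, steps=6):
    """Return stepwise (V, C) duration pairs in ms, linearly interpolated.

    For the vowel-only series, pass c_start == c_end (constant consonant);
    for the combined series, V and C change complementarily.
    """
    pairs = []
    for i in range(steps):
        frac = i / (steps - 1)
        v = round(v_start + frac * (v_end - v_start))
        c = round(c_start + frac * (c_end - c_start))
        pairs.append((v, c))
    return pairs

# Combined series for mäta: vowel 188 -> 88 ms while C grows 234 -> 334 ms.
print(duration_continuum(188, 88, 234, 334))
# [(188, 234), (168, 254), (148, 274), (128, 294), (108, 314), (88, 334)]
```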
Result
In both the vowel lengthening series and the
vowel shortening series, the complementary
consonant manipulation seems to have distinct
influence on the listeners perception of /VC/ or
/VC/ (figure 1 and 2). Listeners start to perceive stimuli as non-original quantity type at
lower degree of vowel duration change when
the post-vocalic consonant duration follows the
complementary pattern.
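One way to quantify such a category shift (illustrative only: the response proportions below are invented, and this analysis is not taken from the paper) is to fit a logistic psychometric function to each series and compare the 50% crossover points:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical proportions of /VC:/ responses as the vowel is shortened.
vowel_dur = np.array([188.0, 168.0, 148.0, 128.0, 108.0, 88.0])
p_vcc = np.array([0.05, 0.10, 0.30, 0.70, 0.90, 0.97])

def psychometric(x, x0, k):
    """P(/VC:/ response): decreases as vowel duration x grows past x0."""
    return 1.0 / (1.0 + np.exp(k * (x - x0)))

(x0, k), _ = curve_fit(psychometric, vowel_dur, p_vcc, p0=[140.0, 0.1])
print(round(x0))  # estimated category boundary in ms
```

A shift of the fitted boundary between the vowel-only and combined series would express the consonant's contribution in milliseconds.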
[Figures 1 and 2: number of listeners perceiving the non-original quantity type as a function of vowel duration (ms); graphics not reproduced.]

For //, the complementary manipulation seems to make …
References

Bannert R. (1984) Prosody and intelligibility of Swedish spoken with a foreign accent. Nordic Prosody III. Acta Universitatis Umensis, Umeå Studies in the Humanities 59, 7-18.

Bannert R. (1986) From prominent syllables to a skeleton of meaning: a model of prosodically guided speech recognition. In Proceedings of the XIth ICPhS, Tallinn, 73-76.

Behne D., Czigler P. and Sullivan K. (1997) Swedish quantity and quality: a traditional issue revisited. In Phonum 4, Dept of Linguistics, Umeå University.

Boersma P. & Weenink D. (2001) Praat: a system for doing phonetics by computer. http://www.fon.hum.uva.nl/praat/

Fant G. and Kruckenberg A. (1994) Notes on stress and word accent in Swedish. KTH, Speech Transmission Laboratory, Quarterly Progress and Status Report 2-3, 125-144.

Fasth C. & Kannermark A. (1989) Goda grunder. Kursverksamhetens förlag, Lund.

Gårding E. (1974) Den efterhängsna prosodin. In: Teleman & Hultman (eds.) Språket i bruk. Liber, Lund.

Hadding-Koch K. & Abramson A. (1964) Duration versus spectrum in Swedish vowels: some perceptual experiments. Studia Linguistica 18, 94-107.

Kjellin O. (1978) Svensk prosodi i praktiken. Hallgren & Fallgrens studieförlag, Uppsala.

Slagbrand Y. & Thorén B. (1997) Övningar i svensk basprosodi. Semikolon, Boden.

Thorén B. (2001) Vem vinner på längden? Två experiment med manipulerad duration i betonade stavelser. D-uppsats i fonetik. Institutionen för filosofi och lingvistik, Umeå universitet.

Thorén B. (2003) Can V/C-ratio alone be sufficient for discrimination of V:C/VC: in Swedish? A perception test with manipulated durations. Phonum 9 (Department of Phonetics, Umeå University), 49-52.
Conclusion

The result shows that the duration of the post-vocalic consonant is more than a means to assign the proper length to stressed syllables. It obviously plays a distinctive role for the perception of quantity type in the present material. Since the vowels involved represent the maximal (/ʉ/) and the minimal (/ɛ/) spectral differences between long and short vowel allophones in the Swedish vowel inventory, the result indicates that the duration of the post-vocalic consonant functions as a general complementary cue to the perception of quantity type in Swedish.

The ambiguous contribution of correct consonant duration to naturalness for /ʉ/ can probably be accounted for by the naturalness already damaged by changing durations while leaving spectral properties intact. In the case of /ɛ/, the listeners were probably not disturbed by an incorrect vowel timbre, and could consequently appreciate the adjusted consonant duration more easily.

Since there is already ample evidence for the greater duration of stressed syllables in Swedish, it can be assumed that the duration of the post-vocalic consonant contributes to the perception of quantity and word stress, and in most cases to improved naturalness. This in turn makes it reasonable to regard both vowel and consonant duration as important properties when learning Swedish as a second language.
Abstract
In this paper, an experiment is reported which
was carried out to investigate gender differences in the ability to infer emotional content
from speech. Fourteen professional actors
(eight men, six women) produced simulated
emotional speech data representing the most
important basic emotions (three emotions in
addition to neutral). Each emotion was simulated when reading aloud a semantically neutral text. Fifty-one listeners (27 males, 24 females) were asked to listen to the speech samples and choose (among the four options) the
most appropriate emotional label describing
the simulated emotional state. The female listeners were consistently better at discriminating the emotional state from speech than the
male subjects. The results suggest that females
are emotionally more sensitive than males, as
far as emotion recognition from voice is concerned.
Introduction
Phoneticians, speech scientists and engineers
are taking increasing interest in the role of the
expression of emotion in speech communication. In addition to so-called basic emotions,
other global speaker-states are investigated, for
example, irritation and trouble in communication (Batliner et al. 2003). A major approach in
basic (phonetic) research has been to investigate the vocal parameters of specific emotions,
and these parameters are now understood relatively well. Nowadays, the role of the vocal expression of emotion is gaining increasing importance in the computer speech community,
for example, in the applied context of the automatic discrimination/classification of emotional
content from speech (ten Bosch 2003). It can be
argued that, after a long exploratory stage, the
study of the vocal expression of emotion is
reaching a level of maturity where the main focus is on important applications, particularly
those involving human-computer interaction.
In the study of the vocal communication of
emotion, an important taking-off point is the
base-line data, i.e. the human emotion discrimination performance level. There is now a
relatively large literature on the human
discrimination of emotions from speech:
reviewing over 30 studies of the subject
conducted up to the 1980s, Scherer (1989)
concludes that an average accuracy percentage
of about 60 % can be obtained in experiments
where listeners are to infer emotional content
from vocal cues only (without any help from
lexis etc.). In a recent large-scale cross-cultural
study (Scherer et al. 2001), an accuracy level
rate of 66 % was found, across emotions
(neutral, anger, fear, joy, sadness and surprise)
and cultural contexts (Europe, Asia and the
US). In a western cultural context, vocal
recognition of six emotions (neutral, anger,
fear, joy, sadness and disgust) was 62 %.
Typically, in investigations of the human
discrimination of emotions, a standard speech
sample (an utterance or a short passage) is
used: the same lexical content is produced (often by actors) with different simulated emotions
and test subjects are asked to choose the most
appropriate emotional label for each sample
(among the intended emotion categories). The
emotions investigated in these studies usually represent basic emotions: it is argued that certain emotions (at least fear, anger, happiness, sadness, surprise and disgust) are the most important or basic emotions, because they are seen to represent survival-related patterns of responses to events in the environment.
Although the vocal expression of emotions
has been investigated rather intensively, at least
as far as simulated data is concerned (and empirical evidence has cumulated indicating how
well basic emotions can be discriminated by
human listeners in different cultures), there has
been little research on inter-subject differences
(within a culture) in emotion discrimination
ability. Usually, the emotion recognition performance level of a group of test subjects is reported as a single numerical value, without
making any intra-group distinctions. Thus there
is very little reported empirical evidence concerning possible differences between female
Results
Tables 1-9 show the results of the experiment:
the emotion discrimination performance of the
subjects is first presented in toto (female and
male subjects listening to female and male
speakers), and then the results are broken down
into sub-categories (females listening to all
speakers, females listening to females only,
etc.). Each table is a confusion matrix, where
the column on the left indicates the intended
emotions and the rows indicate the recognized
emotions. The underlined percentages indicate
the average discrimination accuracy for each
specific emotion. The average emotion recognition performance level in each setting is given
as the TOTAL percentage.
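The per-emotion accuracies (the underlined diagonal cells) and an overall score can be read off such a matrix programmatically; the sketch below uses the three rows of Table 1 that survive in this copy and takes the unweighted mean as the overall figure (the paper's TOTAL may be computed differently):

```python
# Rows: intended emotion; columns: recognised emotion (row percentages
# from Table 1; the neutral row is incomplete in this copy and omitted).
cols = ["Neutral", "Sad", "Angry", "Happy"]
rows = {
    "Sad":   [12.9, 85.3, 1.0, 0.8],
    "Angry": [14.9, 2.9, 76.9, 5.3],
    "Happy": [24.3, 5.4, 3.3, 67.0],
}

# Diagonal cell = percentage of samples whose intended emotion was recognised.
accuracy = {emo: vals[cols.index(emo)] for emo, vals in rows.items()}
overall = sum(accuracy.values()) / len(accuracy)
print(accuracy, round(overall, 1))
```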
Speech data
For the purposes of the present study, simulated
emotional speech data was collected. Fourteen
professional actors (eight men, six women)
produced the speech data. The speakers were
aged between 26 and 50 (average age was 39);
all were speakers of the same northern variety
of Finnish. The speakers read out a phonetically rich Finnish passage of some 120 words
simulating three basic emotions, in addition to
neutral: sadness, anger and happiness/joy. The
text was emotionally completely neutral, representing matter-of-fact newspaper prose. The
recordings were made in an anechoic chamber
using a high quality condenser microphone and
a DAT recorder to obtain a 48 kHz, 16-bit recording. The data was stored in a PC as wav
format files. Each monologue was divided into
five consecutive segments of equal duration for
discrimination experiment purposes: thus there
were a total of 280 emotional speech samples
with an average duration of 13 seconds (five
samples for four emotions by fourteen speakers).
Human discrimination experiments

A performance test for human emotion discrimination was performed in the form of listening tests. The listeners were students in a junior high school, aged between 14 and 15. Fifty-one subjects (27 males, 24 females) participated as volunteers. All were speakers of the same northern variety of Finnish (as were the actors). The listening tests took place in a classroom where the subjects heard the speech data (280 samples) …

Table 1. Emotion discrimination from voice: females and males listening to females and males.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.6 %    2.1 %
Sad         12.9 %    85.3 %   1.0 %    0.8 %
Angry       14.9 %    2.9 %    76.9 %   5.3 %
Happy       24.3 %    5.4 %    3.3 %    67.0 %
Table 2. Emotion discrimination from voice: females and males listening to males.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.9 %    2.2 %
Sad         14.7 %    83.9 %   1.1 %    0.4 %
Angry       14.6 %    1.2 %    78.2 %   6.0 %
Happy       26.2 %    5.3 %    3.6 %    64.9 %
120
Table 3. Emotion discrimination from voice: females and males listening to females.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.2 %    2.0 %
Sad         10.5 %    87.1 %   0.9 %    1.4 %
Angry       15.3 %    5.3 %    75.2 %   4.2 %
Happy       21.7 %    5.5 %    3.0 %    69.8 %

Table 4. Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        3.7 %    2.5 %
Sad         15.9 %    81.8 %   1.8 %    0.5 %
Angry       16.4 %    1.8 %    74.9 %   6.9 %
Happy       28.3 %    5.6 %    4.7 %    61.4 %
Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        3.3 %    2.4 %
Sad         14.4 %    83.0 %   1.6 %    1.0 %
Angry       16.8 %    3.8 %    73.3 %   6.1 %
Happy       26.8 %    6.0 %    4.3 %    62.8 %
Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.0 %    1.9 %
Sad         13.2 %    86.3 %   0.2 %    0.2 %
Angry       12.5 %    0.5 %    82.0 %   5.1 %
Happy       23.9 %    4.9 %    2.4 %    68.8 %
Table 5. Emotion discrimination from voice: females listening to females and males.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        1.8 %    1.8 %
Sad         11.1 %    87.9 %   0.3 %    0.7 %
Angry       12.7 %    1.9 %    81.1 %   4.3 %
Happy       21.4 %    4.6 %    2.2 %    71.7 %

Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.7 %    2.3 %
Sad         12.5 %    84.6 %   1.3 %    1.6 %
Angry       17.3 %    6.5 %    71.2 %   4.9 %
Happy       24.9 %    6.6 %    3.9 %    64.7 %
Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        1.6 %    1.7 %
Sad         8.2 %     90.1 %   0.5 %    1.3 %
Angry       12.9 %    3.8 %    80.0 %   3.3 %
Happy       18.2 %    4.2 %    2.0 %    75.6 %
… than the male listeners/speakers is not surprising. A relevant concept in this context may in fact be empathy. Psychological research (see e.g. Tannen 1991) has shown that female superiority in empathizing is manifested in interaction by, for example, the following trends: females' speech involves much more direct talk about feelings and affective states than "guy talk", females are usually more co-operative and reciprocal in conversation than males, and females are much quicker to respond empathically/emotionally to the distress of other people. It has been shown that, from birth, females look longer at faces, and particularly at people's eyes, while males are more prone to look at inanimate objects (Connellan et al. 2001).
The results of this study support the consensus view that, emotionally, females are more
sensitive than males; this time concrete evidence is presented for the vocal (prosodic, nonlexical) communication of emotion. To draw
more far-reaching conclusions, however, we
need more speakers to produce the speech data,
so that we can exclude the possible effect of
speaker-specific idiosyncrasies on the results of
the listening tests.
References

Batliner A., Fischer K., Huber R., Spilker J. and Nöth E. (2003) How to find trouble in communication. Speech Communication 40, 117-143.

Connellan J., Baron-Cohen S., Wheelwright S., Batki A. and Ahluwalia J. (2001) Sex differences in human neonatal social perception. Infant Behavior and Development 23, 113-118.

Scherer K. R. (1989) Vocal correlates of emotion. In Wagner H. and Manstead A. (eds.) Handbook of Psychophysiology: Emotion and Social Behavior, 165-197. London: Wiley.

Scherer K. R., Banse R. and Wallbott H. G. (2001) Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology 32, 76-92.

Tannen D. (1991) You Just Don't Understand: Women and Men in Conversation. London: Virago.

ten Bosch L. (2003) Emotions, speech and the ASR framework. Speech Communication 40, 213-225.
Abstract
Experimental methodology

The speech material under investigation consists of disyllabic nonsense words in the context of a meaningful carrier phrase. The words have a CVCV segmental structure where the first vowel (V) is one of the five Greek vowels, i.e. {i, e, a, o, u}, in the carrier phrase "to klab sVsa pezi kali musiki" ('The club sVsa plays good music'). The nonsense key words were produced with lexical stress either on the first or the second syllable, and the speech material was produced at normal tempo with no prosodic break, on an individual basis.
The speakers were six persons with cerebral
palsy dysfunction and six persons with no
known pathologies (henceforth called the mobility factor) with standard Athenian Greek
pronunciation. Each group was comprised of
three female and three male speakers.
Acoustic analysis was carried out with the
use of Wavesurfer and measurements were
made of the vowel durations from the waveform. The results were subjected to statistical
analysis with the StatView software package
and ANOVA tests were carried out.
In the remainder of this paper, the results
are presented next, followed by discussion and
conclusions.
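A minimal sketch of one such test (synthetic durations; the group means are invented, and the study itself used StatView rather than the library below):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Synthetic vowel durations (ms) for the two mobility groups.
cp = rng.normal(140.0, 25.0, 300)        # cerebral palsy group
control = rng.normal(115.0, 25.0, 300)   # no known pathologies

f, p = f_oneway(cp, control)  # one-way ANOVA, df = (1, 598)
print(p < 0.001)
```

The full design would add gender, stress and vowel category as factors, which calls for a multi-way ANOVA rather than this one-way comparison.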
Introduction
This is an experimental investigation of vowel
durations in Greek as a function of mobility,
gender, stress and vowel category. Two main
questions are addressed: (1) what are the effects
of the investigated factors? And (2) what are
the interactions between the factors?
Considerable research has been carried out
on Greek and contrastive prosody with regards
to temporal structures and vowel durations (see
e.g. Fourakis, 1986, Botinis, 1989). Thus, different vowels have different intrinsic durations, according to which low vowels are longer than high vowels, and back vowels tend to be longer than front vowels (Fourakis et al., 1999). Stress
has a temporal effect on vowels, according to
which stressed vowels are longer than unstressed vowels (Botinis, 1989, Botinis et al.,
2001, 2002). Gender has also a temporal effect
on vowels and thus vowels produced by female
speakers are longer than vowels produced by
male speakers. The effect of gender is a language-specific effect as it has been reported for
some languages, such as Greek and Albanian,
but not for others, such as English and Ukrainian (Botinis et al., 2003).
Our knowledge with regards to pathological
speech and cerebral palsy mobility dysfunction
is very limited and the main target thus of the
present investigation is to produce basic data
and initialize research on speech produced by
speakers with various pathologies.
Results
The results are presented in Figures 1-6, based
on the acoustic analysis and duration measurements of the total speech material in accordance
with the experimental methodology.
Figure 1 (next page) shows overall vowel
durations as a function of mobility and gender.
Vowels produced by speakers with cerebral
palsy were significantly longer than vowels
produced by speakers with no pathologies
(F(1,596)=40.08, p<.001). Vowels produced
by female speakers were longer than vowels
produced by male speakers (F(1,596)=14.18,
p<.001). The interaction was not significant.
[Figures 1-6: bar charts of vowel durations (ms) by mobility (cerebral palsy vs. normal), gender (female vs. male) and stress (stressed vs. unstressed); graphics not reproduced. Recoverable captions: Figure 3. Overall vowel durations (ms) as a function of stress. Figure 6. Individual vowel durations (ms) as a function of gender and stress.]
Discussion
In the present investigation, some old knowledge has been corroborated and some new
knowledge has been produced. The old knowledge concerns the vowel category durations as
well as the effects of stress and gender on
vowel durations and the new knowledge concerns the effects of cerebral palsy dysfunction
on vowel durations.
Our results indicate that the investigated factors of mobility, gender and stress have significant effects on vowel durations …
Conclusions

In accordance with the results of the present investigation, the following conclusions have been drawn: First, the mobility, gender and stress factors each have a significant effect on vowel durations. Second, there are significant interactions between mobility and lexical stress as well as between gender and lexical stress, but not between mobility and gender. Thus, both the mobility factor and the gender factor have considerably bigger effects on stressed syllables than on unstressed ones.
Acknowledgements
Our sincere thanks to the speakers with cerebral
palsy dysfunction as well as to our students at
the University of Athens for their participation
in the production experiments.
References

Beckman M. E. (1986) Stress and Non-stress Accent. Dordrecht: Foris.

Botinis A. (1989) Stress and Prosodic Structure in Greek. Lund University Press.

Botinis A., Bannert R., Fourakis M., and Dzimokas D. (2003) Multilingual focus production of female and male speakers. 6th International Congress of Greek Linguistics, University of Crete, Greece.

Botinis A., Bannert R., Fourakis M., and Pagoni-Tetlow S. (2002) Crosslinguistic segmental durations and prosodic typology. International Conference on Speech Prosody 2002, 183-186, Aix-en-Provence, France.

Botinis A., Fourakis M., Panagiotopoulou N., and Pouli K. (2001) Greek vowel durations and prosodic interactions. Glossologia 13, 101-123.

Botinis A., Fourakis M., and Prinou I. (1999) Prosodic effects on segmental durations in Greek. 6th European Conference on Speech Communication and Technology EUROSPEECH 1999, vol. 6, 2475-2478, Budapest, Hungary.

de Jong K. J. (2004) Stress, lexical focus, and segmental focus in English: patterns of variation in vowel duration. Journal of Phonetics 32, 493-516.

Di Cristo A., and Hirst D. J. (1986) Modelling French micromelody: analysis and synthesis. Phonetica 43, 11-30.

Fant G., Kruckenberg A., and Liljencrants J. (2000) Acoustic-phonetic analysis in Swedish. In Botinis A. (ed.) Intonation: Analysis, Modelling and Technology, 55-86. Dordrecht: Kluwer Academic Publishers.

Fant G., Kruckenberg A., and Nord L. (1991) Duration correlates of stress in Swedish, French and English. Journal of Phonetics 19, 351-365.

Fourakis M. (1986) An acoustic study of the effects of tempo and stress on segmental intervals in Modern Greek. Phonetica 43, 172-188.

Fourakis M., Botinis A., and Katsaiti M. (1999) Acoustic characteristics of Greek vowels. Phonetica 56, 28-43.

Sluijter A. (1995) Phonetic Correlates of Stress and Accent. The Hague: Holland Academic Graphics.
Abstract.
An acoustic study was carried out to determine whether the phenomenon of pharyngealization and/or velarization is confined to the emphatic consonant and the adjacent vowels, or whether it extends over the whole word in Arabic. F1 and F2 (Hz) of front unrounded vowels were measured in monosyllabic, bisyllabic and trisyllabic words of ISA (Iraqi Spoken Arabic) containing emphatic vs. non-emphatic consonants. The measurements showed significantly greater narrowing between F1 and F2 for vowels in the vicinity of emphatic consonants than for those in the vicinity of non-emphatic consonants. This is attributed to the secondary coarticulatory configuration formed in the pharyngeal region by the projection of the root of the tongue toward the back wall of the pharynx, and a possible lowering of the velum toward a rising tongue dorsum, which prevails, though at different levels of significance, over the other syllables of the word.
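The significance values in the tables below come from two-tailed Mann-Whitney U-tests on the F2-F1 measurements. As a sketch of how such an exact test can be computed for small samples, here is a minimal Python implementation; the token values and the group size of six are hypothetical (the paper reports only the resulting statistics). Note that with six tokens per group, the smallest attainable exact two-tailed p is 2/C(12,6) = 2/924, which rounds to the 0.002 recurring in the tables if six tokens per condition were indeed used (an assumption).

```python
from itertools import combinations
from math import comb

def u_statistic(x, y):
    """Mann-Whitney U for sample x: number of pairs (xi, yj) with
    xi > yj, counting ties as 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def mann_whitney_exact(x, y):
    """Exact two-tailed Mann-Whitney U test: enumerate every split of
    the pooled values into groups of the original sizes and count the
    splits whose U deviates from its mean at least as much as the
    observed U does."""
    pooled = list(x) + list(y)
    n1, n = len(x), len(x) + len(y)
    mean_u = n1 * len(y) / 2.0
    dev = abs(u_statistic(x, y) - mean_u)
    extreme = 0
    for idx in combinations(range(n), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(n) if i not in chosen]
        if abs(u_statistic(g1, g2) - mean_u) >= dev - 1e-9:
            extreme += 1
    return u_statistic(x, y), extreme / comb(n, n1)

# Hypothetical F2-F1 (Hz) tokens for one word pair: vowels next to an
# emphatic consonant show a much smaller F2-F1 than the plain tokens.
emphatic = [560, 575, 580, 590, 598, 602]
plain = [880, 885, 890, 896, 902, 910]
u, p = mann_whitney_exact(emphatic, plain)
print(u, round(p, 3))  # 0.0 0.002  (p = 2/924, the exact floor)
```

The enumeration is feasible only for small groups (C(12,6) = 924 splits here); for larger samples one would use the normal approximation instead.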
Monosyllabic words, F2-F1 (Hz) of V, emphatic vs. non-emphatic (U-test, 2-tailed):
1. /seef/ vs. /seef/: 1042 vs. 1571 (0.002, P<0.01)
2. /faad/ vs. /faad/: 582 vs. 896 (0.002, P<0.01)
Bisyllabic words, F2-F1 (Hz), emphatic vs. non-emphatic (U-test, 2-tailed):
1. /'saaib/ vs. /'saaib/: V1 665 vs. 1492 (0.002, P<0.01); V2 1492 vs. 1661 (0.04, P<0.05)
2. /'raakid/ vs. /'raakid/: V1 602 vs. 840 (0.002, P<0.01); V2 690 vs. 1161 (0.002, P<0.01)
3. /'faatir/ vs. /'faatir/: V1 747 vs. 898 (0.03, P<0.05); V2 824 vs. 1829 (0.001, P<0.01)

[Figure 1. F1 and F2 (Hz) of V1[a], V2[aa] and V3[i] in bisyllabic and trisyllabic words, compared with neutral F1/F2/F3 values, plotted against distance (mm).]
Trisyllabic words, F2-F1 (Hz), emphatic vs. non-emphatic (U-test, 2-tailed):
1. /ta'baaiir/ vs. /ta'baaiir/: V1 610 vs. 988 (0.002, P<0.01); V2 720 vs. 850 (0.009, P<0.01); V3 1818 vs. 1824 (0.8, P>0.05)
2. /fa'saail/ vs. /fa'saail/: V1 558 vs. 1008 (0.002, P<0.01); V2 627 vs. 931 (0.002, P<0.01); V3 1643 vs. 1776 (0.02, P<0.05)
3. /fa'raaid/ vs. /fa'raaid/: V1 536 vs. 695 (0.002, P<0.01); V2 500 vs. 683 (0.002, P<0.01); V3 806 vs. 1245 (0.002, P<0.01)
In Table 3, however, the picture is slightly different for trisyllabic words. The first word, /ta'baaiir/, for example, where the emphatic con-
Acknowledgements
I would like to thank Mohammed and Day Majeed for acting as informants for the speech material and to Raghad Majeed for her help in
bringing figure 1 to its final shape.
References
Abercrombie, D. (1967) Elements of General Phonetics. Edinburgh University Press.
Ali, L. H. and Daniloff, R. E. (1972) A cinefluorographic phonological investigation of emphatic sounds assimilation in Arabic. Proceedings of the 7th International Congress of Phonetic Sciences, Montreal 1971. Mouton, 639-648.
Fant G. (1958) Modern instruments and methods for acoustic studies of speech. Proceedings of the 8th International Congress of Linguistics, 282-358. Oslo University Press.
Fant G. (1960) Acoustic Theory of Speech Production. The Hague: Mouton.
Firth J. R. (1957) Sounds and prosodies. In Papers in Linguistics, 121-138. London: Oxford University Press.
Hassan Z. M. (1981) An experimental study of vowel duration in Iraqi Spoken Arabic. Unpublished Ph.D. thesis, Dept. of Linguistics & Phonetics, University of Leeds, U.K.
Hassan Z. M. (2002) Gemination in Swedish and Arabic with a particular reference to the preceding vowel duration: an instrumental and comparative approach. Proceedings of Fonetik 2002, TMH-QPSR 44, 81-85.
Hyman L. M. (1975) Phonology: Theory and Analysis. Holt, Rinehart and Winston.
Odisho, E. Y. (1975) The phonology and phonetics of Neo-Aramaic as spoken by the Assyrians in Iraq. Unpublished Ph.D. thesis, Dept. of Phonetics, University of Leeds.
Peterson, G. E. and Lehiste, I. (1960) Duration of syllable nuclei in English. Journal of the Acoustical Society of America 32 (6), 693-703.
Watson, Janet C. E. (2002) The Phonology and Morphology of Arabic. Oxford University Press.
Abstract
This paper refers to the forthcoming ISCA (International Speech Communication Association) Workshop "Experimental Linguistics" in Athens, Greece, in 2006. The major objectives
of the Workshop are (1) the development of experimental methodologies in linguistic research
with new perspectives for the study of language, (2) the unification of linguistic knowledge
in relation to linguistic theory based on experimental evidence, (3) the design of multifactor linguistic models and (4) the integration of
interdisciplinary expertise in linguistic applications. Key knowledge areas ranging from cognition and neurophysiology to perception and
psychology are organised in invited lectures as
well as in oral and poster presentations along
with interdisciplinary panel discussions.
Background
The present paper refers to the forthcoming workshop "Experimental Linguistics", which is
an ISCA (International Speech Communication
Association) interdisciplinary workshop, to be
held in Athens, Greece, in 2006 (workshop details along with paper submission procedures
will be announced later this year). The workshop is organised under the joint aegis of the
University of Athens, Department of Linguistics, Greece, The Ohio State University, Department of Speech and Hearing Science, USA,
and the University of Skövde, School of Humanities and Informatics, Sweden.
The scientific study of language is the
backbone of a variety of established disciplines,
among them theoretical linguistics, experimental phonetics, computational linguistics and
language technology. Language is a complex
code system, the study of which is related to a
variety of knowledge areas ranging from cognition and neurophysiology to perception and
psychology. The code system of language consists of functional categories in variable combinations and relations with multiple interactions,
which determine the linguistic structure and the
communicative function of language.
of motor commands from the brain to, and coordinated control of, the speech production
mechanism. The acoustic signal is processed
and decoded by the auditory system and the
perceptual mechanism, which ultimately extract
the intended meaning. Consequently, the mutual relation between acoustic signal and intended meaning may be considered the very
basic functional anchoring of linguistic structure and language communication.
There are however several discrepancies between acoustic signal and intended meaning,
the most characteristic of which are the continuous vs. discrete, as well as the one-to-many,
relationships between the two. Thus, the acoustic signal is basically a continuous process
whereas the intended meaning is a structural
unit which consists of discrete functional categories. On the other hand, some variations of
the acoustic signal, even large ones, may have
no functional effect, whereas other variations,
even small ones, may have critical effects on
the transmission of intended meanings. Also,
the same parameter of the acoustic signal may
have different effects, whereas different acoustic parameters may have the same effect on intended meanings in different contexts.
A typical example of the case above is segmental categories where duration or aspiration
parameters may independently or in combination determine variable distinctions such as in
stop consonants, and may have critical effects
on produced words and thus meanings in a variety of languages. Duration may determine
several distinctions, such as lexical stress and
other prosodic categories, which also have
critical effects on produced meanings. Sentence
types and intonation forms are also typical examples in which, independently or in combination with other linguistic markers, dissimilar
intonation forms may define the same sentence
type, and, inversely, different sentence types
may be defined by similar intonation forms in
different contexts.
In relation to the question above, "where is meaning derived from?", if we hypothesise that meaning is basically derived from the acoustic signal, we are consequently led to the question "how is meaning derived?". Linguistic theories
have historically posited a variety of functional
linguistic units such as phonemes, morphemes,
phrases and so forth, which the acoustic signal
is presumably organised into. However, how
much are these units determined by the acoustic
signal and how much by psycholinguistic proc-
Organisation
The Workshop is organized into invited speaker
lectures, original research presented in oral and
poster
presentations, and interdisciplinary
panel discussions. Both oral and poster papers
will undergo standard peer review by independent reviewers of the International Scientific Committee, and a post-workshop volume with representative papers from key knowledge areas is planned, in addition to the published proceedings that are common practice for ISCA workshops.
In accordance with the objectives and perspectives of the Workshop, key knowledge areas and language applications are shown in Table 1.
Table 1. Key linguistic areas and language applications of the Athens 2006 ISCA Workshop on Experimental linguistics.
1. Theory of language
2. Cognitive linguistics
3. Neurolinguistics
4. Speech production
5. Speech acoustics
6. Phonology
7. Morphology
8. Syntax
9. Prosody
10. Speech perception
11. Psycholinguistics
12. Pragmatics
13. Semantics
14. Discourse linguistics
15. Computational linguistics
16. Language technology
Key linguistic areas and language applications are organized into linguistic domains, rather than by theoretical premises and analysis methodologies, in order to allow for interdisciplinary approaches and interaction perspectives. Special attention is paid to the way we see and study language with reference to general linguistic theory, and to the relation and interaction between phonetic production and produced meaning. Extensive discussions of the theoretical assumptions of key linguistic areas as related to models and experimental methodol-
Outlook
Language has been studied throughout human
history with various, sometimes overlapping,
perspectives and objectives, and with various
applications in mind. The pursuit of linguistic
knowledge has been the driving force behind
the study of language as one of the most defining characteristics of human beings. Language
has been the primary means of communication
in human societies and there have been several
applications critical to the development of human societies as we know them.
Among these applications, three have
marked the route of human societies. First,
early insights in basic forms and functions of
phonetic systems led to the development of
writing systems and the acquisition of reading
and writing skills as part of educational systems, probably the most fundamental language
application. Second, basic knowledge of voice
characteristics and voice signal transmission led
to the development of telephone systems and
distant voice communication. Third, the growth
of information technology and the advent of the
Internet paved the way for a variety of language
applications. Thus, today, linguistic research and language technology set common goals and make joint efforts toward multifunctional language applications. However, the main precondition for achieving these goals and bringing these efforts to fruition is that solid theoretical knowledge meets basic scientific criteria and reflects linguistic reality.
As every era sets its conditions, our era sets
additional requirements and alternative methodologies for the study of language. A new
generation of linguists is to be educated and
equipped with a rich arsenal of experimental
methodologies and new perspectives in linguistic research, and the present Workshop is organised in order to discuss and set the groundwork for this in an interdisciplinary context.
Abstract
The most favoured solution to the problem of
quantity complementarity in Swedish has been
to claim that vowel length is phonemic and
consonant length is predictable (Linell, 1978).
Evidence from listeners' perceptual behaviour
supports this over the reverse claim that it is
only consonant length which is distinctive
(cited in Czigler, 1998), a position that has
nevertheless been argued for (Eliasson & La
Pelle, 1973). However, there is a phonological
cost: the vowel inventory must be doubled. We
present an analysis based on positional criteria
to account for the phonetic facts reported in
instrumental studies such as Czigler (1998),
Hassan (2002), Strangert & Wretling (2003),
without the cost of additional phonemes. It
takes Trubetzkoy's (1969) correlation of syllable contact and develops it according to more
recent functionalist principles of phonotactic
analysis (Mulder, 1989; Dickins, 1998). Vowel
and consonant length are predicted by whether
there is a consonant in the phonotactic position
immediately following the syllable nucleus.
Quantity complementarity in Swedish is compared to vowel and consonant length in Arabic
and shown to bear out Hassan's (2003: 48) assertion that "the phenomenon of length constitutes a systematic difference between the phonological systems of both languages".
lam [lɑːm] 'lame' vs. lamm [lamː] 'lamb'
väg [vɛːɡ] 'road' vs. vägg [vɛɡː] 'wall'
[...] 'buy' vs. köpt [...] 'bought'
[Phonotactic positions: onset | nucleus | postnuclear1 | postnuclear2]
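The positional analysis can be sketched as a toy rule (a hypothetical illustration under our reading of the abstract, not the authors' formalism): in a stressed syllable, the nucleus vowel is long exactly when nothing fills the position immediately following it (postnuclear1), and a consonant occupying postnuclear1 surfaces long. The `Syllable` data model below is invented for the sketch.

```python
from typing import Optional, NamedTuple

class Syllable(NamedTuple):
    """A stressed-syllable parse into the four phonotactic positions
    above (a simplified, hypothetical data model)."""
    onset: Optional[str]
    nucleus: str
    postnuclear1: Optional[str]
    postnuclear2: Optional[str]

def predict_quantity(syl: Syllable) -> dict:
    """Positional prediction of Swedish quantity in stressed syllables:
    the nucleus vowel is long exactly when postnuclear1 is empty; a
    consonant occupying postnuclear1 is realised long."""
    if syl.postnuclear1 is None:
        return {"vowel": "long", "long_consonant": None}
    return {"vowel": "short", "long_consonant": syl.postnuclear1}

# lamm 'lamb': /m/ in postnuclear1 -> short vowel, long consonant
lamm = Syllable("l", "a", "m", None)
# lam 'lame': postnuclear1 empty, /m/ in postnuclear2 -> long vowel
lam = Syllable("l", "a", None, "m")
print(predict_quantity(lamm))  # {'vowel': 'short', 'long_consonant': 'm'}
print(predict_quantity(lam))   # {'vowel': 'long', 'long_consonant': None}
```

Under this rule no extra long-vowel phonemes are needed: quantity falls out of which position the postnuclear consonant occupies, which is the point of the analysis.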
Notes
1. By "stressed syllable" we mean one that is not unstressed, i.e. it includes secondary stress (or what has been called "reduced main stress") as well as primary stress.
2. Witting (1977) cites moln 'cloud' as an example of CV:CC to argue that vowel length is not predictable when followed by a cluster, but the pronunciation [mo:ln] is described as a regional exception by Czigler (1998: 23) and marked as an exception by Linell (1978: 123); we therefore discount it.
References
Behne, D.M., Czigler, P.E. & Sullivan, K.P.H. (1996) Acoustic characteristics of perceived quantity in Swedish vowels. Speech Science and Technology 96 (Adelaide), 49-54.
Behne, D.M., Czigler, P.E. & Sullivan, K.P.H. (1998) Perceived vowel quantity in Swedish: effects of postvocalic voicing. Proceedings of the 16th International Congress of Acoustics and the 135th Meeting of the Acoustical Society of America (Seattle), 2963-64.
Czigler, P.E. (1998) Timing in Swedish VC(C) sequences. PHONUM 5, Dept of Phonetics, Umeå University.
Dickins, J. (1998) Extended Axiomatic Linguistics. Berlin: Mouton de Gruyter.
Elert, C.-C. (1964) Phonologic Studies of Quantity in Swedish. Uppsala: Almqvist & Wiksell.
Eliasson, S. (1978) Swedish quantity revisited. In Gårding, E., Bruce, G. & Bannert, R. (eds) Nordic Prosody. Dept of Linguistics, Lund University, 111-122.
Eliasson, S. & La Pelle, N. (1973) Generativa regler för svenskans kvantitet. Arkiv för nordisk filologi 88, 133-148.
Giegerich, H.J. (1992) English Phonology. Cambridge: Cambridge University Press.
Ghalib, G.B.M. (1984) An experimental study of consonant gemination in Iraqi Spoken Arabic. Unpublished PhD thesis, University of Leeds.
Author index

Ayusawa, Takako 103
Asu, Eva Liina 29
Bannert, Robert 75
Bjursäter, Ulla 55
Blomberg, Mats 51
Bodén, Petra 37
Bonsdroff, Lina 59
Botinis, Antonis 95, 99, 123, 131
Cerrato, Loredana 41
Charalabakis, Christoforos 131
Edlund, Jens 107
Eklund, Petra 63
Elenius, Daniel 51
Engstrand, Olle 59, 63, 67
Fourakis, Marios 123, 131
Fransson, Linna 79
Ganetsou, Stella 95
Gawronska, Barbara 131
Griva, Magda 95
Gunnarsdotter Grönberg, Anna 5
Gustafsson, Kerstin 63
Gustavsson, Lisa 83
Hassan, Zeki Majeed 127, 135
Heselwood, Barry 135
Hincks, Rebecca 45
House, David 107
Huber, Dieter 49
Ivachova, Ekaterina 63
Jande, Per-Anders 25
Jensen, Christian 111
Kadin, Germund 67
Karlsson, Fredrik 71
Karlsson, Åsa 63
Kim, Yuni 9
Klintfors, Eeva 83
Kostopoulos, Yannis 99
Krull, Diana 33
Lacerda, Francisco 55, 83
Lindh, Jonas 17, 21
Nagano-Madsen, Yasuko 103
Nikolaenkova, Olga 99
Nolan, Francis 29
Oppelstrup, Linda 51
Orfanidou, Ioanna 123
Schaeffler, Felix 1
Schötz, Susanne 87
Segerup, My 13
Seppänen, Tapio 119
Skantze, Gabriel 107
Strangert, Eva 79
Stölten, Katrin 91
Sundberg, Ulla 55
Themistocleous 99
Thorén, Bosse 115
Toivanen, Juhani 119
Tøndering, John 111
Väyrynen, Eero 119