FONETIK 2005
The XVIIIth Swedish Phonetics Conference
May 25–27, 2005
Department of Linguistics
Göteborg University
Preface
This volume contains the contributions to FONETIK 2005, the Eighteenth Swedish Phonetics Conference, organized by the Phonetics group at Göteborg University on May 25–27, 2005. The papers appear in the order in which they were presented at the conference.
Only a limited number of copies of this publication have been printed, for distribution among the authors and those attending the conference. Electronic versions of the contributions are available at:
http://www.ling.gu.se/konferenser/fonetik2005/
We would like to thank all contributors to the Proceedings. We are also indebted to
Fonetikstiftelsen for financial support.
Åsa Abelin
Jonas Lindh
Previous Swedish Phonetics Conferences:

1986  Uppsala University
1988  Lund University
1989  KTH Stockholm
1990  Umeå University (Lövånger)
1991  Stockholm University
1992  Chalmers and Göteborg University
1993  Uppsala University
1994  Lund University (Höör)
1995  (XIIIth ICPhS in Stockholm)
1996  KTH Stockholm (Nässlingen)
1997  Umeå University
1998  Stockholm University
1999  Göteborg University
2000  Skövde University College
2001  Lund University
2002  KTH Stockholm
2003  Umeå University (Lövånger)
2004  Stockholm University
Contents

Dialectal, regional and sociophonetic variation

Phonological quantity in Swedish dialects: A data-driven categorization
  Felix Schaeffler  1

Poster session

Phonological interferences in the third language learning of Swedish and German (FIST)
  Robert Bannert  75

Word accents over time: comparing present-day data with Meyer's accent contours
  Linnea Fransson and Eva Strangert  79

Effects of stimulus duration and type on perception of female and male speaker age
  Susanne Schötz  87

Speech perception

Prosodic features in the perception of clarification ellipses
  Jens Edlund, David House, and Gabriel Skantze  107

Speech production

Vowel durations of normal and pathological speech
  Antonis Botinis, Marios Fourakis and Ioanna Orfanidou  123

Acoustic evidence of the prevalence of the emphatic feature over the word in Arabic
  Zeki Majeed Hassan  127

Closing discussion

Athens 2006 ISCA Workshop on Experimental Linguistics
  Antonis Botinis, Christoforos Charalabakis, Marios Fourakis and Barbara Gawronska  131

Author index  139
Phonological quantity in Swedish dialects: A data-driven categorization
Felix Schaeffler

Abstract
This study presents a data-driven categorisation (cluster analysis) of 86 Swedish dialects,
based on durational measurements of long and
short vowels and consonants. The study reveals
a clear geographic distribution that, for the
most part, corresponds with dialectological descriptions. For a minor group of dialects,
however, the results suggest mismatches
between the quantity system and the observed
segment durations. This phenomenon is discussed with reference to a theory of quantity
change (Labov 1994).
Introduction
Phonological quantity in Standard Swedish is
usually described as being complementary:
Long vowels in closed syllables are followed
by short consonants, while short vowels are always followed by a long consonant or a consonant cluster.
This modern system has developed from a
quantity system with independent vowel and
consonant quantity, where all four possible
combinations of long and short segments (VC,
V:C, VC: and V:C:) existed. The modern system evolved through shortening of V:C: and lengthening of VC structures. Not all dialects of Modern Swedish have completed this change. Some dialects kept the four-way distinction. This applies to a group of dialects in the Finnish-Swedish region and in Dalarna in Western
Sweden. Another group of dialects abandoned
V:C: successions but kept VC structures. This
has mainly been reported for large parts of
Northern Sweden, but also for some places in
Middle Sweden.
There are, thus, today at least three different
quantity systems in the dialects of Modern
Swedish: 4-way-systems (VC, V:C, VC: and
V:C:), 3-way-systems (VC, V:C and VC:) and
2-way-systems with complementary quantity
(V:C and VC:).
Data-driven categorisation
The method of choice in this study was a hierarchical cluster analysis with Euclidean distance.
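As an illustration of the method, a hierarchical clustering of dialects by their mean segment durations might look like the following sketch. The duration values, the two-cluster cut and the Ward linkage choice are invented for illustration; they are not the study's actual data or settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy mean durations (ms) per dialect; columns: V:, V, C:, C
# (values invented -- the study's 86-dialect data set is not reproduced here)
durations = np.array([
    [180, 90, 160, 70],    # dialect A
    [178, 92, 162, 68],    # dialect B
    [140, 95, 200, 110],   # dialect C
    [142, 93, 198, 112],   # dialect D
])

# agglomerative clustering on Euclidean distances between duration profiles
Z = linkage(durations, method="ward", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```

Inspecting the dendrogram built from `Z` (e.g. with `scipy.cluster.hierarchy.dendrogram`) is the visual step the paper refers to when choosing the number of clusters.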
Results
The visual inspection of the cluster dendrogram suggested a four-cluster solution as appropriate for the current analysis. Additionally, the parameter η² was calculated, which is usually used in analyses of variance to describe the amount of explained variance, but has also been suggested as a criterion for the evaluation of different cluster solutions (Timm, 2002). The four-cluster solution led to an η² value of 0.68. For comparison, the three-cluster solution showed an η² of 0.32, and a ten-cluster solution an η² of 0.86. The large increment in η² from the three- to the four-cluster solution, followed by comparatively low increments, supported the four-cluster solution.
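The η² criterion can be computed directly as the ratio of between-cluster to total sum of squares. The sketch below shows the calculation on invented duration profiles; the values do not correspond to the study's data.

```python
import numpy as np

def eta_squared(data, labels):
    """eta^2 of a cluster solution: between-cluster SS / total SS,
    i.e. the share of variance that cluster membership explains."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    grand_mean = data.mean(axis=0)
    ss_total = ((data - grand_mean) ** 2).sum()
    ss_between = sum(
        (labels == k).sum() * ((data[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
        for k in np.unique(labels)
    )
    return ss_between / ss_total

# two well-separated toy clusters of duration profiles (V:, V, C:, C)
X = [[180, 90, 160, 70], [182, 92, 158, 72], [178, 88, 162, 68],
     [140, 95, 200, 110], [142, 97, 198, 112], [138, 93, 202, 108]]
print(round(eta_squared(X, [0, 0, 0, 1, 1, 1]), 2))  # → 0.99
```

For tight, well-separated clusters η² approaches 1; comparing the increment between successive solutions is the criterion used above.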
Durational characteristics of the clusters
Figures 2 and 3 show the distributions of the four segment durations across the four clusters in the form of box-plots. Figure 2 shows the distribution of V: and V durations, Figure 3 that of C: and C durations.
The Finnish cluster (4) is separated from the
other clusters by longer V: and V durations and
shorter C durations. C: durations in the Finnish
cluster (4) are close to the ones in the Northern
cluster (3), but clearly longer than those in the
Southern clusters (1) and (2). Consequently, the
Northern cluster (3), as well, is separated from
the Southern clusters (1) and (2) by longer C:
durations. However, the Northern cluster (3) shows rather long C durations, a clear difference from the short C durations in the Finnish cluster (4).
Geographic distribution
Figure 1 shows a stylised map of Sweden and
the Swedish parts of Finland. The colour-coding and the numbers show the geographic distribution of the four clusters.
The clusters show a clear geographic distribution. Cluster (4), n=7, separates all dialects
on the Finnish mainland from the rest of the
dialects. Cluster (3), n=23, is mainly restricted
Table 1: V:/C and V/C: duration ratios per cluster.

          V:/C    V/C:
Cl. (1)   1.15    0.53
Cl. (2)   1.13    0.57
Cl. (3)   1.00    0.41
Cl. (4)   1.83    0.44
Figure 2: V and V: durations per cluster. Grey: short vowels, white: long
vowels.
Figure 3: C and C: durations per cluster. Grey: short consonants, white: long
consonants.
Figure 4: V:/V and C:/C ratios per cluster. Grey: vowel ratios, white: consonant ratios.
Discussion
The cluster analysis showed results similar to
the analysis presented in Schaeffler (2005).
There are three main groups with a clear geographic distribution: A Finnish cluster, which
includes all dialects on the Finnish mainland, a
Northern cluster, mainly concentrated in Northern Sweden from Lappland to Gästrikland, and
two Southern clusters.
The consideration of two Southern clusters instead of one was motivated by a major increase in the value of η². The durational characteristics, however, do not present much support
for such a partitioning. All segments in cluster
(1) are longer than in cluster (2), which leads to
very similar segment ratios (see table 1 and figure 4). This, together with the lack of a clear
geographic distribution, suggests that speech
rate effects might be responsible for the difference.
The geographic distribution corresponds
with the traditional descriptions of the Swedish
dialects as outlined in the introduction. 4-way
distinctions are mainly found in the Finnish region, 3-way distinctions frequently in the
Northern regions and 2-way distinctions in the
Southern Swedish areas (see e.g. Wessén 1969,
Riad 1992). In Schaeffler (2005), the observed
durational differences were attributed to these
functional differences. In 4-way distinctions,
where V:C and VC: sequences contrast with VC
and V:C:, clear durational distinctions of vowels and consonants are expected. This corresponds with the observed durations for the
Finnish region. All other dialects, however,
show rather low consonant ratios, while a clear
durational difference between the vowels is
maintained.
A further aspect of the results deserves attention: The geographic distribution resulting
from the cluster analysis is almost too clear-cut.
According to dialectological descriptions, some
dialects in the Finnish cluster (4) show 3-way
systems (Ivars 1988), comparable to those in
the Northern Swedish regions. In spite of these
functional congruences between parts of Northern Swedish and Finland-Swedish, the segment
durations differ. There is, however, strong evid-
References
Bortz J. (1999) Statistik für Sozialwissenschaftler. 5th edition. Berlin: Springer.
Gordon A.D. (1999) Classification. 2nd edition. Boca Raton: Chapman & Hall.
Ivars A.-M. (1988) Närpesdialekten på 1980-talet. Std. i nordisk fonologi, volume 70.
Labov W. (1994) Principles of Linguistic Change, Vol. I. Cambridge: Blackwell.
Riad T. (1992) Structures in Germanic Prosody. Stockholm: Univ.
Schaeffler F. and Wretling P. (2003) Towards a typology of phonological quantity in the dialects of Modern Swedish. Proc. 15th ICPhS, Barcelona, 2697-2700.
Schaeffler F. (2005, forthcoming) Drawing the Swedish quantity map: from Bara to Vr. Proc. Nordic Prosody IX, Lund.
Strangert & Wretling (2003) Complementary quantity in Swedish dialects. Proc. Fonetik 2003, Umeå.
Timm N.H. (2002) Applied Multivariate Analysis. New York: Springer.
Wessén E. (1969) Våra folkmål. Lund: C E Fritzes Bokförlag.
Abstract
This paper presents main results from a Ph.D. thesis on sociolinguistic variation among students in an upper secondary school in Alingsås, a town of 25,000 northeast of Göteborg. Phonological variants are found to be associated with traditional local dialect, regional and supraregional standard, Göteborg vernacular, and general and Göteborg youth language.
Correlations with demogeographical areas generally show a pattern running from southwest to northeast (along the E20 highway and the railway from Göteborg). One area does not fit into the continuum: Sollebrunn (NW of Alingsås), where particularly female informants tend to use standard forms and innovations to a surprisingly high extent. Gender is the second most important social factor, but it operates in different ways: there are major differences from one social group to another when it comes to expressing gendered identity through linguistic means.
Method
The informants were categorized according to social variables representing different aspects of background and identity: gender, type of study programme (vocational, intermediate, preparatory for university), demogeographical area (based on the extent of urbanization in the five municipalities), Alingsås neighbourhood (divided on the basis of socio-economic factors), and lifestyle, based on a two-dimensional mapping (concerning taste, leisure, mobility, plans for the future, etc.). The lifestyle analysis both complements and includes traditional sociolinguistic variables.3
Eight linguistic variables were analyzed extensively: four phonological, two lexical and two morpho-phonological. Instances of variants in the recorded interviews were counted manually, and frequencies of variants were correlated statistically with social variables on a group level. Examples from analyses of three phonological variables will be used in the following discussion.
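The group-level correlation step can be illustrated with a simple contingency test. The counts and the chi-square choice below are illustrative assumptions, not the thesis's actual figures or its statistical procedure.

```python
from scipy.stats import chi2_contingency

# invented variant counts: rows = two demogeographical areas,
# columns = standard variant vs. local variant
counts = [[40, 10],   # area A: mostly standard
          [15, 35]]   # area B: mostly local

chi2, p, dof, expected = chi2_contingency(counts)
print(dof, p < 0.05)  # a clear area effect on variant choice
```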
Introduction
This article presents some of the core results from my Ph.D. thesis (Grönberg 2004), a study of sociolinguistic variation among students from five municipalities, all attending an upper secondary school in Alingsås, a town of 25,000 northeast of Göteborg, Sweden.1
The main aim of the thesis was to study covariation between linguistic variation and social identity, and to relate it to language and dialect change. A number of questions were raised, of which the following will be discussed here:
- To what extent does linguistic variation depend on the informant's orientation towards the place where (s)he lives or other places?
- How do the findings from the upper secondary school in Alingsås differ from results from comparable informants in Göteborg?
- Which social factors are most important for linguistic choices?
Material
The material studied consists of tape-recorded interviews with 97 students at the Alströmergymnasium, which at the time of recording in
[Figures: percentage distributions of variants by area (Alingsås, Herrljunga, Lerum, Gräfsnäs, Sollebrunn, Göteborg) for three phonological variables: diphthongized/standard/lowered U, fricativized/standard/lowered I/Y, and closed/open R.]
Discussion
Which social factors are most important for linguistic choices? Is it possible to identify groups
References
Bourdieu, Pierre (2002) [1984]. Distinction. A Social Critique of the Judgement of Taste. Reprint. London: Routledge & Kegan Paul Ltd.
Dahl, Henrik (1997). Hvis din nabo var en bil. En bog om livsstil. København: Akademisk Forlag A/S.
Grönberg, Anna Gunnarsdotter (2004). Ungdomar och dialekt i Alingsås. (Nordistica Gothoburgensia 27.) Göteborg: Acta Universitatis Gothoburgensis.
Grönvall, Camilla (2005). Lättgöteborgska i Kungsbacka. En beskrivning av några gymnasisters språk 1997. Göteborg. Unpublished manuscript.
Norrby, Catrin & Karolina Wirdenäs (1998). The Language and Music Worlds of High School Students. In: Pedersen, Inge Lise & Jann Scheuer (eds.), Sprog, Køn og Kommunikation. Rapport fra 3. Nordiske Konference om Sprog og Køn, København, 11.–13. Oktober 1997. København: C.A. Reitzels Forlag. S. 155–164.
Sandøy, Helge (1993). Talemål. Oslo: Novus.
Thelander, Mats (1979). Språkliga variationsmodeller tillämpade på nutida Burträsktal. Del 1 & 2. (Studia Philologiae Scandinavicae Upsaliensia 14:1 & 14:2.) Uppsala: Acta Universitatis Upsaliensis.
Ungdomsbarometern (1999). Livsstil & fritid. Stockholm: Ungdomsbarometern AB.
Westerberg, Anna (2004). Norsjömålet under 150 år. (Acta Academiae Regiae Gustavi Adolphi LXXXVI.) Uppsala: Kungl. Gustav Adolf Akademien för svensk folkkultur.
Notes
1. The Swedish upper secondary school, gymnasium, gives courses of three years' duration for students having completed nine years of
Abstract
Dialects of Swedish vary in the pronunciation of unstressed /e/ in different phonological environments. In this pilot study, Stockholm Swedish is compared with several Finland Swedish dialects. Stockholm and one Åland dialect lower and back /e/ before [n], while Helsinki and most Nyland dialects lower and back /e/ before [r]. The data provide evidence for the sociolinguistic relevance of unstressed vowel pronunciation.
Introduction
Stressed short [e] and [ɛ] are in complementary distribution in most Swedish dialects: the allophone [ɛ] occurs before [r], and [e] occurs in all other environments. In Finland Swedish, transcription conventions (e.g. in Harling-Kranck 1998) and informal reports by native speakers suggest that the same distribution may hold in unstressed syllables as well. Since it is not clear how widespread this phenomenon is, a pilot study was conducted to investigate the phonetics of unstressed /e/ across several dialects. The following questions were addressed: 1) How is unstressed /e/ pronounced in Stockholm Swedish? 2) Are unstressed [e] and [ɛ] in fact in complementary distribution in standard Helsinki Swedish? 3) Do rural Finland Swedish dialects pattern with Helsinki, or with Stockholm, or do they show their own patterns? Finally, 4) Can the regional differences be explained?
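The distribution under investigation can be stated as a one-line rule. The function below merely encodes the standard pattern for stressed short /e/ as a point of reference; the segment labels are illustrative.

```python
def expected_allophone(following_segment):
    """Standard complementary distribution of short /e/:
    [ɛ] before [r], [e] in all other environments."""
    return "ɛ" if following_segment == "r" else "e"

# the pilot study asks whether unstressed /e/ obeys the same rule
print(expected_allophone("r"), expected_allophone("n"), expected_allophone("t"))  # → ɛ e e
```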
Results
Stockholm
Unstressed /e/ in Stockholm Swedish was
generally realized as schwa, but a pattern
emerged for both Stockholm speakers that the
schwa had higher F1 and lower F2 when
preceding [n] than in other environments. There
is little overlap between the [n]-environment
tokens and the other tokens in the F2 vs. F1 plots
in Figs. 1 and 2.
[Figures 1 and 2: F2 vs. F1 plots of unstressed /e/ tokens for the two Stockholm speakers. Tokens are labelled E, N and R by following environment; axis values (Hz) omitted.]
Helsinki
The Helsinki newscasters had a very different pattern from Stockholm. Both speakers categorically lowered and backed unstressed /e/ before [r], as in Fig. 3. This result seems to confirm the existence of [e]~[ɛ] allophony in unstressed syllables, at least on a phonetic level.
[Figure 3: F2 vs. F1 plot for the Helsinki speakers; tokens labelled E, N and R. Axis values omitted.]
Nyland
In most rural villages of Nyland (the province where Helsinki is located), the Helsinki pattern obtains: before [r], unstressed /e/ approaches an [ɛ]-like pronunciation.
[Further F2 vs. F1 plots for the Nyland dialects; tokens labelled E, N and R. Axis values omitted.]
Acknowledgements
I would like to thank Leanne Hinton for valuable
advice and discussion. This research has been
supported by a Fulbright Grant and a Jacob K.
Javits Graduate Fellowship.
References
Bergroth H. (1917) Finlandssvenska: handledning till undvikande av provinsialismer i tal och skrift. Helsinki: Holger Schildts.
Harling-Kranck G. (1998) Från Pyttis till Nedervetil: tjugonio dialektprov från Nyland, Åboland, Åland och Österbotten. Helsinki: Svenska Litteratursällskapet.
Kuronen M. and Leinonen K. (2000) Fonetiska skillnader mellan finlandssvenska och rikssvenska. Svenskans beskrivning 24, 125–138. Linköpings universitet.
Labov W. (1994) Principles of Linguistic Change: Internal Factors. Oxford: Blackwell.
Background
According to the Swedish tonal typology (Gårding, 1973, with Lindblad 1975; Bruce & Gårding, 1978), the timing/alignment of the word accent gesture is essential to the Swedish word accent distinction.
The traditionally described word accent pattern of the West Swedish prosodic dialect type (see Bruce & Gårding, 1978) involves low pitch on the stressed vowel for accent 1 words and a peak on the stressed vowel for accent 2 words in focal position. Bruce (1998) has suggested that Gothenburg Swedish is characterized by two-peaked pitch contours for both word accents, with an earlier timing in accent 1. A previous production study (Segerup, 2004) confirms that Gothenburg Swedish accent 1 deviates from the generally accepted West Swedish accent 1 pattern through having a fall on the stressed vowel. Furthermore, the fall of accent 2 is only marginally later than that of accent 1, meaning that the expected timing difference between accent 1 and accent 2 is smaller than in other dialect types. Consequently, the overall shapes of the pitch contours are strikingly similar, and yet they are perceptually distinct (Segerup, 2004; Segerup & Nolan, forthc.).
Abstract
According to the conventional wisdom, the word accent distinction in Swedish (dialects) is maintained chiefly by a difference in the timing of the word accent gesture (Gårding, 1973). Gothenburg Swedish, however, does not conform to the norm, since both pitch height and timing contribute to the word accent distinction in this dialect (Segerup, 2004). In Gothenburg Swedish both word accents have a fall on the stressed vowel, which makes the pitch contours strikingly similar (Segerup, 2004).
Up until now, the material investigated has consisted of contrastive words in which the stressed vowel is phonologically long. In the present production study we proceed, for comparison, with word-pairs in which the stressed vowel is phonologically short. The acoustic analysis involves measurements of fundamental frequency (F0) and segment durations in five speakers' productions of seven word-pairs altogether.
The results show a significant difference in the duration of the short stressed vowel between accent 1 and accent 2, and thus that word accent has an effect on vowel duration.
INTRODUCTION
The present paper investigates the interaction
between word accent and quantity in Gothenburg
Swedish. Minimal pairs with accent 1 and accent 2
and with either long or short stressed vowel are
examined. How are pitch height and timing affected
when the voiced portion of the syllable is minimized
by having a short vowel followed by a voiceless
consonant as opposed to a more sonorant
environment, i.e. a long vowel or sonorant consonant?
This is related to the general question of truncation
or compression of the f0 contour in an intonationally
unfavourable environment (see e.g. Bannert &
Bredvad-Jensen, 1975).
INVESTIGATION
Materials, subjects, method
The speech materials comprise seven contrastive disyllabic word-pairs, all of which are listed pair-wise in Table 1 below. (Since the present investigation is part of a large-scale study, the word-pairs included here are not completely symmetric.) The target words, in phrase-final focal position, were extracted from various sets of sentences (statements) spoken in two different speaking styles: normal and clear voice.
RESULTS
The results of the acoustic analysis are exemplified in Figures 1–4. Figures 1–3 show the average f0 curves for five speakers' productions of malen/malen, pollen/pållen and tecken/täcken in clear style, respectively. The duration of the stressed vowel is indicated by a bar (at the top for accent 2 and at the bottom for accent 1). The pitch contours are aligned at the start of the stressed vowel, and earlier points are shown as having negative times relative to the alignment point. In words with a long vowel, the duration of the stressed vowel and the overall timing of the two word accents are very similar, meaning that a direct comparison of the timing of pitch events is possible, which is generally not the case in words with a short vowel.
Figure 4 compares, for accent 1 and accent 2, the average duration of the stressed vowel for the word-pairs malen/malen, pollen/pållen and tecken/täcken.
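The alignment convention described in the Results (contours lined up at the start of the stressed vowel, earlier points given negative times) can be sketched as follows; the frame times are invented.

```python
def align_to_vowel_onset(frame_times_ms, vowel_onset_ms):
    """Shift an f0 track's time axis so the stressed vowel starts at 0 ms;
    frames before the vowel onset get negative times."""
    return [t - vowel_onset_ms for t in frame_times_ms]

track = [0, 50, 100, 150, 200]            # raw frame times of one utterance
print(align_to_vowel_onset(track, 100))   # → [-100, -50, 0, 50, 100]
```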
Table 1. The seven contrastive word-pairs.

Accent 1 V:  Polen (Poland)     Judith                    malen (the moths)  buren (the cage)
Accent 2 V:  pålen (the stake)  ljudit (to have sounded)  malen (ground)     buren (carried)
Accent 1 V   pollen (pollen)    tecken (signs)            tjecker (Czechs)
Accent 2 V   pållen (horsey)    täcken (quilts)           checker (cheques)
[Figures 1–3: average f0 contours (Hz, 50–275) against time (ms, −300 to 700) for malen/malen, pollen/pållen and tecken/täcken; accent 1 vs. accent 2, averages over five speakers.]
[Figure 4: average duration (ms, 50–250) of the stressed vowel in malen/malen, pollen/pållen and tecken/täcken for accent 1 and accent 2.]
DISCUSSION
In Gothenburg Swedish words with a short vowel, accent 2 seems to demonstrate truncation of the pitch fall, while accent 1 seems to demonstrate compression of the fall and also some lengthening of the stressed vowel. It appears that the Gothenburg Swedish speakers' strategy is to preserve the fall in accent 1, while the fall seems to be of less importance for accent 2.
One interpretation of this is that the falling f0 contour in the stressed vowel of accent 1, and the height from which the fall takes place in accent 2, are enough of a cue to maintain the distinction between the word accents in words with a short stressed vowel. House (1990) has worked with a model of tonal feature perception which may be applied to these findings.
In order to fully understand the interaction of these cues, a perceptual experiment with synthetic stimuli is in preparation, which will manipulate pitch height and slope in order to discover the relative importance of these factors.
Visual Acoustic vs. Aural Perceptual Speaker Identification in a Closed Set of Disguised Voices
Jonas Lindh
Department of Linguistics
Göteborg University
the effects of low quality recordings. Generally, one can say that aural identification has primarily been the leading method when it comes to casework. Many studies have been carried out to see which parameters are most stable, or where the effects of low quality can be calculated, for example the telephone effect (Künzel, 2001).
Generally, LTAS becomes rather stable after 30–40 seconds of speech (Boves, 1984; Fritzell et al., 1974; Keller, 2004). LTAS reflects, on average, the energy highs and lows generated by the vocal tract filter, which means that it should be more difficult to alter than, for example, F0 or specific phones; this is why the measure is often chosen to visually represent the general energy distribution over frequency for the speech signal. Several studies have examined energy ratios and level differences for LTAS (Löfqvist, 1986; Löfqvist & Mandersson, 1987; Gauffin & Sundberg, 1977; Kitzing, 1986). Kitzing (1986) recommended that patients should read at the same degree of vocal loudness to avoid the differences that occurred especially in higher frequencies. Kitzing & Åkerlund (1993) pointed out the need for an investigation of the effect of vocal loudness on LTAS curves. Nordenberg & Sundberg (2003) performed such a test and showed that vocal loudness and varied f0 gave variations in Long Time Average Spectra. However, even though an expected variation has been shown, pattern matching on the graphs still seems possible. It has been observed that slight differences in identification results between subjects depend on whether they consider distance more important than shape/pattern.
Hollien & Majewski (1977) tested long-term spectra as a means of speaker identification under three different speaking conditions: normal, during stress, and disguised speech. LTS for fifty American and fifty Polish male speakers were used under fullband as well as passband conditions. The results demonstrated high levels of correct identification (especially under fullband conditions) for normal speech, with degrading results for stress and disguise.
Abstract
Many studies of automatic speaker recognition have investigated which parameters perform best. This paper presents an experiment in which graphic representations of LTAS (Long Time Average Spectrum) were used to identify speakers from a closed set of disguised voices, in order to determine how well the graphic method performed compared to an aural approach.
Nine different speakers were recorded uttering a fake threat. The speakers used different disguises such as dialect, accent, whisper, falsetto etc., as well as the verbatim threat in a normal voice.
Using high quality recordings, visual comparison of the Praat graphs of LTAS outperformed the aural approach in identifying the disguised voices. Performing speaker identification aurally does not mean analyzing a different sample than the one being analyzed acoustically. Studies of aural perception show a hypothesizing, top-down, active process, which raises interesting questions regarding aural speaker identification with bad quality recordings, noisy backgrounds, etc. However, more tests on telephone quality recordings, authentic material and additional types of acoustic measurements must be performed before anything can be said about LTAS with implications for forensic purposes.
Method
The sixteen disguised voices and suspects (references) were recorded by six females and three males. The recordings were made with a high quality microphone in front of a personal computer; the subjects recorded one normal voice and as many disguised voices as they wanted, repeating the same fake threat in Swedish. All recordings were between four and six seconds long and sampled at 16 kHz. Forced choice was applied in both the aural and visual tests.
The graphic representations of LTAS were created from an LTAS object using 100 Hz frequency bins (Boersma & Weenink, 2005).
Visual test: N of disguised voices = 16, N of subjects = 10, alpha = 0.91.

[Figure: percentage of correct visual identifications per disguised voice sample (1–16).]

Aural test: N of disguised voices = 16, N of subjects = 7, alpha = 0.83.

The reliability score is lower in this test compared to the visual test. However, the correlation is high enough to be interpreted as a rather high correlation between subjects.

Figure 2 shows how many correct identifications were made per disguised voice sample. Some graphs were obviously very difficult to identify. Why that is so, or how those graphs differ, has not yet been investigated.

[Figures: percentage of correct identifications per disguised voice sample (1–16) and per subject, for both tests.]
References
Boersma P. & Weenink D. (2005) Praat: doing phonetics by computer (Version 4.3.01) [Computer program]. Retrieved from <http://www.praat.org/>
Boves L. (1984) The phonetic basis of perceptual ratings of running speech. Dordrecht: Foris Publications.
Gauffin J. & Sundberg J. (1977) Clinical application of acoustic voice analysis. Part II: Acoustic analysis, results. 1977/2-3: 39-43.
Grosjean F. (1980) Spoken word recognition processes and the gating paradigm. Perception and Psychophysics, 28, 267-283.
Hollien H. & Majewski W. (1977) Speaker identification by long-term spectra under normal and distorted speech conditions. Journal of the Acoustical Society of America 62: 975-980.
Keller E. (2004) The analysis of voice quality in speech processing. In Lecture notes in computer science. Berlin: Springer Verlag.
Kersta L. G. (1962) Voiceprint identification. Nature 196: 1253-1257.
Kitzing P. (1986) LTAS criteria pertinent to the measurement of voice quality. Journal of Phonetics, 14: 477-482.
Künzel H. J. (2001) Beware of the 'telephone effect': The influence of telephone transmission on the measurement of formant frequencies. Forensic Linguistics 8: 80-99.
Löfqvist A. (1986) The long-time-average spectrum as a tool in voice research. Journal of Phonetics, 14: 471-475.
Löfqvist A. & Mandersson B. (1987) Long-time average spectrum of speech and voice analysis. Folia Phoniatrica, 39: 221-229.
Nordenberg M. & Sundberg J. (2003) Effect on LTAS of vocal loudness variation. TMH-QPSR, KTH, 45: 93-100.
Rose P. (2002) Forensic Speaker Identification. New York: Taylor & Francis.
Stevens K. N. (1993) Lexical access from features. In Speech communication group working papers (Vol. VIII, p. 119-144). Research Laboratory of Electronics, Massachusetts Institute of Technology.
Conclusions
General advantages of graphic representations are:
- Intra-subjectively applicable (depending on the amount of data).
- Relatively simple fundamentals for calculation.
- Rather easy to visualize.
General disadvantages are:
- Difficult to quantify and substantiate comparisons.
- The visualization depends on F0 and vocal loudness variations.
- An average always ignores specific events in the speech signal.
Considering the categorical, top-down, active human speech perception process (Grosjean, 1980), it is interesting to find visual acoustic information complementary to aural methods in forensic speaker identification. When two voice samples are compared, the same input is judged, whether aurally or acoustically; the question is how it is analyzed, and how the acoustic visual and the aural perceptual information are processed. If a better understanding of the relation between the two is reached, objective methods can be used to judge similarities, and objective acoustic methods, like subjective aural ones, can be excluded on well-grounded arguments. This could also lead to better statistical data in forensic speaker identification, if computer-based methods can be used under more confident supervision. It is clear that aural mistakes are made, especially for disguised voices.
The graphic representations used in this experiment are not claimed to be complete images reflecting the voice of a speaker. They are but examples showing that in some cases visual acoustic input is better at discriminating between speakers than are ears alone.
Abstract
The most successful methods for inducing emotions in state-of-the-art unit selection speech synthesis switch speech databases depending on the desired emotion. These methods require a substantial increase in memory compared to a single database and are computationally slow. The model-based approach is an attempt to reshape a neutrally recorded utterance (comparable to the desired output from a modern unit selection system) so that it simulates a recorded model of a desired emotion.
Factors for the manipulation of duration, amplitude and formant shift ratio are calculated by comparing the recorded neutral utterance with three recorded basic emotional models, in accordance with discrete emotion theory: sadness, happiness and anger. F0 (regarded as the intonation) is copied from the model and imposed on the neutrally recorded utterance.
The evaluation of the experiment shows that subjects easily categorize the discrete emotions in a forced choice. For the male voice, they also grade the emotional quality of the resynthesis from the neutrally recorded utterance almost as high as that of the naturally recorded models. The female voice created more difficulties and contained more synthetic artifacts, i.e. it was judged to have a lower quality than the recorded models.
Method
Two speakers, one female and one male, were recorded uttering the same sentence (Jag har inte varit där idag) in four different expressive styles: natural, sad, happy and angry. The recordings were made in a studio environment using a high-quality microphone. The speakers were first told to consider how to express the emotions in speech with respect to duration, amplitude and intonation. They were then told to express the emotions as clearly as possible while being recorded, even though the semantic content did not suggest a specific emotion.

Each recorded emotion was then used both as a model for inducing the specific emotion in the neutrally recorded utterance and as a reference against which the resynthesized speech could be compared. If one uses the same speaker, and the differences calculated from the same utterance produced with different emotions, one should be able to resynthesize at least the specific parameters correctly. Six subjects finally evaluated the results by categorizing and grading the neutral recording, the recorded models and the three resynthesized versions for each of the two speakers, i.e. fourteen utterances of the same sentence.
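The global factors this method relies on can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the definition of the factors as a simple length ratio and RMS-energy ratio are assumptions.

```python
import numpy as np

# Hypothetical sketch: derive global duration and amplitude factors
# from a neutral utterance and an emotional model utterance.
# `neutral` and `model` are mono waveforms at the same sampling rate.

def manipulation_factors(neutral, model):
    dur_factor = len(model) / len(neutral)      # duration ratio
    def rms(x):
        return np.sqrt(np.mean(np.square(x)))
    amp_factor = rms(model) / rms(neutral)      # amplitude (energy) ratio
    return dur_factor, amp_factor

sr = 16000
t_neutral = np.arange(sr) / sr                  # 1.0 s
t_model = np.arange(int(1.5 * sr)) / sr         # 1.5 s
neutral = np.sin(2 * np.pi * 220 * t_neutral)
model = 0.5 * np.sin(2 * np.pi * 220 * t_model)

d, a = manipulation_factors(neutral, model)
print(round(d, 2), round(a, 2))  # 1.5 0.5
```

A resynthesizer would then stretch the neutral utterance by the duration factor and scale it by the amplitude factor; formant shifting and F0 transplantation require a dedicated tool such as Praat.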
Acoustic measurements for the seven utterances of each speaker (one block per speaker):

F0 std   Ampl (dB)   F1 mean   F2 mean   F3 mean
24       68          519       1482      2644
18       69          405       1300      2592
16       69          512       1508      2517
52       75          528       1464      2602
46       72          522       1451      2629
8        70          517       1367      2672
5        68          507       1452      2629

F0 mean   F0 std   Ampl (dB)   F1 mean   F2 mean   F3 mean
172       17       70          573       1670      2687
328       73       67          587       1535      2783
311       25       68          610       1651      2681
358       119      77          707       1661      2709
349       107      73          608       1734      2767
250       53       77          638       1658      2649
236       52       74          614       1689      2686
[Table: naturalness grades for the male speaker's natural, sad, happy and angry utterances (recorded models and resynthesized versions).]
The average naturalness score for the four resynthesized male samples is 3.37, while the overall average for the recorded models is 4.29. Whether this decrease in naturalness is acceptable has not been investigated. Categorizing the male samples created no problems, with one uncertain exception (0.7 happy-neutral). This means that there is a trade-off between naturalness and a computationally cheap method.
[Table: naturalness grades for the female speaker's natural, sad, happy and angry utterances (recorded models and resynthesized versions).]
References
Boersma P. & Weenink D. (2005) Praat: doing
phonetics by computer (Version 4.3.07)
Abstract
Background
Introduction
Speech Data
The speech data used for pronunciation variation modelling has not been recorded specifically for this project, but has been collected
from various sources. The speech corpus includes data recorded or made available for research within the fields of phonetics, phonology and speech technology in different earlier
research projects. The speech data has been selected to be dialectally homogeneous, to avoid
dialectal pronunciation variation. The language
variety used is central standard Swedish.
The speech data has been recorded in different situations, and speaking-style-related variables are defined from the speaking situation. The data collected for the project includes radio news broadcasts and interviews, spontaneous dialogues, elicited monologues, acted readings of children's books, neutral readings of non-fiction and recordings of dialogue system interaction.

Methods and software for annotation have been developed using mainly the VAKOS corpus (Bannert and Czigler, 1999) as the target to be annotated. This corpus was originally recorded and annotated for the study of variation in consonant clusters in central standard Swedish. It consists of ~103 minutes of monologue from ten native speakers of central standard Swedish.
Method
Automatic methods are used (with some minor exceptions) for annotating spoken language data where annotation is not supplied with the corpora used. The word level annotation is the base for all other annotation. The automatically obtained word boundaries and orthographic transcripts are manually corrected. In this way, relatively little manual work can give a large gain in annotation performance for most types of annotation.
The annotation system is built as a serialised
set of modules, producing output at different
levels. The output can be manually edited and
used as input to modules later in the chain.
Annotation Structure
All annotation is connected to some duration-based unit at one of six hierarchically ordered tiers. The tiers correspond to 1) the discourse,
2) the utterance, 3) the phrase, 4) the word, 5)
the syllable and 6) the phoneme. Each tier is
strictly sequentially segmented into its respective type of units. Some non-word units can be
introduced in the word tier annotation to ensure
that parts of the signal that are not speech can
be annotated, e.g. pauses and inhalations.
A boundary on a higher tier is always also a
boundary on a lower tier. An utterance boundary is thus also always a phrase boundary, a
word boundary, a syllable boundary and a phoneme boundary. Thus, information can be unambiguously inherited from units on higher
tiers to units on the tiers below.
Having the information stored at different
tiers enables easy access to the sequential context information, i.e., properties of the units adjacent to the current unit at the respective tiers.
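The boundary constraint described above can be made concrete with a small sketch (the times are invented; this illustrates the constraint, not the project's actual data format):

```python
# Each tier is a list of (start, end) units in seconds; a boundary on
# a higher tier must also be a boundary on every tier below it.
discourse = [(0.0, 5.0)]
utterance = [(0.0, 2.5), (2.5, 5.0)]
word = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0), (4.0, 5.0)]

def boundaries(tier):
    """Collect all unit edges of a tier as a set of time points."""
    return {t for unit in tier for t in unit}

# Higher-tier boundaries are a subset of lower-tier boundaries:
assert boundaries(discourse) <= boundaries(utterance) <= boundaries(word)
print("tier hierarchy is consistent")
```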
Segmentation
Each annotation tier is segmented into its corresponding units, beginning at the word tier.
Based on the word tier segmentation and information derived from the word units, the tiers
above and below the word tier are segmented.
The phoneme tier is segmented word-by-word
using the orthographic annotation, a canonical
pronunciation lexicon and an HMM phoneme
aligner, NALIGN (Sjlander, 2003). The phonemes are clustered into syllables with forced
syllable boundaries at word boundaries and the
syllable tier is segmented using this clustering
and the durational boundaries from the phoneme segmentation. Utterance boundaries are
Model Performance
The annotation has been used for decision tree
model induction (initial results are reported in
Jande, 2004). The decision tree pronunciation
variation model works with phonemes in a canonical phonemic pronunciation representation
as its central units. A vector containing all
available context information is connected to
each canonical phoneme. For each canonical
phoneme, the model makes a decision about the
appropriate phone realisation given the context
associated with the canonical phoneme.
Decision tree models trained on annotation
from elicited monologue showed a PER of
9.91% when evaluated against the same type of
data as they were trained on, in a tenfold cross-validation setting. This meant a 55.25% error
reduction compared to using the canonical pronunciation representation for estimating the
phonetic realisation.
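As a sanity check on these figures (assuming the error reduction is expressed relative to the baseline error), the phone error rate of the canonical baseline implied by the reported numbers can be recovered:

```python
# If PER_model = PER_baseline * (1 - reduction), then a model PER of
# 9.91% with a 55.25% reduction implies a baseline PER of about 22.1%.
model_per = 9.91      # percent
reduction = 55.25     # percent
baseline_per = model_per / (1 - reduction / 100)
print(round(baseline_per, 2))  # 22.15
```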
The decision tree models were pruned to
make them more general (less specific to the
particular training data from which they were
induced). Thus, not all variables were used in
the final models. None of the discourse or utterance tier attributes were used in any of the
pruned models, probably due to the fact that
only one type of speaking style was used. From
the phrase, word, syllable and phoneme tiers,
many different types of attributes were used. As
could be expected, the identity of the canonical
phoneme was the primary phone level realisation predictor.
Conclusions
A system for annotation of speech data with
variables hypothesised to be important for the
pronunciation of words in discourse context has
been described. Automatic methods used for
obtaining or estimating variables have been
presented. The annotation has been used for
creating pronunciation variation models in the
form of decision trees. The models show a 55.25% decrease in phone error rate compared to using canonical phonemic word representations from a pronunciation lexicon.
References
Allwood J., Björnberg M., Grönqvist L., Ahlsén E. and Ottesjö C. (2000) The Spoken Language Corpus at the Linguistics Department, Göteborg University. Forum Qualitative Social Research 1.
Abstract
The Pairwise Variability Index (PVI), a measure of how much unit-to-unit variation there is
in speech, has been used as a correlate of
rhythmic impressions such as syllable-timed
and stress-timed. Grabe and Low (2002) included Estonian among a number of languages
for which they calculated the durational PVI
for vowels and for intervocalic intervals, but
other than that Estonian rhythm has not been
studied within recent approaches to rhythm
calculation. The pilot experiment reported in
this paper compares the use of various speech
units for the characterisation of Estonian
speech rhythm. It is concluded that the durational PVI of the syllable and of the foot are
more appropriate for capturing the rhythmic
complexity of Estonian, and might provide a
subtle tool for characterising languages in general.
Introduction
The Pairwise Variability Index (PVI) is a
quantitative measure of acoustic correlates of
speech rhythm which calculates the patterning
of successive vocalic and intervocalic (or consonantal) intervals showing how one linguistic
unit differs from its neighbour. It was first applied, at the suggestion of the second author of
this paper, by Low (1998: 25) in her study of
Singapore English rhythm. Among other measures Low compared successive vowel durations
and showed that Singapore English had a lower
average PVI over utterances than British English. This fits in with the impressionistic observation that Singapore English is more syllable
timed than British English.
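The index comes in a raw and a normalised variant; both can be sketched briefly (the durations below are invented, and which variant a given study uses varies, so both are shown under that caveat):

```python
def raw_pvi(durations):
    """Mean absolute difference between successive durations."""
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)

def npvi(durations):
    """Normalised PVI: each pairwise difference is divided by the
    pair mean, making the index independent of speech rate."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

vowel_durations_ms = [80, 120, 70, 140, 90]  # made-up vowel durations
print(raw_pvi(vowel_durations_ms))           # 52.5
print(round(npvi(vowel_durations_ms), 1))    # 50.7
```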
Syllable timing (Abercrombie, 1967: 97)
carries the implication that speakers make syllables the same length, and is opposed to stress
timing, the tendency to compress syllables
where necessary to yield isochronous feet (i.e.
inter-stress intervals). Attempts to find
isochrony of either kind have produced disappointing results, even for languages canonically perceived as syllable-timed (e.g. French)
or stress-timed (e.g. British English). The PVI,
however, shifted the emphasis from absolute
Method
Materials
The materials used were the first four sentences
of a read passage recorded for intonation analysis in Asu (2004). The four sentences comprised 62 syllables and (depending on the
speaker) four to seven intonation phrases. This
is less than half the 160 or so syllables on
which Grabe and Low's (2002) Estonian results are based, but in compensation the present analysis uses data from five speakers compared to only one in theirs. The speakers are all female speakers of Standard Estonian who were asked to read the passage at a normal tempo.
Three subjects were recorded in a quiet environment in Tartu, Estonia, and two in the
sound-treated booth of the phonetics laboratory
of Cambridge University.
Results
Discussion
Conclusion
This paper has presented a preliminary investigation of Estonian rhythm, comparing a number
of measures, each of which expressed the fluctuation in duration between successive phonological units. It has been argued that the common practice of characterising languages in
terms of pairwise variability of vowels and
intervocalic intervals may be less appropriate
than using variability measures of phonological
syllables and of stress feet. This is particularly
so when the results are to be related to impressionistic characterisations in terms of syllable-timing and stress-timing. However, the point is made that these terms are not opposites ranged on a single continuum, but two independent parameters along which languages can vary.
References
Abercrombie D. (1967) Elements of General
Phonetics. Edinburgh: Edinburgh University
Press.
Abstract
Recordings of Finnish casual dialogue and
careful reading were analyzed auditorily and
on spectrograms. Syllables on the phonological
level were compared to syllable-sized units
(CVs) on the phonetic level. Comparisons
with existing Swedish and Spanish data revealed several differences: Finnish had much
less temporal equalization of syllable-sized
units in casual speech than Swedish, and even
slightly less than Spanish. Instead, there was a
greater decrease in the number of CVs. In all
three languages, the duration of a CV was
partly dependent on its size. In Finnish, however, (in contrast to Swedish and, to a lesser
degree, Spanish) CV duration was only marginally affected by lexical stress. Finnish, like
Spanish, had rhythmic patterns typical of syllable-timed languages in both speaking styles,
while Swedish changed from a more stress-timed pattern in careful reading to a more syllable-timed pattern in casual speech.
Methods
The Finnish speech material consisted of a
lively dialogue between native speakers PJ and
EK, and text reading (PJ). The dialogue was
recorded in the early 1990s and was used
for segment duration analyses (Engstrand and
Krull, 1994). The text reading was recorded in
2005. The Swedish and Spanish materials cited
for comparisons come from Engstrand and
Krull 2001, 2002. All recordings were made in
sound-treated recording studios using high
quality professional equipment.
The digitized Finnish material was segmented into syllable-sized units and labeled using the Soundswell Signal Workstation
(http://www.hitech.se/development/products/so
undswell.htm). Since casual speech is characterized by numerous coarticulation and reduction phenomena, a conventional
morphophonetically based syllabification was not possible. Reliable identification and segmentation
of units posed problems; e.g. the Swedish word
behandla could be pronounced as [bela].
Therefore, contoid-vocoid(-contoid) sequences
mirroring opening-closing movements were
chosen as units (see Engstrand and Krull,
2002). For simplicity, they will be referred to as CV units, where C may be a single contoid or a cluster and V a single vocoid or a diphthong. The term unit is used in a strictly phonetic sense; a unit may sometimes contain traces of underlying segments.
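The grouping into CV units can be sketched with a toy orthographic example. The real segmentation was auditory and spectrographic; the vowel inventory and the rule that word-final contoids close the last unit are simplifying assumptions made for this illustration.

```python
def cv_units(phones, vowels="aeiouyäöå"):
    """Group a phone string into contoid-vocoid(-contoid) units:
    a new unit starts at each contoid that follows a vocoid."""
    units, current = [], ""
    for ch in phones:
        if current and current[-1] in vowels and ch not in vowels:
            units.append(current)
            current = ch
        else:
            current += ch
    if current:
        if units and all(c not in vowels for c in current):
            units[-1] += current  # word-final contoids close the last unit
        else:
            units.append(current)
    return units

print("-".join(cv_units("myöskielellisen")))  # myö-skie-le-lli-sen
```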
The segmentation was carried out auditorily
and visually on spectrograms in the same manner as with the Swedish and Spanish material
(Engstrand and Krull 2001, 2002). Onsets consisted where possible of a single contoid or a
contoid cluster, and a unit was considered well-formed if it agreed with Jespersen's sonority hierarchy (Jespersen 1926). No consideration was given to the phonotactics of a given language or to word and morpheme boundaries. For example, the Finnish words myös kielellisen would be segmented as myö-skie-le-lli-sen re-
Introduction
The complex Swedish phonotactics has been shown to be considerably simplified in casual
speech (Engstrand and Krull, 2001). Syllables
containing heavy consonant clusters on the
phonological level were often represented by
alternating simple contoid-vocoid sequences in
casual conversation. As a consequence, the durations of syllable-sized units tended to be
equalized and the rhythmic pattern of Swedish
came closer to that of a syllable timed language
such as Spanish (Engstrand and Krull, 2002).
The present paper addresses the question:
How would the durations of syllable-sized units
in different Finnish speaking styles compare
with Swedish and Spanish? On the one hand,
Finnish resembles Spanish in the simplicity of its phonotactics, which would lead us to expect a similarity to the Spanish pattern. On the other hand,
there is a segmental short-long contrast as in
Swedish, although not limited to stressed syllables. Phonetically, the difference between short
and long segments is larger in Finnish and there
Results
Table 1 shows the incidence of phonetic syllable-sized units vs. phonologically determined
syllables in Finnish. Swedish data (Engstrand
and Krull 2001) were added for comparison.
(The duration of units in prepausal positions
was not included.) It can be seen that in both
languages, the total number of phonetic units
was lower than the corresponding number of
phonological syllables. The decrease was larger
for the Finnish speakers: 12% (PJ) and 15%
(EK) in casual speech, while the corresponding figures for the Swedish speakers were 9% (RL) and 10% (JS). In the reading condition, the decrease was 10% for Finnish and only 2% for Swedish.
In addition, Table 1 shows that the share of
open units (i.e. sequences ending in a vowel or
vocoid) was larger in the phonetic representation of both languages. The increase was much
larger for the Swedish speakers: from 27% to
73% (RL), 40% to 73% (JS), and 31% to 62%
(GT, read text). For Finnish, the corresponding
increase was from 58% to 80% (PJ), 61% to
79% (EK), and 57% to 77% (PJ, read).
Table 1. Phonetic syllable-sized units and phonological syllables in casual and read Finnish. Swedish data (Engstrand and Krull 2001) are added for comparison.

Language, speaker,   Analysis   No. of   % open   % CV    % CCV
condition                       syll.    syll.    and V
Finn. PJ casual      Phonet.     960      80       73      7
                     Phonol.    1097      58       58      0
Finn. EK casual      Phonet.     584      79       67     11
                     Phonol.     647      61       61      0
Finn. PJ read        Phonet.     876      77       71      7
                     Phonol.     954      57       57      0
Swed. RL casual      Phonet.     822      73       61     12
                     Phonol.     900      27       25      2
Swed. JS casual      Phonet.     543      81       67     13
                     Phonol.     491      40       37      2
Swed. GT read        Phonet.     977      62       53      9
                     Phonol.     997      31       29      2

Mean durations (ms) of CV units, with standard deviations and number of units:

Language, speaker   Condition   Mean   Std     N
Finnish PJ          Read         156    48   675
Finnish PJ          Casual       154    42   761
Finnish EK          Casual       166    48   429
Swedish             Read         200    86   350
Swedish             Casual       178    62   306
Spanish             Read         156    59   167
Spanish             Casual       155    51   287
Another difference between the two languages was the share of simple syllables
(mainly CV, in a few cases V).

Figure 1. Distributions of CV-unit durations (ms) in Finnish, Swedish and Spanish in two speaking conditions: upper row read text, lower row casual speech. Data affected by prepausal lengthening are removed. (The Swedish and Spanish data are from Engstrand and Krull, 2002.)

In Finnish, such
structures made up more than half of the syllables on the phonological level, while in Swedish the corresponding amount was less than a
third. Compared to Swedish, therefore, Finnish allows fewer possibilities for syllable simplification.

It appears that instead of simplifying syllables, the Finnish speakers dropped them. There was a relatively large decrease in the number of syllables between the phonological representation and its phonetic counterpart in Finnish, both in casual speech and in text reading. Another difference between Finnish and Swedish was found in the distribution of syllable durations. Although Finnish has a comparatively large difference in duration between short and long segments (see Engstrand and Krull, 1994 for data from speakers PJ and EK), the durations of (open) syllables tended to cluster within a narrow range around a peak (Figure 1). There was not much change in distribution breadth between reading and casual speech. Part of the explanation for this phenomenon may lie in the near-equality of the durations of stressed and unstressed syllables.
Figure 2. Mean durations (ms) as a function of CV unit size in read and unscripted Swedish and Finnish. Upper graphs show casual speech; lower graphs show text reading. Filled circles represent stressed units and triangles unstressed units.
References
Engstrand, O. and Krull, D. (1994). Durational correlates of quantity in Swedish, Finnish and Estonian: cross-language evidence for a theory of adaptive dispersion. Phonetica 51, No. 1-3.
Engstrand, O. and Krull, D. (2001). Simplification of phonotactic structures in unscripted
Swedish. J.I.P.A. 31, 41-50.
Engstrand, O. and Krull, D. (2002). Duration of
syllable-sized units in casual and elaborated
speech: cross-language observations on
Swedish and Spanish. In: Fonetik 2002,
TMH-QPSR Vol. 44.
Eriksson, A. (1991). Aspects of Swedish
speech rhythm. Gothenburg Monographs in
Linguistics 9. Department of Linguistics,
University of Göteborg.
Jespersen, O. (1926). Lehrbuch der Phonetik. 4.
Aufl., 190-91, Leipzig and Berlin: Teubner.
Abstract
In the present paper, recordings of Swedish on
multilingual ground from three different cities
in Sweden are compared and discussed.
Introduction
In Sweden, an increasing number of adolescents speak Swedish in new, foreign-sounding
ways. These new ways of speaking Swedish are
primarily found in the cities. The overarching
purpose of the research project Language and
language use among young people in multilingual urban settings is to describe and analyze
these new Swedish varieties (hereafter referred
to as Swedish on multilingual ground, SMG)
in Malmö, Gothenburg and Stockholm.
Most SMG varieties are known by names
that reveal where they are spoken: Rinkeby Swedish in Rinkeby, Stockholm, Gårdstenska in Gårdsten, Gothenburg, and Rosengård Swedish in Rosengård, Malmö. However, if you discuss Rinkeby Swedish with young people in Malmö, they will instantly associate it with Rosengård Swedish (i.e. with the corresponding Malmö SMG variety); if you play examples of Rosengård Swedish to teenagers in Lund, they will associate them with the Lund SMG variety Fladden (named after Norra Fladen), and so on. In other words, obvious similarities are perceived between different varieties of SMG.
Method
The material comes from the speech database
collected by the research project Language and
language use among young people in multilingual urban settings.
During the academic year 2002-2003, the
project collected a large amount of comparable
data in schools in Malmö, Gothenburg and Stockholm. The speakers are young people (mainly 17-year-olds) who attended the second year of the upper secondary school's educational program in social science during 2002-2003.
The recordings comprise both scripted and spontaneous speech. They include: (01) interviews between a project member and the participating pupils, (02) oral presentations given by the participating pupils, (03) classroom recordings, (04) pupil group discussions, and (05) recordings made by the pupils themselves (at home, during the lunch break, at cafés, etc.).

The recordings were made with portable minidisc recorders (SHARP MD-MT190) and electret condenser microphones (SONY ECM-717), and subsequently digitized.
The naturalness of the speech material has been obtained at the expense of good sound quality. Acoustic analyses using the speech analysis programs WaveSurfer and Praat have been undertaken when possible; other parts of the material have primarily been examined using auditory analysis.
Purpose
In the present paper, a first comparison between SMG materials recorded in Malmö, Gothenburg and Stockholm is undertaken with the object of searching for differences and similarities in the varieties' phonology and phonetics.
Previous research
Descriptions in the literature of so-called ethnic accents or (multi)ethnolects of different languages reveal some similarities. One example
of such a similarity is a staccato-like rhythm or
syllable-timing. A staccato-like rhythm has
been observed in e.g. Rinkeby Swedish (Kotsinas 1990), in the so-called Nuuk Danish spoken
by monolingual Danish-speaking adolescents in
Results
In the following, we will restrict ourselves to
describing a small set of SMG features that
demonstrate interesting differences and similarities between the cities.
Prosody
Word accents
It is a well known fact that L2 learners of
Swedish find it difficult to perceive and produce the word accent distinction. Given the
close relation between foreign accent and
SMG, one possible common feature of the
SMG varieties is a lack of word accent distinction.
Phonetically, the difference between accent
I and II is one of F0 peak timing. The F0 peak
of accent I has an earlier alignment with the
stressed syllable than accent II. In the Malmö dialect, the F0 peak is found at the beginning of the stressed syllable in accent I words, and at the end in accent II words. The same pattern can be seen in examples of Rosengård Swedish, see Figure 1.
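The alignment difference can be quantified as the relative position of the F0 peak within the stressed syllable; here is a sketch with an invented F0 track (values near 0 indicate an accent-I-like early peak, values near 1 an accent-II-like late peak):

```python
import numpy as np

def peak_alignment(times, f0, syll_start, syll_end):
    """Relative position of the F0 maximum within the stressed
    syllable: 0 = syllable onset, 1 = syllable offset."""
    mask = (times >= syll_start) & (times <= syll_end)
    peak_time = times[mask][np.argmax(f0[mask])]
    return (peak_time - syll_start) / (syll_end - syll_start)

times = np.linspace(0.0, 0.4, 401)                    # 0.4 s, 1 ms steps
# Toy F0 track: a peak placed early in the 0.1-0.3 s syllable.
f0 = 150 + 80 * np.exp(-((times - 0.12) ** 2) / (2 * 0.03 ** 2))
print(round(peak_alignment(times, f0, 0.1, 0.3), 2))  # 0.1 (early peak)
```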
Segmentals
and t
When we ran a perception experiment in Malmö with the object of investigating which of our informants spoke Rosengård Swedish (Hansson & Svensson 2004), we noted that one of the stimuli contained something typical of SMG at the very beginning of the recording. Instead of listening to the entire 30-second stimulus, the listeners (adolescents from Malmö) marked their answer after having heard only the first two prosodic phrases (approximately 6.5 seconds). The two phrases in question are given in (1).
(1) ja(g) ska gå plugga lite nu asså hon checkar språket å sånt ('I'm gonna go study a little now, like, she checks the language and stuff')
[Figure 1. F0 contour (100-400 Hz) over 'bra ti(ll) 'hem,sidor men.]
R sounds
If you ask a Scanian to imitate Rosengård Swedish, he or she will most likely use front r sounds. Indeed, among the SMG speakers in Malmö, the pronunciation of the r sound varies greatly. Out of the ten stimuli perceived as Rosengård Swedish, front r sounds can be heard in five. Among them, there are both fricative and approximant r sounds and the more perceptually salient trilled r sounds.
Also in the Stockholm SMG material, the r sounds differ from the regional dialect in that trilled r sounds appear to be used more frequently.
[Two F0 contour figures: one over de(t är) ''skit,go(d) 'mat, one over de(t är) tju(go)sju p(ro)cent som vill ha kvar kungen.]
The word accents in the Gothenburg SMG material still remain to be investigated.
In summary, the SMG varieties have both shared features and regional features.
Intonation
An expanded F0 range can be observed in utterances recorded in all three cities. The pattern
is found mainly in exclamations and rhetorical
questions, see Figures 3, 4 and 5.
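An expanded range is conveniently quantified in semitones, which makes it independent of the speaker's register; the values below are invented for illustration:

```python
import math

def f0_range_semitones(f0_min, f0_max):
    """F0 range expressed in semitones (12 semitones per doubling)."""
    return 12 * math.log2(f0_max / f0_min)

print(round(f0_range_semitones(100, 400)))  # 24: two octaves
print(round(f0_range_semitones(100, 200)))  # 12: one octave
```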
Discussion
How come there are similarities?
How come the different SMG varieties share the above-mentioned features? The relation to learner language and foreign accent is of course obvious in Malmö, Gothenburg and Stockholm alike, but a foreign accent can sound in a multitude of different ways.
One possible explanation is, of course, that all SMG varieties are influenced by the same language or language family. On the other hand, SMG does not sound like one particular foreign accent. Another possible explanation is that the varieties are characterized by features that are typologically unmarked and frequent in the world's languages. This is either related to the fact that many of those features exist in the teenagers' first languages, or to the fact that simplification and the use of unmarked features are generally favored in language contact situations (regardless of what the languages in contact are). A third explanation is that features of the Swedish language itself give rise to the varieties' similar sound, e.g. the difficulties encountered when learning Swedish.
All three alternatives probably have some explanatory power, although none of them completely accounts for why the varieties sound like they do. Word accents are unusual in the speakers' first languages, tend to disappear in language contact situations (as in Finland Swedish), and are difficult for second language learners to learn. A word accent distinction is, nevertheless, maintained in SMG.
[Two F0 contour figures (100-600 Hz) over the phrases ve vem smart nt and ja(g) ba hungri(g).]
In the present paper, we have presented a number of segmental and prosodic features that are common to all SMG varieties, but also discussed a feature that distinguishes them from each other (the word accent realization). Future research will reveal more similarities and differences and thereby, hopefully, shed some light on the relationship between the different SMG varieties on the one hand (e.g. whether city-hopping has occurred), and on the relationship between SMG and foreign accent on the other.
Acknowledgements
The research reported in this paper has been
financed by the Bank of Sweden Tercentenary
Foundation.
References
Hansson P. & Svensson G. (2004) Listening for
Rosengård Swedish. Proceedings
FONETIK 2004, The Swedish Phonetics
Conference, May 26-28 2004, 24-27.
Holmes J. & Ainsworth H. (1996) Syllable-timing and Maori English. Te Reo 39, 75-84.
Jacobsen B. (2000) Sprog i kontakt. Er der opstået en ny dansk dialekt i Grønland? En pilotundersøgelse. Grønlandsk kultur- og samfundsforskning 98/99, 37-50.
Kotsinas U-B. (1990) Svensk, invandrarsvensk eller invandrare? Om bedömning av främmande drag i ungdomsspråk. Andra symposiet om svenska som andraspråk i Göteborg 1989, 244-274.
Labov W. (2003) Pursuing the cascade model.
In Britain D. & Cheshire J. (eds) Social
Dialectology: In Honor of Peter Trudgill.
Amsterdam: John Benjamins.
Low, E. & Grabe, E. (1995). Prosodic patterns
in Singapore English. Proceedings of the
International Congress of Phonetic Sciences,
Stockholm 13-19 August 1995, 636-639.
Quist P. (2000) Ny københavnsk multietnolekt. Om sprogbrug blandt unge i sprogligt og kulturelt heterogene miljøer. Danske
Talesprog, 143-212. Copenhagen: C.A.
Reitzels Forlag.
Trudgill P. (1974) Linguistic Change and Diffusion: Description and Explanation in
Sociolinguistic Dialect Geography.
Language in Society 2, 215-246.
Udofot I. (2003) Stress and rhythm in the
Nigerian accent of English: A preliminary
investigation. English World-Wide 24: 2,
201-220.
Differences
Despite the similarities perceived between Rinkeby Swedish and Rosengård Swedish by adolescents in Malmö, many are surprised to hear that the Malmö adolescents perceive the soccer player Zlatan Ibrahimovic as a speaker of Rosengård Swedish (and not simply a speaker of the Malmö dialect). How large is the difference between SMG and the regional dialect? How large is the difference between e.g. Rosengård Swedish and Scanian? Although Rosengård Swedish clearly contains a number of non-regional features, not all speakers of Rosengård Swedish use all of those features, and many features of Rosengård Swedish are not distinct from the regional dialect (e.g. the word accents). Swedish on Multilingual Ground should, therefore, only be seen as an overarching term for a number of related but distinct varieties. SMG in Malmö appears to be Scanian on Multilingual Ground (which incidentally is reflected in the Advance Patrol member's artist name Gonza Blatteskånska).
Abstract
The results of an acoustic analysis and a perceptual evaluation of the role of prosody in spontaneously produced ja and sì in Swedish and Italian are reported and discussed in this paper. The hypothesis is that pitch contour, duration cues and relative intensity can be useful in identifying the different communicative functions of these short expressions taken out of their context. The results of the perceptual tests, run to verify whether the acoustic cues alone can be used to distinguish different functions of the same lexical items, are encouraging only for Italian sì, while for Swedish ja they show some confusion among the different categories.
Introduction
Short expressions such as "mh", "ah" and yes or no are widely produced during spontaneous conversation and seem to carry a variety of communicative functions. For instance, Gardner (2001) reports having recognized eight main types of "mm" in corpora of spoken English, while Cerrato (2003) reports that one of the most common functions that these short expressions carry out is that of feedback. Feedback can have different nuances of meaning; therefore the same expression can be produced in several contexts, to convey different communicative functions. It seems possible that the specific dialogue function of short utterances is reflected in their suprasegmental characteristics.
The focus of this paper is on the role of prosody in signaling specific dialogue functions for "ja" in Swedish and "sì" in Italian (i.e. yes), which are frequently used in natural conversational interaction and which are essential for the smooth progress of the communication process.
"Ja" and "sì" are used by dialogue participants to indicate that the current listener is following and willing to go on listening; that he/she is following but willing to take the turn; or to signal that the listener has understood what has been said so far and is still paying attention to the dialogue, prompting the speaker to go on. Moreover, "ja" and "sì" can be used to answer yes-no questions.
Analyses
A functional categorization of Italian "sì" and Swedish "ja" was first carried out using the labels reported in Table 1. The categorization was carried out by listening to the short expressions and considering the explicit function that they were carrying in the given context. Short expressions were
1 The two Italian dialogues were recorded in a sound-treated room at the University of Naples; they are part of the Italian corpus called CLIPS. More information about the CLIPS corpus is available on the web page: http://www.cirass.unina.it/ingresso.htm
2 The two Swedish dialogues are not part of a larger corpus. They were recorded in a sound-treated room at Stockholm University for the special purpose of analysing pre-aspiration phenomena in Swedish stop consonants. More information on the Swedish map-task dialogues is in Helgason, P., Preaspiration in the Nordic Languages: Synchronic and Diachronic Aspects. Doctoral thesis, Stockholm University, 2002.
Table 1. Functional labels.

  Label  Comment
  FCI    Feedback Continuation: I want to go on
  FCY    Feedback Continuation: you go on
  FA     Feedback Acceptance
  RP     Answer Positive: positive answer to a polar question

Table 2. Main prosodic characteristics per function (Swedish and Italian).

  Function  Characteristics
  FCY       Rising F0, lengthening
  FCI       Flat F0 (lengthening for speaker 1)
  FA        Rising F0 (Swedish); Falling F0 (Italian)
  RP        Rising and long F0 (if in context); Falling and short F0 (if in isolation)
Perceptual test
The test consisted of two sub-tests, one with the Italian stimuli submitted to 8 Italian listeners and one with the Swedish stimuli submitted to 8 Swedish listeners.
The test material consisted of 8 stimuli for each category, organized in two blocks of 34 stimuli for Italian and 34 stimuli for Swedish (the first two stimuli in each block being dummies). No manipulations were performed on the stimuli, in order to preserve their naturalness; however, for the categories RP and FCI there were not enough instances of stimuli per speaker, hence some of them were played twice. Before the experimental session the participants were given written instructions and were involved in a short training session to familiarize themselves with the task. During the experimental session the stimuli were presented individually over headphones in randomized order, and after each presentation the listener chose, on the answering sheet, one of the 4 available labels (reported in Table 1) for the function that they believed the stimulus carried out in the conversation.
The results, in the form of confusion matrices, for the Italian listeners judging the "sì" of the 2 Italian speakers are reported in Tables 4a and 4b. For the Italian stimuli, all the recognition rates appear to be above chance level. In Italian, FA and RP are confused with each other. This might depend on the fact that they have similar acoustic characteristics, in particular similar pitch contour and duration, the only difference being the higher intensity of the RP stimuli (+4 dB). FCY for Italian speaker 1 gets high recognition rates; this may be due to lengthening.
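How a per-category recognition rate is judged against chance level can be sketched from a confusion matrix. The counts below are invented for illustration only; they are not the actual data behind the perception test.

```python
# Per-category recognition rates from a confusion matrix
# (rows = intended function, columns = listeners' responses).
# NOTE: the counts are hypothetical, not the actual test data.
LABELS = ["FA", "FCI", "FCY", "RP"]

confusions = [
    [10, 1, 1, 4],   # responses to FA stimuli
    [2, 11, 2, 1],   # responses to FCI stimuli
    [1, 2, 12, 1],   # responses to FCY stimuli
    [5, 1, 1, 9],    # responses to RP stimuli
]

chance = 1 / len(LABELS)  # 0.25 with four response alternatives

# Recognition rate = diagonal count divided by the row total.
rates = {label: row[i] / sum(row)
         for i, (label, row) in enumerate(zip(LABELS, confusions))}
above_chance = {label: rate > chance for label, rate in rates.items()}
```

With these invented counts, all four categories come out above the 0.25 chance level; in the actual Swedish data the RP category did not.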
Table 3 shows the average duration in milliseconds of Italian "sì" for the 4 functions.
The results for the Swedish listeners judging the "ja" of the 2 Swedish speakers are reported in Tables 5a and 5b. For the Swedish stimuli, not all the recognition rates appear to be above chance level; RP is in fact not distinguished. This might depend on the fact that in Swedish positive answers
[Bar chart: average durations of the four functions (FA, FCI, FCY, RP) for speakers 1 and 2; vertical scale 0-250 ms.]
References
Cerrato L. & D'Imperio M. (2003) Duration and tonal characteristics of short expressions in Italian. In Proceedings of ICPhS 2003, Barcelona.
Cerrato L. (2003) A comparative study of verbal feedback in Italian and Swedish map-task dialogues. In Proceedings of the Nordic Symposium on the comparison of spoken languages, 2003, 99-126.
Cerrato L. (2004) A coding scheme for the annotation of feedback phenomena in conversational speech. LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces, Lisboa, 25-28.
Gardner R. (2001) When Listeners Talk. John Benjamins Publishing Company.
House D. (2005) Phrase-final rises as a prosodic feature in wh-questions in Swedish human-machine dialogues. Accepted in Speech Communication.
Köhler K.J. (2004) Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. In Fant G. et al. (eds) From traditional phonology to modern speech processing, 205-214. Beijing: Foreign Language Teaching and Research Press.
Ohala J.J. (1983) Cross-language use of pitch: an ethological view. Phonetica 40, 1-18.
Sjölander K. & Beskow J. (2000) WaveSurfer - an Open Source Speech Tool. Proceedings of ICSLP 2000, Beijing, China.
Acknowledgements
Special thanks to my supervisor David House for inspiring discussions about the results of this study.
Conclusions
The aim of this study was to investigate the acoustic and perceptual characteristics of spontaneously produced "ja" and "sì" in Swedish and Italian, in order to find out whether acoustic cues can be used to render and to recognize different communicative functions.
Abstract
This paper reports on a comparison of prosodic
variables from oral presentations in a first and
second language. Five Swedish natives who
speak English at the advanced-intermediate
level were recorded as they made the same
presentation twice, once in English and once in
Swedish. Though it was expected that speakers
would use more pitch variation when they
spoke Swedish, three of the five speakers
showed no significant difference between the
two languages. All speakers spoke more quickly
in Swedish, the mean being 20% faster.
Introduction
Two earlier contributions to the Annual Swedish Phonetics Conference have outlined ideas
for a feedback mechanism for public speaking.
Briefly, Hincks (2003) proposed that speech
technology be used to support the practice of
oral presentations. Speech recognition could
give feedback on repeated segmental errors
produced by non-natives as well as provide a
transcript of the presentation, which could then
be processed for lexical and syntactic appropriateness. Speech analysis could give feedback
on the speaker's prosodic variability and speaking rate. Hincks (2004) presented an analysis of
pitch variation in a corpus of second language
student presentation speech. Pitch variation was
measured as the standard deviation of F0 for
10-second long segments of speech, normalized
by dividing by the mean F0 for that segment.
This value was termed PVQ, for pitch variation
quotient. Hincks (forthcoming) reports on the
results of a perception test of speaker liveliness,
where a strong correlation (r = .83, n = 18, p <
.01) was found between speaker PVQ and perceptions of liveliness from a panel of eight
judges.
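The PVQ described above is a simple computation; a minimal sketch, assuming F0 values (in Hz) have already been extracted elsewhere for one 10-second segment:

```python
import statistics

def pvq(f0_values):
    """Pitch variation quotient: standard deviation of F0
    normalized by the mean F0 of the same segment.

    f0_values: F0 estimates (Hz) for voiced frames within one
    10-second segment; F0 extraction itself is assumed to be
    done by an external tool.
    """
    mean_f0 = statistics.mean(f0_values)
    return statistics.stdev(f0_values) / mean_f0

# A flat contour yields a low PVQ, a varied one a higher PVQ.
flat = pvq([118, 120, 122, 121, 119])
lively = pvq([90, 140, 200, 120, 160])
```

Because the quotient is normalized by the segment mean, it is comparable across speakers with different F0 registers.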
Though automatic feedback on the prosody
of public speaking could be useful for both first
and second language users, the abovementioned studies have been done on a corpus
of L2 English, where native Swedish students
of Technical English were recorded as they
made oral presentations as part of their course
Method
The goal for the data collection used for this
paper was to have a corpus where the same
speaker used both English and Swedish to
make the same presentation, with the same visual material. Because class time could not be
wasted with having students hear the same
presentation twice, the Swedish recordings
needed to be made outside the classroom. All
students studying English at KTH in the fall of
2004 (nearly 100 students) were contacted
and asked whether they would like to participate. They were told that they would first be
recorded in the classroom as they made their
presentations in English, and that they would
then meet in groups and make the same presentations in Swedish to each other. They were offered 150 SEK as compensation for the extra
time it would take. Unfortunately, only five
students were able to participate. These five,
three males and two females, were all intermediate students. They were first recorded in their
[Figure 1. Mean PVQ per speaker (M1, M2, M3, F1, F2), English vs. Swedish; vertical scale 0.06-0.26.]
Temporal measures
The male speakers spoke for a shorter length of
time when making the presentation in Swedish
than when using English, as shown in Figure 2.
[Figure 2. Presentation duration in seconds per speaker (M1-F2), English vs. Swedish; vertical scale 0-700 s.]
Speaking rate
Part of the reason the speakers could make their
presentations in a shorter period of time is that
they spoke on average 20% more quickly. Figure 3 shows the speaking rate per speaker in
syllables per second. The mean speaking rate in
English was 2.97 sps, and for Swedish was 3.58
sps. M3, the only student to use a lot more
pitch variation in Swedish than in English, also
spoke much more quickly in Swedish. Note
also that the two females are more stable in
their speaking rates, and that the fastest and
slowest speakers in one language maintain their
ranking in the other language.
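The rate comparison above is straightforward arithmetic; a small sketch using the mean rates reported in the text (the syllable count in the usage line is illustrative):

```python
def speaking_rate(n_syllables, duration_s):
    """Speaking rate in syllables per second (sps)."""
    return n_syllables / duration_s

# Mean rates reported above: 2.97 sps (English), 3.58 sps (Swedish).
english_sps = 2.97
swedish_sps = 3.58

# Relative increase when switching to the first language:
relative_increase = (swedish_sps - english_sps) / english_sps  # about 20%
```

For example, `speaking_rate(300, 100)` gives 3.0 sps, and the reported means yield a relative increase of roughly 0.205, i.e. the ~20% figure quoted in the text.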
Results
Pitch variation quotients
The mean PVQs per speaker for the two presentations are shown in Figure 1. For three of
the five speakers, there was very little difference in the PVQs when using English and when
using Swedish. Only one speaker, M3, had significantly lower PVQ speaking English, but another, F1, had lower PVQ when speaking
Swedish. Though there are only five speakers,
the mean values reflect the same range as that
found in the all-English corpus, with a low of
about 0.11 and a high of about 0.24.
[Figure 3. Speaking rate in syllables per second per speaker (M1-F2), English vs. Swedish.]
Language or performance?
The result that three of five speakers showed no
significant difference in PVQ depending on the
language they were using would seem to indicate that PVQ measures are more speaker-dependent than language-dependent, at least for
native speakers of Swedish. The hypothesis that
the speakers would use less pitch variation
when speaking English was not at all borne out
by the study. It seems that the PVQ depends
mostly on speaking style, and perhaps the energy one puts into performing in a certain
situation. The English presentation was a
higher-stakes event, where students were
speaking to more people and, most importantly,
receiving a grade on their work. Speaker F1
performed very well for her first presentation,
and with the high mean length of runs combined with higher-than-average mean PVQ,
probably would have received high liveliness
ratings had her speech been part of the perception test. It is interesting that she was the only
student to have lower PVQ values and the only
student to have lower MLR values in Swedish
than in English. This could indicate that she in
some way put less effort into performance for
the Swedish presentation. Speaker M3, on the
other hand, was either hampered by using English or relatively unprepared when making the
first presentation. He could have benefited by
rehearsing with a feedback mechanism beforehand.
For the purposes of a thesis grounded in
computer-assisted language learning, these results throw a bit of a wrench in the works. The
problems I am proposing to help may not depend on the use of a second language, but on
more basic features of speaking style. On the
other hand, at advanced levels of language
courses, it is difficult to separate the needs of
first and second language users. Furthermore,
many native speakers as well as non-natives
obviously have problems achieving an engaging speaking style, and it has never been my intention to propose a device restricted to non-native use.
[Figure 4. Mean length of runs (MLR) per speaker (M1-F2); vertical scale 0-14.]
Discussion
This study was performed on a small group of
speakers, and any results should be interpreted
with care. The students who participated were
paid volunteers, and in that sense cannot be
considered as representative of the population
to the same extent as the speakers recorded for
Further work
A small study is being planned to test the perception of liveliness in these speakers as they
used the two languages.
The corpus described in this chapter could
be augmented by a small number of speakers
over the period of several terms and could provide a wealth of further opportunities for language study. Comparison of the English and
Swedish transcripts will allow examination of
aspects such as how the speakers use pitch
movement in utterances that are comparable
content-wise. This could provide insight into
transfer of Swedish intonational patterns to
English. It is possible that with more speakers,
statistically significant differences in PVQ
could still be found. The differences in mean
speaking rate should also be further investigated; the 20% difference found in this group
would be interesting to pursue. Does the average Swedish speaker of English manage to say
only 80% of what a native speaker can say during the allotted time at a conference? Documenting such information about first and second language use would give valuable evidence
for those in positions of developing language
policy.
References
Hincks, R. (2003). Tutors, tools and assistants
for the L2 user. Phonum 9: 173-176, Umeå
University Department of Philosophy and
Linguistics.
Hincks, R. (2004). Standard deviation of F0 in
student monologue. Proceedings of Fonetik
2004, Stockholm, Department of Linguistics, Stockholm University.
Hincks, R. (forthcoming). Measures and perceptions of liveliness in student oral presentation speech: a proposal for an automatic
feedback mechanism. Accepted for publication in System.
Kormos, J. and M. Dénes (2004). Exploring
measures and perceptions of fluency in the
speech of second language learners. System
32: 145-164.
Mennen, I. (1998). Can language learners ever
acquire the intonation of a second language?
Proceedings of STiLL 98, Marholmen, Sweden, KTH Department of Speech, Music and
Hearing.
Pickering, L. (2004). The structure and function of intonational paragraphs in native and non-native speaker instructional discourse. English for Specific Purposes 23: 19-43.
Sjlander, K. and J. Beskow (2000). WaveSurfer: An open source speech tool. Proceedings of ICSLP 2000,
http://www.speech.kth.se/snack/.
Acknowledgements
My thanks to David House, the student speakers and especially to teacher Beyza Björkman, whose encouragement was important in
getting five volunteers for this study. This work
was funded by the Unit for Language and
Communication.
Abstract
Translation Studies are a subfield of Applied Linguistics concerned with the scientific study of translation and interpreting in their various media and forms: oral vs. written, simultaneous vs. consecutive, literary vs. technical, human vs. machine, direct vs. relais, remote vs. in situ, etc. While linguistics in general has a long tradition of both theoretical and experimental research into various aspects of translation and interpreting, the phonetics and phonology of this specialized form of intercultural communication have, until very recently, attracted little attention within the scientific community. The purpose of this paper is to summarize some recent findings of this research and to indicate potential directions for further studies into the phonetics and phonology of translation.
Interpreting
Interpreting, both in the simultaneous and in the
consecutive mode, involves linguistic choices
that have to be made by the interpreter at all
levels of language processing at a time when
the source language text, mostly oral speech,
has yet to be completed. Contents and structure, topic and focus, verbal references, phrasal attachments, presuppositions and often even the actual goals and intentions of the speaker may, at the very extreme, be entirely unresolved at the time of the original utterance when the interpreter has to perform.
On the other hand, empirical studies of the time constraints of simultaneous interpreting show that the décalage, i.e. the time delay between the source language input by the original speaker and the target language output by the interpreter, that is acceptable to normal listeners should not exceed two to three seconds.2
This double bind forces the professional interpreter at the very least to develop, in addition to his or her linguistic, mnemonic and anticipatory skills, a high degree of vocal and articulatory expertise in order to be able to continuously adjust to the speech rate properties and vocal demands of the actual situation.
In addition, as shown among others by Goldmann-Eisler (1972), Černov (1978), Shlesinger (1994) and Ahrens (2004), professional inter
Translation
Translation1, in the narrow sense, covers all
forms of the transfer of meaning from a source
language text into one or more target languages.
Both written and oral texts are translated, as
long as they are presented as a whole in a fixed,
finished and thus permanent form.
Clearly, the translation of written texts does not
normally involve any choices at the phonetic or
phonological level. However, as shown among
others by Paz (1971), Lefevere (1975), Kohlmayer (1996) and Weber (1998), expressive
texts including poetry, lyrics and drama, but
also scripted speech, video narrations and advertisements need to be translated in view of
their readability and their potential use in stage
performance. The successful transfer of rhyme,
rhythm, pausing patterns, alliteration, accentuation and word play, based on the segmental
and/or suprasegmental qualities of lexical and
phrasal units, will often be crucial to the usefulness
References
Ahrens B. (2004) Prosodie beim Simultandolmetschen. Frankfurt am Main: Peter Lang, Europäischer Verlag der Wissenschaften.
Černov G.V. (1978) Teorija i praktika sinchronnogo perevoda. Moskva: Meždunarodnye otnošenija.
Fodor I. (1976) Film Dubbing: Phonetic, Semiotic, Esthetic and Psychological Aspects. Hamburg: Buske.
Gile D. (1995) Regards sur la recherche en interprétation de conférence. Lille: Presses Universitaires.
Goldmann-Eisler F. (1972) Segmentation of Input in Simultaneous Translation. Journal of Psycholinguistic Research 1/2, 127-140.
Huber D. (1990) Prosodic transfer in spoken language interpreting. Proceedings of the International Conference on Spoken Language Processing, ICSLP 90 (Kobe, Japan), 509-512.
Kohlmayer R. (1996) Oscar Wilde in Deutschland und Österreich. Untersuchungen zur Rezeption der Komödien und zur Theorie der Bühnenübersetzung. Tübingen: Niemeyer.
Notes
1
Not to complicate matters, I neglect to include a lengthy discussion of the various uses and ambiguities of the term translation in this and other scientific disciplines such as physics, biogenetics, economics, theology, history (cf. translatio imperii) and others. Suffice it to say that even within the restricted scope of translation studies per se, translation as a scientific term is used somewhat incoherently, both in the narrow sense as translation proper (översättning, Übersetzung, traduction) and, in the wide sense, as the generic term to cover the whole field of translation and interpreting (tolkning, Dolmetschen, interpretation).
2
Abstract
Automatic speech recognition measures have been investigated as scores of segmental pronunciation quality. In an experiment, context-independent hidden Markov phone models were trained on native English and Swedish read child speech, respectively. Among the various scores studied, a likelihood ratio between the score of forced alignment using English phoneme models and the score of English or Swedish phoneme recognition had the highest correlations with human judgments. The best measures have the power of evaluating the coarse proficiency level of a child but need to be improved for detailed diagnostics of individual utterances and phonemes.
Theory
An approximation in this work is that pronunciation quality is composed of two components: knowledge and ability. The knowledge component reflects the speaker's knowledge of the correct phonetic transcription of a written text. The ability component reflects the speaker's ability to pronounce the phonemes of the target language correctly.
The knowledge score, Sk, can be formulated as a confidence measure that the speaker has chosen the correct transcription, TrCorrect, in his spoken utterance (U) of the written text. This is modeled by:
Introduction
Automatic evaluation of foreign language pronunciation presents possibilities for computer-assisted language learning as well as for prediction of speech recognition performance in a
non-native language. Although children constitute a very large portion of foreign language
learners, speech technology research in this
domain has previously been mainly focused on
adults. The current work has been produced as
part of the EU project PF-Star, in which one
goal was to assess the current possibilities of
speech technology for children.
Previous work has used the fact that the better you are at pronouncing the new language,
the more the utterances should resemble sounds
from the target language instead of the mother
tongue (Eskenazi, 1996; Matsunaga, Ogawa,
Yamaguchi and Imamura, 2003). However, the
pronunciation quality of read speech does not
depend solely on the ability to produce the phonemes correctly; it also depends on knowledge
of how the words should be pronounced. Spectral quality and time-related scores have shown
high correlation with human judgment
(Neumeyer, Franco, Digalakis and Weintraub,
2000; Cucchiarini, Strik and Boves, 2000).
The foreign language considered in this report is English and the mother tongue is Swedish and also Italian in some cases. The scoring
procedure used context-independent phoneme
  Sk = P(U | TrCorrect, T) / max_i [P(U | Tr_i, T)]
     = P(U | TrCorrect, T) / P(U | TrBest, T)          (1)

  Sa = P(U | TrCorrect, T) / P(U | TrCorrect, M)       (2)
if the correct phonetic transcription of the written text to be pronounced is known. M is the
set of mother language phoneme models. If the
  P(U | TrBest, T) / P(U | TrBest, M)                  (3)

  P(U | TrCorrect, T) / P(U | TrBest, M)               (4)
Implemented Scores
In this report we present results of the following
pronunciation score parameters:
Knowledge:
English forced alignment / English phoneme
recognition (EFA/EPR)
Ability:
English phoneme recognition / Swedish phoneme recognition (EPR/SPR)
Combined:
English forced alignment / Swedish phoneme recognition (EFA/SPR)
Fraction language use (FRAC); see below.
Rate of speech (ROS)
Utterance duration (DUR)
Experiments
Recognition performance tests and pronunciation evaluation experiments have been performed. Word recognition tests used the SVE
and ENG development sets, both containing
children of ages eight and nine only. The language model allowed any word to follow any
other with equal probabilities. The word insertion penalty was experimentally set to minimize
WER. In phoneme recognition tests the penalty
was non-optimized and equal to zero.
The English and Swedish phoneme models
were trained on ENG-tr and SVE1, respectively. The phoneme models consist of three
states. The 39 elements of the acoustic feature
vector are the 13 lowest mel frequency cepstral
coefficients (MFCC) including number 0, and
their first and second order time derivatives.
The mel filterbank is computed with 25 ms
Hamming window at a frame rate of 10 ms. The
output likelihood values of the recognizer are
logarithmic, which turns the implementation of
ratio between scores into subtraction instead of
division.
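The point that log likelihoods turn ratios into subtractions can be sketched directly. The numeric values below are invented for illustration; only the EFA/EPR/SPR score names come from the paper.

```python
def likelihood_ratio_score(log_p_num, log_p_den):
    """Likelihood ratio in the log domain.

    Since the recognizer outputs log likelihoods, the ratio
    between two scores becomes a subtraction of log scores.
    """
    return log_p_num - log_p_den

# Hypothetical log likelihoods for one utterance:
efa = -512.0   # English forced alignment
epr = -498.5   # English phoneme recognition
spr = -520.3   # Swedish phoneme recognition

knowledge = likelihood_ratio_score(efa, epr)  # EFA/EPR
ability = likelihood_ratio_score(epr, spr)    # EPR/SPR
combined = likelihood_ratio_score(efa, spr)   # EFA/SPR
```

A score near zero means numerator and denominator models fit the utterance about equally well; a strongly negative knowledge score means free phoneme recognition beat the forced alignment by a wide margin.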
The scores were measured in different ways,
including or excluding non-speech intervals before and after the utterance and optional pauses
between words. In this report, the presented
In FRAC, both language model sets are active in parallel and the score is the percentage
of English language models selected by the recognizer.
scores include non-speech intervals, which generally performed better. Several other combinations of scores, models and normalization techniques have been studied by Oppelstrup (2005).
The pronunciation scores were correlated with human judgment of the utterances. The SWENG and ITEN speech files were scored by an English teacher with phonetic training. Segmental and prosodic qualities were judged separately. Each utterance was scored on a three-grade scale: 3 for correct pronunciation, 2 for small errors and 1 for erroneous utterances. To get a grade per child, the average of all graded sentences was calculated. At the time of the experiments, the ENG database had no human judgments but was given the score 3, assuming that all English children had a correct pronunciation. Afterwards, judgments have been made also on the English children and some results including these are given in this report.
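The evaluation procedure described here (average the utterance grades per child, then correlate the automatic scores with the human grades) can be sketched as follows. This is a generic Pearson correlation, not the authors' actual implementation.

```python
from statistics import mean, stdev

def grade_per_child(utterance_grades):
    """Average of all graded sentences for one child (3/2/1 scale)."""
    return mean(utterance_grades)

def pearson_r(xs, ys):
    """Pearson correlation between automatic scores and human grades."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))
```

Averaging over a child's utterances before correlating is what raised the correlations in the study, since single-utterance scores are noisy.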
Table 3. Correlations between automatic scores and human judgment of segmental quality per test set (S = Swedish, I = Italian, E = English children).

  Score      S      I      S+I    S+I+E
  EFA/EPR    0.20   0.61   0.65   0.70/0.75
  EPR/SPR    0.18   0.26   0.56   0.80/0.68
  EFA/SPR    0.20   0.37   0.60   0.82/0.72
  FRAC       0.22   0.09   0.48   0.82/0.71
  ROS        0.18  -0.25   0.43   0.57/0.42
  DUR       -0.11  -0.12  -0.54  -0.47/-0.47
Results
Word and phoneme recognition
Results of the word and phoneme recognition experiments are shown in Tables 1 and 2, respectively. The error rates are generally quite high, which is not surprising considering the combined difficulties of child, non-native speech from different databases and a high-perplexity language model.
Table 1. Word error rates for the Swedish and English test sets.

  Test    Voc. size  Training  WER %
  SVE2    1051       SVE1      94
  ENG-te  1097       ENG-tr    54
  SWENG   617        ENG-tr    79
  ITEN    629        ENG-tr    85
Table 2. Phoneme error rates (%) for the different training and test sets.

  Training   SVE2   ENG-te   SWENG   ITEN
  ENG-tr      97      66      103     119
  SVE1        72      92       86      93
Pronunciation scoring
Correlation with human judgment was low for single utterances but increased when averaging the scores of all utterances by a child. Table 3 lists correlations between automatic scores and human judgment of segmental quality.
Figure 1. Automatic vs. human judgment of Swedish, Italian and English children for EFA/EPR (top) and EPR/SPR (bottom).
Discussion
The low accuracy of word and phoneme recognition even for native English children indicates that there is low discrimination between the phoneme models. Child speech recognition is very difficult in itself, and the small size of the training material allowed only context-independent phoneme models to be trained. Another difficulty was the varying recording conditions in the databases. These problems make the pronunciation evaluation uncertain.
Another fact that may lower the correlation with human judgment is that the human listener and the scoring algorithms have different targets for correct pronunciation. Whereas the human listener is likely to compare with neutral British English, the reference models for the system are trained on children with different regional accents.
As was expected from previous studies, correlation increased with the amount of available
data. Correlation for single utterances was
lower than for averages of all utterances from a
speaker.
The correlation within the Swedish children
was quite small. An explanation may be that the
scoring algorithms are not sensitive to the limited pronunciation variation in this group. The
correlation among Italians is larger and the
highest overall correlation is achieved when
including children from all the Italian, Swedish
and English sets. It is interesting to note that the Swedish phoneme models seemed to work equally well as mother-tongue models for Italian children as for Swedish children.
A separate procedure will probably be necessary to reject utterances that match poorly to both numerator and denominator models in the likelihood ratios, since the likelihood ratio of these utterances will be quite random.
Acknowledgement
This work was conducted as a Master of Science thesis at TMH, KTH, Stockholm, within the EU project PF-STAR, Preparing Future Multisensorial Interaction Research. The human pronunciation judgments were performed by Becky Hinks.
References
Blomberg, M. and Elenius, D. Collection and recognition of children's speech in the PF-Star project. Phonum 9, Dept. of Philosophy and Linguistics, Umeå University, 2003.
Cucchiarini, C., Strik, H. and Boves, L. Different aspects of expert pronunciation quality ratings and their relation to scores produced by recognition algorithms. Speech Communication, Vol 31, pp 109-119, 2000.
Eskenazi, M. Detection of foreign speakers' pronunciation errors for second language learning: preliminary results. Proc. of ICSLP 96, vol 3, 1996.
Iskra, D., Grosskopf, B., Marasek, K., van der Heuvel, H., Diehl, F. and Kissling, A. Speecon speech databases for consumer devices: Database specification and validation. Proc. of ICSLP 02, 2002.
Matsunaga, S., Ogawa, A., Yamaguchi, Y. and Imamura, A. Non-native English speech recognition using bilingual English lexicon and acoustic models. Proc. of ICME 03, pp 625-628, 2003.
Neumeyer, L., Franco, H., Digalakis, V. and Weintraub, M. Automatic scoring of pronunciation quality. Speech Communication, Vol 30, pp 83-93, 2000.
Oppelstrup, L. Speech Recognition used for Scoring of Children's Pronunciation of a Foreign Language. M.Sc. Thesis, TMH/KTH, Stockholm, 2005.
Conclusion
The best scoring functions correlate well enough with human judgments to allow coarse grading of a child's pronunciation quality. The context-independent models used are too insensitive, however, to allow scoring on the utterance or phoneme level. The most important improvement would be to use context-dependent phoneme models, trained on a large corpus with recordings of children with correct pronunciation and accent.
Abstract
This is a preliminary report of a study of some linguistic and interactive aspects available in an adult-child dyad where the child is partially hearing impaired, during the ages 8-20 months. The investigation involves a male child born with Hemifacial Microsomia. Audio and video recordings are used to collect data on child vocalization and parent-child interaction. Eye-tracking is used to measure eye movements when the child is presented with audio-visual stimuli. SECDI forms are applied to observe the development of the child's lexical production. Preliminary analyses indicate increased overall parental interactive behaviour. As babbling is somewhat delayed due to physical limitations, sign-supported Swedish is used to facilitate communication and language development. Further collection and analysis of data is in progress in search of valuable information about the child's linguistic development from a pathological perspective of language acquisition.
Introduction
The typical linguistic development during infancy can be regarded as the result of the interaction between biological and environmental
factors that leads the child's language to converge with the surrounding language. According to
the Ecological Theory of Language Acquisition
(Lacerda et al., 2004a), early language acquisition is an emergent consequence of multi-sensorial embodiment of the information available
in ecological adult-infant interaction settings.
In agreement with this theory, the basic linguistic referential function emerges from at least
two of the sensory dimensions available in the
speech interaction scene (Lacerda, 2003;
Lacerda, Gustavsson & Svärd, 2003). If there
are restraining biological conditions or a lack of
adequate interaction with the environment, the
child's linguistic development will generally
deviate from the expected age-dependent communicative competence. Under typical circumstances, a one-year-old child starts to use
adult-like word forms. By two years of age, the
Method
A Swedish mother is recorded monthly while
spontaneously interacting with her child. On
two occasions the father has participated during
the recording substituting the mother.
Subject
The subject is a Swedish male infant, followed from the
age of 8 months to 20 months together with his mother
and father. The child was born with Hemifacial
Microsomia, i.e., he was born without a left outer
and middle ear and without zygomatic or mandible bone
structure on the left side of the face. He also
has a slightly cleft soft palate and a split uvula.
The child was fed by sub-glottal probe until
seven weeks of age and by nasal probe up to 8
months of age. The boy has one older sister.
Recording sessions
Recording sessions take place in a laboratory at
the Department of Linguistics, Stockholm University. The mother receives a selection of toys,
with verbal instructions indicating the significance of using onomatopoetic sounds when appropriate.
Procedure
A digital video camera, Panasonic NV-DS11,
focusing on the boy and his parent was used.
Both parent and child were recorded by a
Fostex CR200 Compact Disc Recorder, with
wireless microphones, Sennheiser Microport
Transmitters, attached to their clothes, connected to a Sennheiser Microport Receiver
EMI1005. Audio-visual perception is studied
by Tobii (www.tobii.com), an eye-tracking system.
References
Anvil: www.dfki.de/~kipp/anvil
Bahrick, L. E. (2004). The development of perception in a multimodal environment. In
G.Bremmer & A. Slater (Eds.), Theories of
Infant Development (1st ed., pp. 90-120).
Oxford: Blackwell.
Brent, M.R. & Siskind, J.M. (2001) The role of
exposure to isolated words in early vocabulary development. Cognition, 81, B33-B44.
Dunn, L.M. & Dunn, L.M. (1981) Peabody Picture Vocabulary Test Revised. American
Guidance Service. Circle Pines, Minnesota.
Eriksson M. & Berglund, E. (1999) Swedish
early communicative development inventories: words and gestures. First Language,
19, 55-90.
Fernald, A., Taeschner, T., Dunn, J., Papousek,
M., de Boysson-Bardies, B. & Fukui, I.
(1989) A cross-language study of prosodic
modifications in mothers' and fathers'
speech to preverbal infants. Journal of Child
Language, 16, 477-501.
Fenson, L., Dale, P.S., Reznick, J.S., Thal, D.,
Bates, E., Hartung, J., Pethick, S. & Reilly,
J. (1993). The MacArthur Communicative
Development Inventories: User's guide and
technical manual. San Diego, CA: Singular.
Lacerda, F. (2003) Phonology: An emergent
consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal, 16, 41-59.
Lacerda, F., Gustavsson, L. & Svärd, N. (2003)
Implicit linguistic structure in connected
speech. PHONUM 9, Umeå, Sweden, 69-72.
Lacerda, F., Klintfors, E., Gustavsson, L.,
Lagerkvist, L. Marklund, E. & Sundberg, U.
(2004a) Ecological Theory of Language
Acquisition. In Berthouze, L., Kozima, H.,
Prince, C., Sandini, G., Stojanov, G., Metta,
G. & Balkenius, C. (Eds.) Proceedings of the
Fourth International Workshop on Epigenetic Robotics (EPIROB 2004), Lund University Cognitive Studies, 117, 147-148.
Lacerda, F., Marklund, E., Lagerkvist, L., Gustavsson, L., Klintfors, E. & Sundberg, U.
(2004b) On the linguistic implications of
context-bound adult-infant interactions. In
Berthouze, L., Kozima, H., Prince, C.,
Sandini, G., Stojanov, G., Metta, G. & Balkenius, C. (Eds.) Proceedings of the Fourth
International Workshop on Epigenetic Robotics (EPIROB 2004), Lund University
Cognitive Studies, 117, 149-150.
Conclusion
As a consequence of the child's congenital
physical handicap, the mother's interactive behavior seemed to increase. The child's verbal
production is impaired but steadily improving.
Passive verbal language seems to be present,
and an active form of verbal language with well-articulated words will probably come in time as
the physical impediments are attended to. Further collection and analysis of data will
hopefully give valuable information on linguistic
development from a pathological perspective of
language acquisition.
Abstract
On the basis of previous, small-scale analyses, it was hypothesized that most Swedish
children develop an adult-like command of
quantity-related durational patterns between
18 and 24 months of age. In this study, VC
structures produced in stressed position by
several Swedish 18- and 24-month-olds were
analyzed for durational correlates of the complementary V:C and VC: quantity patterns.
Durations were typically reminiscent of the
adult norm suggesting that, at 18 months of
age, Swedish children have acquired a basic
command of the durational correlates of the
quantity contrast. In consequence, quantity
development must start well ahead of that
age. It was also found that voicing had a considerable, adult-like effect on the duration of
postvocalic consonants at both ages. This effect was smaller in the American controls,
again indicating the presence of a language-specific phonetic pattern. The effect of voicing
on preconsonantal vowel duration was relatively moderate. The effect was also present
in the American 24-monthers, but less substantially than commonly observed in adults
speech. No significant voicing-induced vowel
lengthening effect was found in the American
18-monthers.
Methods
Subjects were drawn from a larger group of
Swedish and American English children at
the ages of 6, 12, 18, 24 and 30 months. Audio and video recordings were made as described in Engstrand et al. (2003). These recordings were subsequently digitized and
stored on DVD disks. The present study is
based on disyllabic words produced by
11 Swedish 18-monthers
11 Swedish 24-monthers
14 American 18-monthers
13 American 24-monthers.
Results
In this section, durational tendencies are reported; a full statistical treatment will be presented elsewhere.
Mean vowel-to-consonant durational ratios
pertaining to the Swedish 18- and 24-monthers are summarized in table 1 for the
long vowels (the V:C pattern) and in table 2
for the short vowels (the VC: pattern). For
example, table 1 shows that the V:/C average
ratio was 1.86 for the 18-month-olds, and that
this average was based on a total of 43 productions. Eight of the 11 18-month-olds produced measurable instances of the V:/C pattern. One child turned out to be a far outlier
and was left out. The inclusion of 7 out of 11
child averages is marked as 7(11) in the right
column of the table. The mean value for the
24-month-olds was similar, 1.96, but this
mean was based on more (124) productions.
This means that, on average, the long vowels
had almost twice the duration of the following
consonants. For both age groups, in other
words, the durational relationships are not far
from what can be observed in adult speech
(e.g., Elert 1964).
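As a minimal sketch of the ratio computation described above (the duration values below are hypothetical illustrations, not data from the study):

```python
def mean_vc_ratio(tokens):
    """Mean vowel-to-consonant duration ratio over (vowel_ms, consonant_ms) tokens."""
    ratios = [v / c for v, c in tokens]
    return sum(ratios) / len(ratios)

# Hypothetical V:C productions (long vowel, following consonant), durations in ms.
long_vowel_tokens = [(260, 140), (250, 130), (270, 150)]
print(round(mean_vc_ratio(long_vowel_tokens), 2))
```

Each child's mean is computed over that child's measurable tokens; the group values reported in the tables are then averages over children.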
Table 1. Mean vowel-to-consonant ratios for disyllabic words with long vowels produced by
Swedish 18- and 24-month-olds.
Age (mos.)    # children
18            7(11)
24            9(11)
Table 2. Mean vowel-to-consonant ratios for disyllabic words with short vowels produced by
Swedish 18- and 24-month-olds.
Age (mos.)    # children
18            11(11)
24            10(11)
The corresponding data for the 18-month-olds are presented in figure 2. Again, filled
and unfilled circles stand for V:C and VC:
Table 3. Mean vowel-to-consonant ratios for disyllabic words with long or short vowels followed by voiced (Vd) or voiceless (Vl) consonants. Subjects: Swedish 18- and 24-month-olds.

Age (mos.)  Vowel length  Cons. type  Ratio  Std.  N
18          Long          Vd          2.08   0.67  30
18          Long          Vl          1.18   0.54  13
18          Short         Vd          1.28   0.65  75
18          Short         Vl          0.51   0.18  72
24          Long          Vd          2.29   0.62  65
24          Long          Vl          1.53   0.59  50
24          Short         Vd          1.11   0.16  103
24          Short         Vl          0.70   0.31  105
Table 4. Mean vowel-to-consonant ratios for disyllabic words containing voiced or voiceless postvocalic consonants. Subjects: American 18- and 24-month-olds.

Age (mos.)  Cons. type  Ratio  Std.  N
18          Voiced      1.70   0.68  83
18          Voiceless   1.23   0.37  51
24          Voiced      2.01   0.91  128
24          Voiceless   1.31   0.79  77
[Table: mean durations (ms), standard deviations and N for long and short vowels and their following voiced or voiceless consonants (Swedish 18- and 24-month-olds); cell values garbled in the source.]
Acknowledgment
This work was supported by grant 421-2003-1757 from the Swedish Research Council
(VR) to O. Engstrand.
Age (mos.)        Voiced          Voiceless
                  V      C        V      C
18    Mean        153    116      166    152
      Std         35     37       83     51
      N           22              17
24    Mean        199    115      161    143
      Std         61     34       46     49
      N           25              28
Abstract
We report auditory observations on /r/ approximations produced by 11 Swedish 2-yearolds. About half of the 1291 expected instances
of /r/ were either realized as vocoids or just
dropped. Most of the contoid realizations were
approximants or fricatives whereas taps, flaps,
trills, laterals, nasals and stops occurred marginally. This was roughly consistent with the
phonetic norms for the ambient language. The
most frequently occurring places of articulation
were coronals, palatals and, to some extent,
glottals. Some of this place variation could be
explained in terms of number of attempted
word types suggesting that both vocabulary
size and ambient-language-like /r/ productions
constitute different aspects of linguistic maturity in young children.
Introduction
Rhotics (r sounds) are well known for their unusually wide range of variation in terms of
manner and place of articulation (e.g., Lindau
1985, Ladefoged and Maddieson 1996, Demolin 2001). More or less common variants are
voiced or voiceless vocoids, approximants,
fricatives, trills, taps and flaps produced at various places of articulation. Whereas the rhotics
tend to occupy the liquid slot adjacent to the
syllable nucleus and, thus, have a common distribution in terms of the sonority hierarchy
(Jespersen 1904), they clearly lack an invariant
acoustic basis (such as a lowered F3; see, e.g.,
Lindau 1985). To be sure, the rhotics may be
phonetically related in terms of a Wittgensteinian family resemblance such that /r/1 resembles /r/2 that resembles /r/3 and so on up to
/r/n; however, /r/1 and /r/n may not have a single
phonetic property in common (Lindau 1985).
But even though the family resemblance metaphor may characterize relationships between
category members in an interesting way, it does
not serve to delimit the category in the first
place.
The apparent lack of unity among the rhotics is bound to cause trouble on children's path
to spoken language. However, our knowledge
Methods
The subjects used in this study were 11 normally developing Swedish 24-monthers, 6 girls
and 5 boys. The children were drawn from a
larger group of approximately 60 Swedish and
60 American English children at the ages of 6,
12, 18, 24 and 30 months. Subjects and recordings were described in detail in Engstrand
et al. (2003). In summary, audio and video recordings were made in nursery-like, sound-treated rooms in Stockholm and Seattle. All
children were accompanied by a parent (usually
the mother). Both parents were representative
of the regional standard spoken in the Stockholm and Seattle areas, respectively. The audio
and video signals were subsequently digitized
and stored on DVD disks.
Results
Out of the 1291 expected instances of /r/, 613
(47 percent) were realized as contoids. Whether
the remaining /r/s had a vocoid realization or
were just dropped was often hard to determine
reliably. At a rough estimate, however, approximately 10 percent were vocoids and 43
percent were dropped. The distribution of manners and places of articulation for the contoid
realizations is shown in table 1.
Table 1. Distribution of manners and places of articulation across the subject group (percent of all contoid
/r/ realizations, N=613).
                 Approx   Fricative   Tap or flap
Labio-dental      0.0       0.3         0.0
Dental/alveolar  23.2       5.2         4.2
Retroflex         5.7      15.3         0.0
Palatal          25.8       0.8         0.0
Velar/uvular      0.2       1.3         0.0
Glottal           0.0       9.3         0.0
Total            54.8      32.3         4.2

[The Lat. approx, Stop, Nasal, Trill and row-total columns could not be reliably recovered from the source.]
Table 2. Mean percentages, number of subjects and ranges of variation for the respective manner/place
combinations. Phonetically unlikely or impossible sound types are marked with an asterisk.
                      Approx    Fricative   Tap/flap   Lat. appr  Stop      Nasal     Trill
Labio-     Mean       0.0       0.4         0.0        *0.0       0.0       0.0       *0.0
dental     # subj.    0         1           0          0          0         0         0
           Range      0         0           0          0          0         0         0
Alveolar/  Mean       12.6      5.4         2.0        3.7        1.4       6.8       0.1
dental     # subj.    8         5           3          7          4         3         2
           Range      2.3-4.2   2.4-40.9    2.4-16.7   1.6-7.9    1.1-6.7   3.3-66.7  0.5-0.8
Retroflex  Mean       2.1       5.5         0.0        0.0        0.5       0.0       0.1
           # subj.    4         2           0          0          1         0         1
           Range      1.7-17.6  28.3-32.1   0          0          0         0         0
Palatal    Mean       34.8      1.0         0.0        0.0        0.0       0.0       *0.0
           # subj.    9         2           0          0          0         0         0
           Range      0.5-71.1  1.7-9.8     0          0          0         0         0
Velar/     Mean       0.2       1.5         0.0        0.0        0.0       0.0       0.0
uvular     # subj.    1         2           0          0          0         0         0
           Range      0         7.3-9.1     0          0          0         0         0
Glottal    Mean       *0.0      16.3        *0.0       *0.0       5.4       *0.0      *0.0
           # subj.    0         10          0          0          7         0         0
           Range      0         0.8-83.3    0          0          0.8-33.3  0         0

*Unlikely or impossible sound types.
lars and retroflexes), glottals and dorsals (palatals, velars and uvulars), respectively. The
straight lines, which are linear approximations to the data points for the respective
types, suggest an increase in the number of
coronal /r/ realizations as a function of the
number of produced word types (r = 0.70). In
contrast, the glottal realizations exhibit the
opposite, negative trend (r = -0.63). For the
dorsals, the effect is negligible (r = -0.18). In
this sense, then, children who displayed a larger /r/ vocabulary seemed to conform phonetically to the ambient language to a higher
degree than did children with a smaller /r/ vocabulary.
Acknowledgments
This work was supported by grant 2003-8460-14311-29 from the Swedish Research
Council (VR) to O. Engstrand.
Notes
1. Names in alphabetical order.
References
Demolin D. (2001) Some phonetic and phonological observations concerning /R/ in
Belgian French. In H. Van de Velde and
R. van Hout (eds.), r-atics. Sociolinguistic, phonetic and phonological characteristics of /r/. Etudes & Travaux, Institut des
langues vivantes et de phonétique, Université Libre de Bruxelles, 63-73.
Engstrand O., Williams K. and Lacerda F.
(2003) Does babbling sound native? Listener responses to vocalizations produced
by Swedish and American 12- and 18-month-olds. Phonetica 60, 17-44.
Fox A.V. and Dodd B.J. (1999) Der Erwerb
des phonologischen Systems in der
deutschen Sprache. Sprache-Stimme-Gehör 23, 183-191.
Jespersen O. (1904) Phonetische Grundfragen. Leipzig and Berlin: Teubner.
Ladefoged P. and Maddieson I. (1996) The
Sounds of the World's Languages. Oxford: Blackwell.
Lindau M. (1985) The story of /r/. In Fromkin
V.A. (ed) Phonetic Linguistics: Essays in
Honor of Peter Ladefoged, 157-168. Orlando: Academic Press.
Muminovic D. and Engstrand O. (2001) /r/ in
some Swedish dialects: preliminary observations. Working Papers (Dept. of
Linguistics, Lund University) 49, 120-123.
Vihman M.M. (1993) Variable paths to early
word production. Journal of Phonetics 21,
61-82.
Conclusions
Auditory observations on 11 Swedish 2-year-olds have shown a high degree of variation in
the phonetic realization of /r/. On the whole,
however, approximants and fricatives constituted the dominating manners of articulation,
whereas taps, flaps, trills, laterals, nasals and
stops occurred marginally. This is roughly in
accordance with the phonetic norms for the
ambient language (cf. Muminovic and Engstrand 1991 for similar patterns in a number
of Swedish dialects). The most frequent
places were coronal, palatal and, to some extent, glottal. The glottal articulations were
counter to expectations since they are foreign
to central Swedish. So are [j]-like /r/ realizations, but these were nevertheless expected
from informal observations. Some of the
place variation could be explained in terms of
vocabulary size in the sense of number of
attempted /r/ words. It may be that vocabulary
Abstract
F0 measurements were made of disyllabic
words produced by several Swedish and American English 18- and 24-month-olds. The Swedish 24- and 18-monthers produced accent contours that were similar in shape and timing to
those found in adult speech. The Swedish 18-monthers, however, produced very few words
with the acute accent. It is concluded that most
Swedish children have acquired a productive
command of the word accent contrast by 24
months of age and that, at 18 months, most children display clear tonal ambient-language effects. The influence of the ambient language is
evident in view of the F0 contours produced by
the American English children whose timing of
F0 events tended to be intermediate between the
Swedish grave and acute contours. The relative
consistency with which grave accent contours
were produced by the Swedish 18-monthers
suggests that some children are influenced by
the ambient language well before that age.
Methods
Subjects were drawn from a larger group of approximately 60 Swedish and 60 American English children at the ages of 6, 12, 18, 24 and 30
months. Audio and video recordings were made
as described in Engstrand et al. (2003). The
present study is based on recordings of
11 Swedish 24-month-olds (6 girls, 5 boys)
13 American 24-month-olds (6 girls, 7
boys)
11 Swedish 18-month-olds (6 girls, 5 boys)
16 American 18-month-olds (9 girls, 7
boys),
i.e., a total of 51 children, including the 24-monthers used in Engstrand and Kadin (2004).
All disyllabic words with stress on the first syllable were analyzed according to criteria described in Engstrand and Kadin (2004). F0 was
measured at five points in time: at 1) the acoustic onset of V1, 2) the F0 turning-point in V1 (if
the F0 contour was monotonic throughout the
vowel, the turning-point was assigned the value
of the onset), 3) the acoustic offset of V1, 4) the
acoustic onset of V2, and 5) maximum F0 in V2
(if F0 declined throughout the vowel, maximum
F0 was assigned the value of the onset). A Fall
parameter was defined as the F0 difference between V1 turning-point and offset, and a Rise
parameter was defined as the F0 difference between V2 maximum and V1 offset. All measurements were made using the Wavesurfer program package.
Introduction
Swedish has a contrast between a grave and
an acute tonal word accent. The acute accent
is associated with a simple, one-peaked F0 contour. The grave accent typically has a two-peaked F0 contour with a fall on the primary
stress syllable and a subsequent rise towards a
later syllable in the word (Bruce 1977, Engstrand 1995, 1997).
A preliminary report on Swedish children's
acquisition of the word accents was presented
in Engstrand and Kadin (2004). The results,
which were based on 6 Swedish and 6 American English 24-month-olds, suggested that, at
that age, Swedish children are well on the way
to establishing a productive command of the
accent contrast. The present study was carried
out to test this preliminary conclusion using
additional subjects. In addition, previous studies (Engstrand et al. 1991, Engstrand et al.
2003) have suggested that most 17-18-month-olds display a much less consistent use of the
Results
Auditory judgments first suggested that a majority of the words produced by the Swedish
children (both 18- and 24-month-olds) had a
grave-like tonal contour and that, in general,
Measurement results are summarized in tables 1-5 (a full statistical treatment will be reported elsewhere). The tables present means
and standard deviations for the Fall and Rise
parameters. The bottom line of each table
shows grand means and standard deviations. In
the left column, SW and AM represent Swedish
and American English, respectively, 18 and 24
indicate the respective ages in months, and F
and M stand for sex (female or male). The last
figure is a reference number that identifies the
individual child. Thus, for example, SW24F1
stands for a Swedish 24 months old girl with
the reference number 1.
GRAVE ACCENT (Swedish 18-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
SW18F1       69       62     102      121     21
SW18F2       60       29     69       40      22
SW18F3       62       45     70       66      19
SW18F4       113      88     93       129     26
SW18F5       72       41     99       124     23
SW18F6       77       38     127      125     12
SW18M1       102      141    392      286     2
SW18M2       59       42     70       84      23
SW18M3       101      122    106      136     23
SW18M4       70       59     42       78      21
SW18M5       60       35     50       44      23
Grand mean   76       64     111      112     215

GRAVE ACCENT (Swedish 24-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
SW24F1       41       44     139      180     2
SW24F2       59       41     94       67      14
SW24F3       30       28     1.0      65      4
SW24F4       64       17     33       24      4
SW24F5       114      107    52       219     7
SW24F6       27       18     70       98      2
SW24M1       43       46     58       85      17
SW24M2       23       38     45       44      4
SW24M3       39       33     38       134     6
SW24M4       35       29     83       34      13
SW24M5       80       93     -3.0     27      2
Grand mean   50       45     55       89      75
Swedish grave words produced by the 24-month-olds consistently displayed positive values for both the Fall and the Rise parameters
(table 1). This means that 1) F0 declined from a
turning-point in the primary stress vowel reaching a relatively low value at the end of that
vowel, and 2) rose to resume a relatively high
position in the second vowel resulting in a
two-peaked F0 contour.
Acute productions by the Swedish 18-monthers were too few to provide a basis for
reliable generalizations. However, parameter
values tended to differ from those for the grave
productions and to resemble those pertaining to
the 24-monthers.
ACUTE ACCENT (Swedish 24-month-olds)
Child        N
SW24F1       0
SW24F2       6
SW24F3       7
SW24F4       3
SW24F5       9
SW24F6       0
SW24M1       0
SW24M2       7
SW24M3       0
SW24M4       5
SW24M5       0
Total        37

[Fall (Hz) and Rise (Hz) means and standard deviations for the acute productions garbled in the source.]

AMERICAN ENGLISH (18-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
AM18F1       41       68     -47      67      5
AM18F2       19       27     84       171     6
AM18F3       0               -11              1
AM18F4       39       49     -33      190     10
AM18F5       30              52               1
AM18F6       27       24     -48      102     12
AM18F7       28       22     -27      18      4
AM18M1       11              159              1
AM18M2       36       36     24       73      8
AM18M3       36       25     25       83      10
AM18M4       44       78     4.8      79      5
Grand mean   28       41     17       98      63
The above tables have shown differences between accent types and ambient languages in
terms of the Fall and Rise parameters. Up to
this point, timing has been disregarded. However,
timing of F0 events in relation to segmental
structure is crucial as illustrated in figure 1. The
figure shows mean data for the Swedish and
American English children in both age groups
(symbols are explained in the figure legend).
Grave and acute productions are shown for the
Swedish 24-monthers. F0 values are time-aligned to the first measurement point, and the
data points are connected by smoothed lines
that bear a certain resemblance to authentic F0
contours. The measurement points correspond
to acoustic events as described above.
AMERICAN ENGLISH (24-month-olds)
Child        Fall (Hz)       Rise (Hz)
             Mean     SD     Mean     SD      N
AM24F1       69       82     -53      175     12
AM24F2       58       41     -4.5     45      16
AM24F3       33       23     -2.8     57      13
AM24F4       55       71     -80      81      15
AM24F5       32       31     -28      57      13
AM24F6       51       24     -2.9     76      7
AM24M1       20       27     -6.4     58      16
AM24M2       50       75     5.6      27      16
AM24M3       29       36     -40      82      16
AM24M4       93       149    -0.71    90      7
AM24M5       26       21     31       45      10
AM24M6       27       26     2.6      47      9
AM24M7       34              17               1
Grand mean   45       51     -10      70      151
months, many Swedish children begin to produce grave-like F0 contours and to mark the appropriate words with these contours. Engstrand
et al. (2003) reached a similar conclusion on
the basis of listening tests. Based on those studies as well as preliminary analyses of the present material, Engstrand and Kadin (2004) hypothesized that acquisition of the Swedish tonal
word accents typically takes place in the 18-24
months age interval. However, the relative consistency with which grave accent contours were
produced by the present 18-monthers would
suggest that some children are influenced by
the ambient language well before this age. This
is in agreement with results of listening tests
suggesting occasional grave-like tone contours
as early as at 12 months of age (Engstrand et al.
2003).
Acknowledgment
This work was supported by grant 2003-8460-14311-29 from the Swedish Research Council
(VR) to O. Engstrand.
References
Bruce G. (1977) Swedish Word Accents in
Sentence Perspective. Lund: Gleerup.
Engstrand O. (1995) Phonetic interpretation of
the word accent contrast in Swedish. Phonetica 52, 171-179.
Engstrand O. (1997) Phonetic interpretation of
the word accent contrast in Swedish: Evidence from spontaneous speech. Phonetica
54, 61-75.
Engstrand O., Williams K. and Strömqvist S.
(1991) Acquisition of the tonal word accent
contrast. Actes du XIIème Congrès International des Sciences Phonétiques, Aix-en-Provence, vol. 1, pp. 324-327.
Engstrand O., Williams K. and Lacerda F.
(2003) Does babbling sound native? Listener
responses to vocalizations produced by
Swedish and American 12- and 18-month-olds. Phonetica 60, 17-44.
Engstrand O. and Kadin G. (2004) F0 contours
produced by Swedish and American 24-month-olds: implications for the acquisition
of tonal word accents. Proceedings of the
Swedish Phonetics Conference held in Stockholm 26-28 May 2004, pp. 68-71.
Abstract
Previous investigations have proposed that
nasality in consonants is more perceptually stable than place of articulation in
constrained conditions. This paper investigates the progression of initial consonant
clusters from a reduced to an adult-like
form in terms of manner and place of articulation in the speech of children between the ages of 1;6 and 3;5. The results
show an earlier onset of stable production
of manner compared to place, both in
full clusters and in the reduced form. The
results are interpreted as evidence of the
importance of perceptual salience of segmental properties in the acquisition of initial
consonant clusters.
Introduction
Initial sC clusters occur frequently in
spontaneous Swedish (Bannert & Czigler
1999) and are therefore a predominant
feature of the ambient language of children learning Swedish. Previous reports
concerning children's productions of
word-initial sC clusters have
shown that early output forms of the second consonant of the cluster may have a
deviating phonetic quality compared to
the adult model form (see McLeod, van
Doorn and Reed (2001) for a summary of
discovered trends in consonant cluster acquisition).
In clusters with a plosive as the second
element, the reduced form may involve a
change in place of articulation caused by
the application of a hypothesized fronting
rule. Non-plosive consonants may, in addition, be substituted by a consonant with
a different manner of articulation compared to the target consonant, e.g.
through application of a stopping process.
For adult speakers, the articulatory features of place and manner of articulation
have been shown in the literature to be
correlated regarding their perceptual stability. In a study the perceptual response
Based on the tabulated progressions of
each target onset, the productions investigated in this study were extracted according to two criteria: 1) that the initial output forms produced by the child were not
produced in an adult-like manner in
terms of the feature set of the consonant,
and 2) that a progression should be observed in the data in terms of place or
manner of articulation. As a result of
these criteria, productions made by seven
subjects, three female and four male, were
selected for further analysis. The age
ranges of the investigated subjects at the
time of recording are presented in table 1.
Age of acquisition of stable and adult-like production was determined separately for the articulatory features place and
manner as well as for the complexity of
the syllable onset. Furthermore, the age of
adult-like production of the second element of the target cluster (stop or nasal
consonant) was established for place and
manner of articulation. For all investigated features, onset of a stable production
was determined to be the session at which
four out of the following five productions
were produced with the same value in the
investigated features.
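The four-out-of-five criterion can be sketched as a scan over the session-ordered productions (a hypothetical illustration; the function name and data are ours, not from the paper):

```python
def stable_onset(values, target, window=5, required=4):
    """Index of the first production from which at least `required` of the
    next `window` productions (inclusive) carry the target feature value."""
    for i in range(len(values) - window + 1):
        if sum(v == target for v in values[i:i + window]) >= required:
            return i
    return None

# Hypothetical place-of-articulation labels for successive productions.
places = ["dental", "velar", "dental", "dental", "velar",
          "dental", "dental", "dental", "velar", "dental"]
print(stable_onset(places, "dental"))  # prints: 2
```

The returned index marks the production at which the stability criterion is first met; `None` means the feature never stabilized within the sampled sessions.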
Method
Speech material
The data set investigated in the present
paper was extracted from a corpus consisting of 5311 productions collected in
order to investigate the development in
output forms of word-initial consonant
clusters in Swedish children between the
ages of 1;6 and 3;6. In this corpus, recordings were conducted on a monthly basis
in a sound-treated recording studio. The
target words were elicited by the accompanying adult using black-and-white picture prompts.
Procedure
A narrow transcription of the productions was subsequently produced by the
author. The transcribed segment labels
were then substituted by a feature vector
describing the segment in terms of articulatory features, including place and manner of articulation. Consonant segments
in the onset of the first syllable of the production were marked by their position in
the onset, and the progression of the target
words stå ([stʰo:]), snor ([sno:r]) and skal
([skA:l]) was tabulated according to target
word and subject's age. 159 productions
of skal, 132 productions of snor and 198
productions of stå were included in the material.
Table 1. First and last recording session for each subject.

Speaker    First session    Last session
F1         105              151
F2         109              158
F3         77               128
M1         79               130
M2         124              178
M3         90               142
M4         84               129
Results
The ages at which a stable production of
place and manner in the C consonant, as
well as in the full sC cluster, was reached
are presented in figure 1. For subjects F1, F2 and M3,
place and manner of articulation were stable in singleton consonants from the onset of sampling. The bottom circles of F1
[Two tables: sessions to stable production of manner and place for the clusters of skal, stå and snor, per speaker (F1-F3, M1-M4); cell values garbled in the source.]
Acknowledgements
The author would like to thank the children who participated in this study and
the children's parents for bringing the
children to the recording studio and for
participating in the elicitation of the target
words.
References
Bannert, R. and Czigler, P. E. (1999) Variations in consonant clusters in Standard
Swedish. PHONUM 7.
McLeod, S., van Doorn, J. and Reed, V. A.
(2001) Consonant Cluster Development in Two-Year-Olds: General
Trends and Individual Differences.
Journal of Speech, Language and Hearing Research 44, 1144-1171.
Miller, J. L. and Eimas, P. D. (1977) Studies
on the perception of place and manner
of articulation: A comparison of the
labial-alveolar and nasal-stop distinctions. Journal of the Acoustical Society
of America 61(3), 835-845.
Singh, S. (1971) Perceptual similarities and
minimal phonemic differences. Journal
of Speech and Hearing Research 14,
113-124.
Singh, S. and Black, J. W. (1966) Study of
twenty-six intervocalic consonants as
spoken and recognized by four language groups. Journal of the Acoustical
Society of America 39, 372-387.
Abstract
Aim
The aim of the project is to study the pronunciation problems of the students. The following
questions, among others, will be answered:
(1) What role does the first language (L1) play
in the learning of the target language's pronunciation? Special concern will be given to
each learner's dialect or regional variety of the
standard languages.
(2) What role does the pronunciation of the
second or third language play?
(3) Which phonological and phonetic targets
are easy and which are difficult, given the
structural similarities between the two languages?
(4) Which interplay between the various difficulties is to be discovered? Which implications
can be observed?
The answers to these questions will constitute the scientific basis on which new and well-adapted learning material can later be developed by didactic and pedagogical experts for both language groups.
Introduction
In research on adult second language learning it is agreed that, in the area of phonology and phonetics, a clear negative transfer (interference) can be observed, due to the influence of the first language (L1). However, this is not the only reason for foreign accent; interlanguage, too, plays an important role. Only recently has research paid attention to the special case of learning in a multiple language setting.
A few years ago, beginner courses in German were introduced at the academic level in
Sweden. In the German-speaking countries, due
to a long tradition, beginner courses in Swedish
attract many students. For both groups, the
teaching of pronunciation is allotted only a
small amount of time. As a consequence of this,
the learners' target language is characterized by
a strong accent which in most cases becomes
fossilised. In order to prevent this, a pronunciation programme should be constructed that is
tailored just for the special preconditions of the
learners, namely their first language (L1) and
the multiple contexts: all learners have already
learned at least one foreign language. An extraordinary challenge lies in the fact that German
and Swedish are linguistically very close to
each other. Therefore it should be rather easy to
help the learners to a good pronunciation right
from the start.
Research situation
Research on second language learning has centred on the question of whether the first language affects the target language. Experience tells us that transfer and interference do occur where pronunciation, i.e. the learning of phonology and phonetics, is concerned. While the literature abounds with studies of syntax and morphology, the learning of pronunciation has not been studied to the same extent. Hammarberg has studied certain aspects of Swedish as a second language (1985, 1988, 1997). He (1993) made a study of third language acquisition, investigating his co-author. Since the middle of the seventies, Bannert has done research on several aspects of learning Swedish pronunciation (1979a, b; 1980, 1984, 1990) and on the German sound system and prosody (1983, 1985, 1999).
A large and extensive survey "Optimizing
Swedish Pronunciation" (Bannert 2004) was
carried out in the late seventies in Lund. Swedish being the target language, the pronunciation
difficulties of 70 adult learners representing 25
first languages were studied. The survey also included German, represented by three learners from different regions: Northern Germany, Bavaria and Switzerland.
Consonants
Swedish: voiceless fricatives [ɕ, ɧ] tjugo (twenty), sju (seven); retroflexes [ʈ, ɖ, ʂ, ɳ, ɭ] fort (fast), bord (table), mars (March), barn (child), Karl (Charles).
German: voiceless fricatives [ʃ, ç] Schuh (shoe), mich (me); glottal stop [ʔ] Theater (theatre).
Due to a grant from Vetenskapsrådet, it was possible to conduct several initial pilot studies for the project. Recordings of students in Umeå and Freiburg were made and analysed. Students were interviewed about their introspection of their pronunciation difficulties, and think-aloud protocols were written. A demo version of the database (www.ling.umu.se/FIST) was programmed, showing the labels to be used. Socio- and psycholinguistic background variables were collected. Thus the project rests on secure ground.
Prosody
Swedish: two word accents (acute, grave): 'buren (the cage) - `buren (borne); focus accent manifested separately; complementary length of stressed vowel and consonant; stress pattern (speech rhythm).
German: short consonants, word accentuation, stress pattern (speech rhythm).
Phonological processes
Swedish: retroflexation of [r] + [t, d, s, n, l] across morpheme and word boundaries: mer tid (more time), har du (do you have), när som (whenever), har nu (have now).
German: final devoicing of [b, d, g] to [p, t, k]: Sieb (strainer), Rad (wheel), Steig (path); initial [s] to [z]: See (lake); [s] to [ʃ] in initial consonant clusters: Stein (stone), springen (jump); voicing of medial [s] to [z]: lesen (read); deletion of unstressed [e] in the endings -el, -en: Himmel (heaven), Zeiten (times); vocalisation of postvocalic [r] to [ɐ]: Wasser (water); assimilation of the place of articulation of [n] to [m, ŋ] after deletion of unstressed [e]: Lippen (lips), Banken (banks).
Theoretical approach
From long experience we know that phonological transfer is typical of the language learning
processes, especially of adult learners. This
characteristic phenomenon of foreign accent is
caused by the phonological system, including
orthography, of L1. However, with our student
groups, interferences from L2 and L3 must also
be responsible for the deviating pronunciation.
Furthermore, contributions of the learners' interlanguage (Selinker 1972) are to be expected.
Therefore each deviating feature in the performance of the students will be coded. Each
deviating sample in the speech signal will be
labelled according to these codes. Thus it is
easy to cross-search the whole material and carry out a thorough inspection and statistical analysis of the observations. This will allow us to make quantified statements about the learning processes.
Grapheme-Phoneme Relationships
Swedish: <o> signifies [u] and [o]: skola (school), sova (sleep); <å> always signifies [o]: måla (paint), åtta (eight); <g, k, sk> are palatalised to [j, ɕ, ɧ]: gift (poison), källa (well), skinka (ham); the first consonant letter is not pronounced in <dj-, gj-, hj-, lj->: djup (deep), gjuta (pour), hjul (wheel), ljuga (lie).
German: <o> always signifies [o]: Sohn (son), Sonne (sun); <z> signifies [ts]: Zahn (tooth); final <b, d, g> become [p, t, k] (final devoicing): Sieb (strainer), Rad (wheel), Steig (path).
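The German final-devoicing rule just listed can be illustrated as a toy string rule. This is a sketch covering only word-final <b, d, g>, not a full grapheme-to-phoneme converter, and the function name is ours:

```python
# Toy sketch of German final devoicing: word-final <b, d, g> are
# pronounced [p, t, k]. Covers only this single rule.
DEVOICE = {"b": "p", "d": "t", "g": "k"}

def final_devoice(word: str) -> str:
    """Return the word with its final obstruent devoiced, if applicable."""
    if word and word[-1].lower() in DEVOICE:
        return word[:-1] + DEVOICE[word[-1].lower()]
    return word

# Examples from the text: Sieb -> Siep, Rad -> Rat, Steig -> Steik
```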
Contrastive aspects
The phoneme systems of vowels and consonants, as well as the phonological processes, are rather similar in the two languages; however, prosody and the grapheme-phoneme relationships show some differences. Only the salient differences will be pointed out.
Vowels
Swedish: long [ɑː] gata (street) and [ʉː] duk (cloth); short [ɵ] hundra (hundred).
German: lax and short [ɪ, ʏ, ʊ] Mitte (middle), Hütte (hut), Mutter (mother); long [aː] Vater (father); diphthong [aʊ] Haus (house).
Pronunciation norms
The impression of foreign accent is, to the greatest extent, caused by segmental and prosodic deviations from the pronunciation norm of the target language. This is spoken with parts of the
Preliminary results
A representative choice of deviations for each group is shown in the following tables. Group results are presented according to their frequencies of appearance. Together with the code number and the frequency of appearance of each deviation, the target symbol and its replacement (the deviation) are given.
Swedish
[Table: for each Swedish deviation, the code number, frequency of appearance, and target symbol with its replacement; the multi-column layout was not preserved.]
Coding system
Each pronunciation deviating from the norms defined above, whatever its cause, is labelled by a special mark, a code number, separately for each language. Code numbers are listed for different areas of interest: vowels, single consonants, consonant clusters, phonological processes, prosody, grapheme-to-phoneme relationships and use of first language forms. Although a number of deviations are identical in both languages, language users show a great variety of different labels. Most of the observed code numbers and their labels are presented below (results). The coding system allows different statistical treatments of the data, especially the quantification of deviations. Thus it is possible to calculate each learner's profile of pronunciation difficulties, as well as profiles for each kind of material (read-aloud texts, descriptions of pictures and a narrative), for the male and female learners as subgroups, and finally for all the learners together. This will be an important aspect and the basis for the construction of a stand-alone programme for the learning of pronunciation.
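The profiling described above reduces to counting labelled deviations per learner and material type. A minimal sketch follows; the code numbers use the paper's S-prefix scheme, but the observations and their pairings with learners are invented for illustration:

```python
from collections import Counter

# Hypothetical labelled observations: (learner, material, code number).
# Code numbers follow the paper's S-prefix scheme; the data are invented.
observations = [
    ("A", "read", "S114"), ("A", "read", "S114"), ("A", "picture", "S308"),
    ("B", "read", "S114"), ("B", "narrative", "S501"),
]

def profile(learner, material=None):
    """Frequency profile of deviation codes, optionally per material type."""
    return Counter(code for who, mat, code in observations
                   if who == learner and (material is None or mat == material))

print(profile("A"))           # Counter({'S114': 2, 'S308': 1})
print(profile("A", "read"))   # Counter({'S114': 2})
```

The same tally, run over all learners or over one kind of material, yields the group profiles the text describes.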
German
[Table: for each German deviation, the code number, frequency of appearance, and target symbol with its replacement; the multi-column layout was not preserved.]
References
Bannert Robert. 1979a. Ordstruktur och prosodi. In: Svenska i invandrarperspektiv, pp. 132-173. Hyltenstam Kenneth (ed.). Lund.
Bannert Robert. 1980. Phonological strategies in the second language learning of Swedish prosody. PHONOLOGICA 1980, pp. 29-33. Dressler W.U., Pfeiffer O.E. and Rennison J.R. (eds). Innsbruck.
Bannert Robert. 1984. Prosody and intelligibility of Swedish spoken with a foreign accent. Acta Universitatis Umensis 59, pp. 8-18. Elert Claes-Christian (ed.).
Bannert Robert (with Johannes Schwitalla). 1999. Äußerungssegmentierung in der deutschen und schwedischen gesprochenen Sprache. Deutsche Sprache. Zeitschrift für Theorie und Praxis 4, pp. 314-335.
Bannert Robert. 2004. På väg mot svenskt uttal (including CD-ROM). Lund: Studentlitteratur.
Duden. 2001. Aussprachewörterbuch. Mannheim: Dudenverlag.
Hedelin Per. 1997. Norstedts svenska uttalslexikon. Stockholm: Norstedts.
Selinker Larry. 1972. Interlanguage. International Review of Applied Linguistics 10, 209-231.
Abstract
A comparative analysis of new dialect data on word accents in Dalarna and accent contours published by Meyer (1937, 1954) revealed differences indicating a change, primarily in the realization of the grave accent. The change, a delayed grave accent peak, is tentatively seen as a result of a north-westward spread of word accent patterns formerly characterizing dialects of south-eastern Dalarna.
Background
A pilot study
Method
New recordings were made of speakers between 20 and 50 years of age, all having lived in Leksand and Rättvik for most of their lives. Data were collected from both female and male speakers, in total 11 from Leksand and 13 from Rättvik.
The speakers were recorded in their homes (or at work or school). The material consisted of two words produced in isolation, Polen /'po:len/ (Poland, acute) and pålen /`po:len/ (the pole, grave). They were elicited in random order (together with other words, not reported on here) by means of cards with the respective words written on them. Each speaker produced at least five repetitions of each word, but some produced as many as eight or even more.
Digitized versions of the material were
analyzed and the location of the f0-peak was
measured (in msec) relative to the VC
boundary. A position of the peak before and
after the boundary resulted in negative and
positive values, respectively. In addition to
absolute durations, percentages were calculated
(the distance (in msec) of the peak from the VC
boundary relative to the duration of the
segment before or after the boundary) to
neutralize speaking rate variation. Peak
positions were sometimes difficult to identify;
many contours had plateaus rather than peaks,
and laryngealizations and other voice quality
features added to the problems. Dubious cases
were therefore eliminated and the reported data
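The timing measure described here is straightforward to compute. In the sketch below (variable names are ours, not the authors'), a peak before the VC boundary gives a negative value, and the percentage normalizes by the duration of the segment in which the peak falls:

```python
def peak_timing(peak_ms, boundary_ms, vowel_dur_ms, cons_dur_ms):
    """Peak position relative to the VC boundary, in ms and in per cent.

    Negative values: peak before the boundary (within the vowel);
    positive values: peak after the boundary (within the consonant).
    """
    dist = peak_ms - boundary_ms
    seg = vowel_dur_ms if dist < 0 else cons_dur_ms
    return dist, 100.0 * dist / seg

# An acute-like token: peak 60 ms before a boundary at 200 ms,
# with a 200 ms vowel and a 150 ms consonant
print(peak_timing(140.0, 200.0, 200.0, 150.0))  # (-60.0, -30.0)
```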
Results
Table 2 shows the individual results (absolute mean durations and standard deviations) for the two target words produced by the Leksand and Rättvik speakers. (The number of tokens of each word analyzed for each speaker varied between 5 and 14.) Apart from individual durational differences, the pattern is the same for all but one of the speakers: the acute word has its peak located before, and the grave word after, the VC boundary. (Although measurements were made of peak positions both in terms of absolute durations and percentages relative to the VC boundary, only absolute durations are reported here, as very similar patterns resulted from the two types of measurement.)
Table 2. Timing of grave and acute accent peaks (means and standard deviations) for 5 Leksand (L1-L5) and 6 Rättvik (R1-R6) speakers. Negative values represent peaks before, and positive values peaks after, the VC boundary.
[Table 2 data: per-speaker mean peak timings and standard deviations for speakers L1-L5 and R1-R6; the column layout was not preserved.]
Conclusions
A comparative analysis of present-day dialect data on word accents in Dalarna and accent contours published by Meyer (1937, 1954) has revealed differences indicating a change in the realization of the grave accent. This change, a delayed grave-accent peak, is tentatively seen as a result of a north-westward spread of accent patterns formerly characterizing dialects of south-eastern Dalarna. Clearly, however, this assumption has to be confirmed by extending the material for analysis.
Acknowledgements
We are grateful to Olle Engstrand and Gunnar Nyström for allowing us to include figure 2 in
this study. This work has been supported by a
grant from the Bank of Sweden Tercentenary
Foundation, 1997-5066.
Notes
1. This volume was published posthumously.
References
Bruce G. and Gårding E. (1978) A prosodic typology for Swedish dialects. In Gårding E., Bruce G. and Bannert R. (eds) Nordic Prosody, 219-228. Department of Linguistics, Lund University.
Engstrand O. and Nyström G. (2002) Meyer's accent contours revisited. TMH-QPSR 44, 17-20.
Fransson L. (2004) Fyra daladialekters ordaccenter i tidsperspektiv: Leksand, Rättvik, Malung och Grangärde. Thesis work in phonetics, Umeå University.
Gårding E. (1977) The Scandinavian word accents. Malmö: CWK Gleerup.
Gårding E. and Lindblad P. (1973) Constancy and variation in Swedish word accent patterns. Working Papers 7. Phonetics Laboratory, Lund University.
Meyer E. A. (1937) Die Intonation im Schwedischen I. Stockholm: Fritzes förlag.
Meyer E. A. (1954) Die Intonation im Schwedischen II. Uppsala: Almqvist & Wiksell.
Abstract
The paper addresses the issue of extraction of
implicit information conveyed by systematic
audio-visual contingencies. A group of adult
subjects was tested on a simple inference task
provided by short film sequences. The video
materials were encoded and submitted to processing by two neural networks (NN) that simulated the results of the adult subjects. The results indicated that the adult subjects were extremely efficient at picking up the underlying
information structure and that the NN could
also perform acceptably on both classification
and generalization tasks.
Introduction
Language acquisition can be described as a process through which infants derive the underlying linguistic structure of their ambient language. In spite of the complexity and variability of the language input, it is an undeniable fact that within about two years of life typical infants are able to pick up the linguistic regularities of the ambient language. Making sense of linguistic information that is implicitly conveyed in a diversity of speech communication situations appears to be such an insurmountable task that researchers are prone to consider that some sort of initial guidance is necessary to home in on the ambient language's underlying principles (Chomsky, 1968; Pinker, 1994). The present paper attempts to challenge this established view by sketching a scenario in which linguistic information may be derived in the absence of pre-knowledge or dedicated linguistic biases. Indeed, language can be seen as an emergent consequence of the interplay between the infant and its environment, where the richness and structure of the sensory flow may contain enough information to trigger language development (Jusczyk, 1985; Elman, Bates, Karmiloff-Smith, Parisi, & Plunkett, 1997). More explicitly, the language acquisition hypothesis to be tested in this paper relies on the assumption that linguistic structure is implicit in the
Figure 2. Reference data provided by the adult subjects: percentage of correct discoveries of the meaning of the non-words representing the colours and the shapes of the objects. Only in two cases were errors made by the subjects.
Figure 3. Schematic architecture of the auto-association NN aimed at simulating a priori knowledge. The task of the NN was to reproduce its input at the output level. The number of units in each layer is indicated by the figures to the left of the rectangles.
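The auto-association task in Figure 3 (reproduce the input at the output level) can be sketched with a minimal one-hidden-layer autoencoder. The layer sizes, learning rate and random data below are illustrative stand-ins, not the authors' actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down auto-associative net: input -> tanh hidden -> linear output,
# trained by gradient descent to reproduce its own input.
n_in, n_hid = 16, 8
W1 = rng.normal(0.0, 0.1, (n_in, n_hid))
W2 = rng.normal(0.0, 0.1, (n_hid, n_in))
X = rng.integers(0, 2, (50, n_in)).astype(float)   # random binary patterns

def forward(X):
    H = np.tanh(X @ W1)
    return H, H @ W2

def recon_error(X):
    return float(np.mean((forward(X)[1] - X) ** 2))

before = recon_error(X)
lr = 0.5
for _ in range(500):
    H, Y = forward(X)
    d_out = 2.0 * (Y - X) / X.size           # gradient of the mean squared error
    d_hid = (d_out @ W2.T) * (1.0 - H ** 2)  # backpropagated through tanh
    W2 -= lr * H.T @ d_out
    W1 -= lr * X.T @ d_hid
print(before > recon_error(X))  # True: training reduces reconstruction error
```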
Generalization test items: Nela Duma (Red Cone), Lame Guma (Yellow Cylinder), Neme Dule (Blue Cube), Gale Bima (Green Circle).
Discussion
The ultimate goal of this study is to investigate how human infants might be able to extract implicit information.
To be sure, the stimuli used in this first experiment are likely to be too simple to fully demonstrate relevant language development relying
on general-purpose associative mechanisms.
Therefore our current experiments with infants
are being conducted using audio-visual contingencies that attempt to replicate ecologically
relevant communication settings.
Acknowledgements
This work was supported by grants from the
Swedish Research Council, the Bank of Sweden Tercentenary Foundation and Birgit & Gad
Rausings Foundation.
References
Chomsky N. (1968) Language and mind. New York: Harcourt Brace Jovanovich.
Elman J., Bates E., Karmiloff-Smith A., Parisi D., & Plunkett K. (1997) Rethinking innateness. Cambridge, Massachusetts: MIT Press.
Jusczyk P. (1985) On characterizing the development of speech perception. In Mehler J. & Fox R. (eds), Neonate cognition: Beyond the blooming, buzzing confusion. Hillsdale, New Jersey: Lawrence Erlbaum, 199-299.
Lacerda F. (2003) Phonology: An emergent consequence of memory constraints and sensory input. Reading and Writing: An Interdisciplinary Journal 16, 41-59.
Lacerda F., Klintfors E., Gustavsson L., Lagerkvist L., Marklund E., & Sundberg U. (2004a) Ecological Theory of Language Acquisition. Proceedings of Epirob 2004 (Genova), 147-148.
Lacerda F. and Lindblom B. (1997) Modelling the early stages of language acquisition. In Olofsson Å. and Strömqvist S. (eds), Cross-linguistic studies of dyslexia and early language development. Office for Official Publications of the European Communities, 14-33.
Lacerda F., Marklund E., Lagerkvist L., Gustavsson L., Klintfors E., & Sundberg U. (2004b) On the linguistic implications of context-bound adult-infant interactions. Proceedings of Epirob 2004 (Genova), 149-150.
Pinker S. (1994) The Language Instinct: How the Mind Creates Language. New York: William Morrow and Company.
Figure 6. Results of the NN performance: percentage of correct generalizations of the meaning of the non-words representing the colours and the shapes of the objects. The results indicate that generalization of shapes was slightly more robust than generalization of colours.
Abstract
Our ability to estimate speaker age was investigated with respect to stimulus duration and type as well as speaker gender in four listening tests with the same 24 speakers but four different types of stimuli (ten and three seconds of spontaneous speech, one isolated word, and six concatenated isolated words). Results showed that the listeners' judgements were about twice as accurate as chance, and that stimulus duration and type affected the judgements. Moreover, stimulus duration affected the listeners' judgements of female speakers somewhat more, while stimulus type affected the judgements of male speakers more, indicating that listeners may use different strategies when judging female and male speaker age.
Introduction
Most of us are able to make fairly accurate estimates of an unknown speaker's chronological age from hearing a speech sample (Shipp and Hollien, 1969; Linville, 2001). This paper addresses the question of how much and what kind of speech information we need in order to make as good estimates of speaker age as possible.
Background and previous studies
In age estimation, the accuracy depends, among
other things, on the precision required and on
the duration and type of the speech sample
(prolonged vowel, read speech etc.). The less
acoustic information present in a speech sample, the more difficult the task, but even with
very little information, listeners are still not reduced to random guessing. Speaker and listener
characteristics, including gender, age group, the
speaker's physiological and psychological state,
and the listener's experience or familiarity with
similar speakers (dialect etc.) may also influence the accuracy (Ramig and Ringel, 1983;
Linville, 2001). Consequently, some speakers
may be more difficult to judge than others.
A considerable amount of research has been
devoted to the issue of age recognition from
speech (Ptacek and Sander, 1966; Huntley et
al., 1987; Braun and Cerrato, 1999; Linville,
The sum, mean and median values of the errors for all speakers in the four tests as well as
for the baseline are shown in Table 2. In all
tests, the listeners' judgements of women were
more accurate than those of men. The highest
accuracy was obtained for the female 10 second
stimuli (6.5), while the male 6 word stimuli received the lowest accuracy (15.3). Moreover,
the listeners tended to overestimate the younger
speakers, and to underestimate the older speakers.
Table 1. Test number, stimuli set, number of listeners (N), and gender and age distribution of the listener groups in the four tests. [Only the N and F columns were preserved.]

Test (stimuli)   N    F
1 (10 sec.)      31   18
2 (3 sec.)       33   22
3 (6 words)      37   33
4 (1 word)       37   33

Table 2. Sum, mean and median error values for all speakers in the four tests and for the baseline (BL). [Only the row for Test 1 was preserved.]

Test      sum     mean   median
1 (10s)   196.5   8.2    7.2
Results
Accuracy
Figure 1 displays the mean absolute error, i.e.
the average of the absolute difference between
perceived age (PA) and chronological age (CA)
Discussion
Despite the limited number of stimulus durations and types investigated, a few interesting
results were found. These are discussed below,
along with a few suggestions for future work.
Accuracy
The listeners performed significantly better
than the baseline estimator (about twice as well) in three of the tests, which is in line with
previous work. However, it remains unclear
what accuracy levels can be expected from listeners' judgements of age. Differences in
speakers' CA have to be taken into account as
well. A mean absolute error of 10 years could be considered less accurate for a 30-year-old speaker (a PA of 20 could be regarded as 20/30 = 66.7% correct) than for an 80-year-old speaker (a PA of 70 could be regarded as 70/80 = 87.5% correct). There is a need for a better measure of accuracy for age estimation tasks.
The fact that three different listener groups participated in the tests may also have influenced
the accuracy.
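The two notions of accuracy contrasted above are easy to make explicit. The sketch below (function names are ours) computes the mean absolute error alongside the age-relative score from the paragraph's examples; treating overestimates symmetrically is our assumption, beyond what the text states:

```python
def mean_abs_error(perceived, chronological):
    """Mean absolute difference between perceived and chronological ages."""
    return sum(abs(pa - ca)
               for pa, ca in zip(perceived, chronological)) / len(perceived)

def relative_score(pa, ca):
    """Age-relative accuracy, e.g. PA 20 vs. CA 30 -> 20/30.
    Overestimates are scored symmetrically (our assumption)."""
    return min(pa, ca) / max(pa, ca)

# The same 10-year error scores differently for different speaker ages:
print(round(100 * relative_score(20, 30), 1))  # 66.7
print(round(100 * relative_score(70, 80), 1))  # 87.5
print(mean_abs_error([20, 70], [30, 80]))      # 10.0
```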
In all of the four tests, the listeners' estimations of women were more accurate than those
of men, perhaps because the listeners were
mainly women. However, the influence of listener gender on performance in age estimation
tasks is still unclear. Although most researchers
have not reported any difference in performance between male and female listeners, some
studies have found females to perform better
than males, while others still have found male
listeners to perform somewhat better (Braun
and Cerrato, 1999). Another explanation could
be that the male speaker group contained a
larger number of atypical speakers, who consequently would be more difficult to judge, than
the female speakers. Shipp and Hollien (1969)
found that speakers who were difficult to age
estimate had standard deviations of nine years
and over. Perhaps such a measure can be used
to decide whether speakers are typical representatives of their CAs or not.
Stimulus type
Stimulus type also influenced the age estimations significantly (F(1,68)=61.143, p<.05).
The listeners' judgments of the male speakers
were more accurate for the spontaneous stimuli
than for the word stimuli. Lower mean absolute
errors were obtained for the two sets of spontaneous stimuli (9.9 and 11.6) compared to the
two sets of word stimuli (15.3 and 15.1). This
effect was not observed for the female speakers. Here, the mean absolute error for the 6
word stimuli (7.9) was lower than for the 3 second spontaneous stimuli (9.7), but higher than
the longer spontaneous stimuli (6.5). The interaction of speaker gender and stimulus type was
significant (F(1,68)=39.296, p<.05).
Listener cues
Most of the listeners named several cues, which
they believed had influenced their age judgements. Dialect, pitch and voice quality affected
the listeners' estimates in all four tests, while
semantic content influenced the judgements in
the tests with spontaneous stimuli. A common
listener remark in the tests with spontaneous
stimuli concerned speakers talking about the
past. They were often judged as being old, regardless of other cues. Additional listener cues
included speech rate, choice of words or
phrases and experience or familiarity with similar speakers (age group, dialect etc.).
Stimulus effects
In this study, longer durations for the most part
yielded higher accuracy for the listeners' age
estimates. This raises the question of optimal
durations for age estimation tasks. When does a
further increase in duration for a specific
speech or stimulus type no longer result in a
References
Braun A. and Cerrato L. (1999) Estimating speaker age across languages. Proceedings of ICPhS 99 (San Francisco), 1369-1372.
Brückl M. and Sendlmeier W. (2003) Aging female voices: An acoustic and perceptive analysis. Proceedings of VOQUAL 03 (Geneva), 163-168.
Cerrato L., Falcone M. and Paoloni A. (1998) Age estimation of telephonic voices. Proceedings of the RLA2C conference (Avignon), 20-24.
Higgins M. B. and Saxman J. H. (1991) A comparison of selected phonatory behaviours of healthy aged and young adults. Journal of Speech and Hearing Research 13, 1000-1010.
Huntley R., Hollien H. and Shipp T. (1987) Influences of listener characteristics on perceived age estimations. Journal of Voice 1, 49-52.
Linville S. E. (2001) Vocal Aging. San Diego: Singular Thomson Learning.
Müller C., Wittig F. and Baus J. (2003) Exploiting speech for recognizing elderly users to respond to their special needs. Proceedings of Eurospeech 2003 (Geneva), 1305-1308.
Murry T. and Singh S. (1980) Multidimensional analysis of male and female voices. JASA 68 (5), 1294-1300.
Ptacek P. H. and Sander E. K. (1966) Age recognition from voice. Journal of Speech and Hearing Research 9, 273-277.
Ramig L. A. and Ringel R. L. (1983) Effects of physiological aging on selected acoustic features. Journal of Speech and Hearing Research 26, 22-30.
Schötz S. (2005) Prosodic cues in human and machine estimation of female and male speaker age. In G. Bruce & M. Horne (eds) Nordic Prosody: Proceedings of the IXth Conference, Lund 2004. Frankfurt am Main: P. Lang, 215-223.
Shipp T. and Hollien H. (1969) Perception of the aging male voice. Journal of Speech and Hearing Research 12, 703-710.
Bruce G., Elert C.-C., Engstrand O. and Eriksson A. (1999) Phonetics and phonology of the Swedish dialects - a project presentation and a database demonstrator. Proceedings of ICPhS 99 (San Francisco), 321-324.
Abstract
This study is concerned with effects of age of onset (AO) of acquisition on the production of Voice Onset Time (VOT) among near-native L2 speakers. 41 L1 Spanish early and late learners of L2 Swedish, who had been carefully screened for their nativelike L2 proficiency, participated in the study. 8 native speakers of Swedish served as a control group. Spectral analyses of VOT were carried out on the subjects' production of the Swedish voiceless stops /p t k/. The preliminary results show an overall age effect on VOT in the nativelike L2 speakers' production of all three stops (answer to Research Question 1). Among the late learners, only a small minority exhibits actual nativelike L2 behavior (answer to Research Question 2). Finally, far from all early L2 speakers pass as native speakers of their L2 as regards the production of voiceless stops (answer to Research Question 3).
Introduction
Several studies on infant perception have
shown that first language (L1) phonetic categories are already established during the first year
of life (e.g. Eimas et al. 1971, Werker et al.
1984). Further evidence that very early exposure is of importance in L1 development comes
from children who were deprived of verbal
input due to inflammation of the middle ear
during their first year of life. Ruben (1997) reports that these children showed significantly
less capacity for phonetic discrimination compared to children with normal hearing during
infancy when they were tested at the age of
nine years. From these findings it has been concluded that there may exist a critical period for
phonetic/phonological acquisition and that this
critical period may already be over at the age of
one year (Ruben 1997).
One classical issue in the field of language
acquisition concerns whether this theory of the
existence of a critical period can be applied to
second language (L2) acquisition. More precisely, the question is whether L2 learners typically
fail to acquire phonetic detail because of lack
(1) Is there a general age effect on VOT production among L2 learners who are perceived by native listeners as native speakers of Swedish?
(2) Are there late L2 learners who produce
voiceless stops with an average VOT
within the range of native-speaker VOT?
(3) Do all (or most) early L2 learners produce
voiceless stops with an average VOT
within the range of native-speaker VOT?
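Questions (2) and (3) amount to a range check of each learner's average VOT against the native speakers' values. A minimal sketch follows; all numbers are invented for illustration, not the study's data:

```python
def within_native_range(learner_vots, native_vots):
    """True if the learner's mean VOT lies within the native min-max range."""
    mean = sum(learner_vots) / len(learner_vots)
    return min(native_vots) <= mean <= max(native_vots)

# Invented VOT values in ms, for illustration only
native_means = [55, 60, 68, 72]
print(within_native_range([58, 62, 61], native_means))  # True
print(within_native_range([20, 25, 24], native_means))  # False
```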
Method
Subjects
A total of 41 native speakers of Spanish (age 21-52 years), who had been selected on the criterion that native listeners perceive them as mother-tongue speakers of Swedish, participated in this study. The nativelike subjects' mean length of residence (LOR) in Sweden was 24 years (range 12-44 years) and their age of onset (AO) of L2 acquisition varied between 1 and 19 years. Furthermore, the subjects had an educational level of no less than senior high school, and they had all acquired the variety of Swedish spoken in the greater Stockholm area.
A control group was added consisting of 8 native speakers of Swedish who had been matched with the experimental group regarding present age, educational level and variety of Swedish.
Results
Since it is a well-known fact that VOT varies
with place of articulation (see, e.g. Lisker &
Abramson, 1964) results for the three voiceless
stops are presented separately.
Figures 1-3 show the subjects' average VOT values (in ms) plotted against their age of onset (AO).
The present study has revealed that among subjects who have been selected on the criterion that native listeners perceive them as mother-tongue speakers of Swedish, there exists a weak but statistically significant correlation between AO and VOT production. In other words, these findings confirm that there is a general age effect on voiceless stop production even among apparently nativelike L2 speakers (Research Question 1).
The analysis of the post-puberty group has shown that only two (AO = 14 years) of the 10 late L2 learners show average VOTs within the range of native-speaker VOT for all three stops. Furthermore, eight late learners do not produce VOTs within the native-speaker range for all three stops, and three of these subjects …
References
Abrahamsson, N., Stölten, K. & Hyltenstam, K.
(in press), Effects of age on voice onset
time: The production of voiceless stops by
near-native L2 speakers. To appear in: S.
Haberzettl (ed.), Processes and Outcomes:
Explaining Achievement in Language
Learning. Berlin: Mouton de Gruyter.
Eimas P. D., Siqueland E. R., Jusczyk, P., and
Vigorito J. (1971) Speech perception in infants. Science 171, 303-306.
Flege J. E. (1991) Age of learning affects the
authenticity of voice-onset time (VOT) in
stop consonants produced in a second language. Journal of the Acoustical Society of
America 89:1, 395-411.
Kessinger R. H. and Blumstein S. E. (1997) Effects of speaking rate on voice-onset time in Thai, French, and English. Journal of Phonetics 25, 143-168.
Lisker L. and Abramson A. (1964) A cross-language study of voicing in initial stops: Acoustical measurements. Word 20, 384-422.
Miller J. L., Green K. P., and Reeves A. (1986)
Speaking rate and segments: A look at the
relation between speech production and
speech perception for the voicing contrast.
Phonetica 43, 106-115.
Ruben R. J. (1997) A time frame of critical/sensitive periods of language development. Acta Otolaryngologica 117, 202-205.
Werker J.F. and Tees, R.C. (1984) Crosslanguage speech perception: Evidence for
perceptual reorganization during the first
year of life. Infant Behaviour and Development 7, 49-63.
Zampini M. L. and Green K. P. (2001) The
voicing contrast in English and Spanish:
The relationship between perception and
production. In: Nicol J. L. (ed) One Mind,
Two Languages. Oxford: Blackwell.
Acknowledgements
This study was in part supported by The Bank
of Sweden Tercentenary Foundation, grant no.
1999-0383:01.
Notes
1. A more detailed description and discussion
of this study will be given in Abrahamsson, Stölten & Hyltenstam (in press).
Abstract
This is an experimental study of tonal correlates of prosodic phrasing and focus production in Greek. The results indicate: (1) the tonal correlates of phrasing are a rising tonal command at phrase boundaries and a deaccentuation of the preboundary lexical stress; (2) the tonal correlates of focus are a local tonal range expansion aligned with the stressed syllable of the last lexical unit in focus, and a global tonal range compression, which is most evident in the speech material after the focus; (3) phrasing and focus have significant interactions, according to which the phrasing tonal command is suppressed as a function of focus production in the same linguistic domain.
Introduction
This study falls within a multifactor research context in linguistic structuring. We examine the relation between sound and meaning as a function of linguistic distinctions and linguistic structures in an integrated experimental framework, in the spirit of the ISCA Workshop on Experimental Linguistics (see Botinis, Charalabakis, Fourakis and Gawronska, 2005).
Phrasing and focus are abstract linguistic categories with distinctive functions in linguistic structuring. The basic functions of phrasing and focus are, respectively, the segmentation of continuous speech into a variety of meaningful linguistic units, and the marking of some linguistic units as more important than others. We have basic knowledge of both phrasing and focus from earlier research (e.g. Botinis, 1989; Fourakis, Botinis and Katsaiti, 1999; Botinis, Bannert and Tatham, 2000; Botinis, Ganetsou and Griva, 2004), but we do not have any knowledge of phrasing and focus interactions in the same linguistic domains.
In this study we present production data, while perception research on phrasing and focus interactions is being carried out. In the remainder of the paper, the experimental methodology is presented next, followed by the results and concluded by a discussion.
Experimental methodology
One experiment was designed in order to investigate distinctive phrasing and focus structures. The speech material consists of two compound test sentences with a phrasing distinction as well as four focus distinctions. The phrasing distinction involves the attachment of a surface subject to either the subordinate or the main clause. The focus distinctions involve one neutral production as well as three productions with focus on different constituents of the test sentences. The neutral production of the test sentences had no contextual information, whereas the focus productions were preceded by a question which elicited focus on different constituents of the test sentences.
The two test sentences were /ótan épeze bála, i maría ðjávaze arxéa/ ('When (he) was playing football, Maria was studying Ancient (Greek)') and /ótan épeze bála i maría, ðjávaze arxéa/ ('When Maria was playing football, (he) was studying Ancient (Greek)'). Thus, the noun Maria is the subject of the subordinate and the main clause in pre-comma and post-comma position respectively. With different elicitation questions, focus was assigned to the test sentences in three different ways, i.e. on the subordinate clause, on the main clause, and on the subject Maria.
Two female students of the Linguistics Department at Athens University produced the speech material in five repetitions at normal speech tempo. The speech material was recorded directly to computer disk and analysed with the WaveSurfer software package.
Three tonal measurements were taken in each syllable, i.e. at the beginning, middle and end, regardless of the segmental structure of the syllable. This methodology normalizes the tonal measurements with reference to the temporal and tonal alignments of the produced utterances.
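The three-measurements-per-syllable procedure can be sketched as follows; this is a minimal illustration with invented numbers, not the authors' actual analysis scripts:

```python
# Sketch: sample an F0 track at the beginning, middle and end of each syllable.
# The track values and syllable boundaries below are invented for illustration.

def sample_syllable_f0(times, f0, syllables):
    """For each (start, end) syllable interval (in s), linearly interpolate
    the F0 track at the start, midpoint and end of the interval."""
    def f0_at(t):
        for (t0, v0), (t1, v1) in zip(zip(times, f0), zip(times[1:], f0[1:])):
            if t0 <= t <= t1:
                return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        return f0[-1]  # fallback for times past the end of the track
    return [(f0_at(s), f0_at((s + e) / 2), f0_at(e)) for s, e in syllables]

times = [0.00, 0.10, 0.20, 0.30, 0.40]  # s
f0    = [180, 200, 220, 210, 190]       # Hz
triples = sample_syllable_f0(times, f0, [(0.0, 0.2), (0.2, 0.4)])
print(triples)  # three F0 values per syllable, regardless of syllable duration
```

Because each syllable contributes exactly three values, contours of syllables with different durations become directly comparable.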
Results
The results of this study, in accordance with the experimental methodology described in the previous section, are presented as average values of the tonal measurements in Figure 1.
[Figure 1, panels 1a-2d: F0 tracks (Hz) over the syllable sequence of the two test sentences under the four focus conditions; the plotted data are not recoverable from the text extraction.]
Figure 1. Average values of tonal measurements as a function of prosodic phrasing (1-2), indicated
by solidus (/), and focus productions (a-d), indicated by capital letters (see text).
References
Botinis A. (1989) Stress and Prosodic Structure
in Greek. Lund University Press.
Botinis A., Bannert R., and Tatham M. (2000)
Contrastive tonal analysis of focus perception in Greek and Swedish. In Botinis A.
(ed.), Intonation: Analysis, Modelling and
Technology, 97-116. Dordrecht: Kluwer
Academic Publishers.
Botinis A., Ganetsou S., Griva M., and Bizani
H. (2004) Prosodic phrasing and syntactic
structure in Greek. The XVIIth Swedish
Phonetics Conference, 96-99, Stockholm,
Sweden.
Fourakis M., Botinis A., and Katsaiti M. (1999)
Acoustic characteristics of Greek vowels.
Phonetica 56, 28-43.
Abstract
This is an experimental study of syntactic and tonal correlates of focus in Greek and Russian. Three experiments were carried out, the results of which indicate: first, the dominant word order is SVO in both Greek and Russian; second, focus distinctions have inverse word order effects, according to which syntactic elements of focus elicitations are dislocated to sentence beginning and sentence end in Greek and Russian respectively; third, focus has a local tonal range expansion and a global tonal range compression in both Greek and Russian.
Introduction
This study is in the spirit of the forthcoming ISCA Workshop on Experimental Linguistics, to be held in Athens, Greece, in 2006 (see Botinis et al. 2005, this volume). Three experiments were carried out, the main questions of which are: (1) which is the unmarked word order? (2) which are the word order correlates of focus production? (3) which are the tonal correlates of focus production? These questions are also related to contrastive linguistics and language typology with reference to sentence structure production in Greek and Russian.
Experimental methodology
The basic language material of the three experiments in this study consists of controlled speech situations, in which experimenters from Athens and Saint Petersburg, for Greek and Russian respectively, were asked to produce utterances with reference to pictures on a computer screen showing apparent agent-action-goal semantic relations. The language material was recorded directly to computer disk and the tonal analysis was carried out with WaveSurfer.
The main objective of the first experiment was to investigate the unmarked word order of written sentence production. Lexical words corresponding to the syntactic categories subject (S), verb (V) and object (O) were copied from the basic language material and were written in …
Results
The results of the three experiments described in the previous section are shown in Figures 1 and 2, with reference to the syntactic and tonal correlates of focus distinctions respectively.
As shown in Figures 1a and 1b, SVO is the dominant word order in unmarked written production in both Greek (1a) and Russian (1b), with marginal word order variability across speakers' age and gender.
As shown in Figures 1c and 1d, the neutral elicitation of spoken productions has a dominant SVO structure in Russian (1d) but not in Greek (1c).
[Figure 1, panels a-f: distributions of word order types (SVO, VSO, OVS, VOS, SOV, OSV) by language, speaker group and focus condition; the plotted data are not recoverable from the text extraction.]
Figure 1. Greek (left) and Russian (right) word order of basic syntactic categories as a function of speakers' age and gender: written production (a-b), focus elicitations of spoken production (c-d) and focus elicitations of written production (e-f).
Figure 2. Tonal structures of variable word order and distinctive focus productions of the sentences /o erɣátis ftiáxni ti lába/ ('The worker repairs the lamp') and /máljtik njisjót glóbus/ ('The boy carries the globe') in Greek (left) and Russian (right) respectively (capital letters indicate focus).
References
Botinis A. (1989) Stress and Prosodic Structure
in Greek. Lund University Press.
Botinis A., Bannert R., and Tatham M. (1998)
Contrastive tonal analysis of focus perception
in Greek and Swedish. In Botinis A. (ed)
Intonation: Analysis, Modelling and
Technology. Dordrecht: Kluwer Academic
Publishers.
Botinis A., Charalabakis Ch., Fourakis M., and
Gawronska B. (2005) Athens 2006 ISCA
Workshop on Experimental linguistics (this
volume).
Hirst D., and Di Cristo A. (eds) (1998) Intonation
Systems. Cambridge University Press.
Svetozarova N. (1998) Intonation in Russian. In
Hirst D. and Di Cristo A. (eds) Intonation
Systems, 261-274. Cambridge University
Press.
Yoo H.-Y. (2003) Ordre des mots et prosodie. PhD dissertation, University of Paris 7.
Abstract
Attitudinally-varied back channel utterances were simulated by six professional voice actors for Japanese. Contrary to the general assumption that a pitch-accent language like Japanese cannot vary the tonal configuration for attitudinal variation as a stress/intonation language can, all the speakers differentiated two kinds of tonal configurations. Further variation was achieved by phrasing utterances differently on the pitch and timing dimensions, and by adding a rising or non-rising terminal contour.
Introduction
In many stress-accent languages that have traditionally been classified as intonational languages, attitudinal meaning is expressed by means of finely defined tonal contours. In contrast, pitch-accent languages such as Japanese are assumed not to be able to choose contour types for attitudinal meaning or emotion, because these languages use lexically fixed accent shapes (Mozziconacci 2000). Thus, apart from variation in the terminal contour, the dimensions in which intonation can vary are pitch range and phrasing (Beckman and Pierrehumbert 1986).
… Kitamura 2000; Katagiri, Sugito, and Nagano-Madsen 1999, 2001). This is because these studies deal with recordings of real-life communication; the samples were therefore not systematically varied or distributed, nor were they easily analysable, due to overlap of utterances.
In order to overcome the difficulties mentioned above and to balance the phonetic data on back channels in Japanese, we present a study of another kind: well-controlled simulated utterances recorded in a good acoustic environment. The back channels presented in this study are of the unrepeatable type, following Nagano-Madsen and Sugito's classification based on phonological form. Unrepeatable back channels look more like a proper utterance, whereas repeatable back channels are of the /so:so:/, /haihai/ 'yes, yes' type.
The first back channel dealt with in the present study is /a-soo-desu-ka/ 'Is that so? I see.', which was the second most common back channel after /N:/ 'yes' (Nagano-Madsen and Sugito 1999). Phonologically, it contains the H*L accent on /soo/. The second type, /yamada-san-desu-ka/ 'Is it Mr Yamada? / I see, it is Mr Yamada', is classified as an echo back channel, in which a keyword in the previous utterance (in this case 'Mr Yamada') is repeated as a back channel. This type of back channel shows a deeper concern from the listener and is frequently used where a stream of conversation becomes lively, with quick turn-taking (Sugito et al.). Phonologically, it contains the unaccented word /yamada/. In addition, /are-desu-ka/ 'Is it that? / It is that', which is similar to /yamada-san-desu-ka/ but shorter, is also included.
Contrary to traditional belief and assumptions, one of the findings of the present paper is the systematic variation in utterance-internal tonal configuration to express attitudinal meaning in back channels.
Material
Our speech material consists of high-quality recordings of six professional voice actors (three male and three female) who were in their 30s or 40s at the time of recording. Each of them produced three back channel utterances with neutral …
Tonal configurations by speaker (U-Z) and attitude:

       U    V    W    X    Y    Z
NEU    H    H    H    H    H    H(L)
Q      H    H    H    H    H    H
JOY    LH   LH   LH   H    H    LH
SUS    LH   LH   LH   LH   LH   LH
DIS    LH   LH   H    H    LH   H
Results
Auditory and acoustic analyses revealed that speakers modified several parameters in order to produce attitudinally-varied back channels in Japanese. These included variations in tempo, tonal configuration, pitch range, vowel quality, voice quality, and clarity of articulation. Of these, the most notable systematicity was observed for tonal configuration and for phrasing in the pitch and time dimensions. The rest of the paper will focus on these aspects.
Tonal configuration
Contrary to the general assumption that tonal
configuration cannot vary in a pitch accent language like Japanese, all six speakers were found
to use two tonal configurations. These contours are further differentiated in phrasing in
the pitch and time dimensions when expressing
various attitudes.
For the utterance /asoodesuka/ 'Is that so? / I see', where /soo/ is associated with the lexical H*L accent, it is interesting to note that the expected F0 fall was largely missing. Only on one occasion (speaker Z, NEU) was there a very slight F0 fall; all other cases were produced with either a level H or a rising LH contour. Maekawa (2004), whose data included basically the same utterance /soodesuka/ (without the initial interjection /a/) in his study of paralinguistic information in Japanese, noted this change from the H*L to the LH pattern in some of the utterances.
Phrasing
There was fairly good agreement among the
speakers as to how the two tonal configurations were phrased in the pitch and time dimensions. With one exception, the attitude
JOY always had the highest F0 peak, while the
lowest peak was typically found for DIS. SUS
and DIS were spoken more slowly than other
types of attitudes and therefore had a longer
duration. Figure 3 shows a typical example of
phrasing for four attitudes by speaker Y.
Figure 2a,b. F0 contours for JOY, Q, and NEU
(above) and SUS and DIS (below) for speaker U.
[Figure: peak F0 value (Hz) by attitude (NEU, JOY, DIS, SUS) for the six speakers (U-Z); the plotted data are not recoverable from the text extraction.]
[Figure: utterance duration (ms) by attitude (NEU, JOY, DIS, SUS) for the six speakers (U-Z); the plotted data are not recoverable from the text extraction.]
Discussion
Some of the findings reported here agree with those in Maekawa's (2002) study of paralinguistic information in Japanese. The present study revealed more systematic details in the way tonal configuration was varied in conveying attitudinally varied utterances.
Although the choice of tonal configuration is limited to two basic types, Japanese speakers were found to vary phrase-internal F0 contours systematically in order to express attitudinally-varied back channels. The test material contained both accented (H*L) and unaccented words. In the case of the accented word /soo/ in /asoodesuka/, the lexical accent was altered to either H or LH, the former being typically used for NEU. In unaccented words, variation in tonal configuration was achieved by modifying the rate of the initial F0 rise. What is consistent in both cases is that in the contours used for the unmarked attitude NEU, F0 reaches its peak earlier than in marked attitudes such as SUS. In X-JToBI (Maekawa et al. 2004), the Japanese ToBI labelling scheme, H- is introduced to indicate the onset of an F0 plateau. Both accented and unaccented words can thus be analysed as expressing marked attitudes by a delayed H-. Exactly on which mora the initial and final F0 maxima are placed varies between speakers. It should also be noted that the general assumption that the phrasal H- is on the second mora was not attested in the present data, even for utterances of the NEU and Q types. This phenomenon needs further investigation.
Apart from the phrase-internal tonal variations described above, there are differences in phrasing and terminal contours. These prosodic characteristics are further modified by vowel quality, voice quality, and clarity of articulation. Identifi…
References
Ayusawa T. (editorial supervision) (2001) Accent and intonation in Tokyo Japanese. CALL sub-teaching material series, Japanese prosody Vol. 1. National Institute of Multimedia Education.
Beckman M. E. and Pierrehumbert J. B. (1986) Japanese prosodic phrasing and intonation synthesis. Proceedings of the 24th Meeting of the Association for Computational Linguistics.
Katagiri Y., Sugito M., and Nagano-Madsen Y. (1999) The forms and prosody of back channels in Tokyo and Osaka Japanese. Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco (CD).
Katagiri Y., Sugito M., and Nagano-Madsen Y. (2001) An analysis of forms and prosodic characteristics of Japanese 'aiduti' in dialogue (in Japanese). In Bunpoo to Onsei (= Speech and Grammar) III, 263-274. Tokyo: Kuroshio Publication.
Maekawa K. (2002) Production and perception of paralinguistic information. Proceedings of International Conference Speech Prosody 2004, Nara, 367-374.
Maekawa K., Igarashi Y., Kikuchi E., and Yoneyama S. (2004) Intonation labelling for the Corpus of Spontaneous Japanese, version 1.0 (in Japanese). Electronic document for the Corpus of Spontaneous Japanese.
Mozziconacci S. (2000) Prosody and emotions: A conceptual framework for research. In Proceedings Online: ISCA Workshop on Speech and Emotion.
Nagano-Madsen Y. and Sugito M. (1999) Analysis of back channel items in Tokyo and Osaka Japanese (in Japanese). In Japanese Linguistics 5, 26-45. Tokyo: National Language Research Institute.
Sugito M., Nagano-Madsen Y., and Kitamura M. (1999) The pattern and timing of repeat-back channels in natural dialogue in Japanese (in Japanese). In Bunpoo to Onsei (= Speech and Grammar) II, 3-18. Tokyo: Kuroshio Publication.
Abstract
We present an experiment where subjects were
asked to listen to Swedish human-computer dialogue fragments where a synthetic voice makes
an elliptical clarification after a user turn. The
prosodic features of the synthetic voice were
systematically varied, and subjects were asked
to judge the computer's actual intention. The
results show that an early low F0 peak signals
acceptance, that a late high peak is perceived
as a request for clarification of what was said,
and that a mid high peak is perceived as a request for clarification of the meaning of what
was said. The study can be seen as the beginnings of a tentative model for intonation of
clarification ellipses in Swedish, which can be
implemented and tested in spoken dialogue systems.
Introduction
Detection of and recovery from errors is important for spoken dialogue systems. To this
effect, system hypotheses are often verified explicitly or implicitly: the system makes a clarification request or repeats what it has heard.
These error handling techniques are often perceived as tedious, and one of the reasons for
this is that they are often constructed as full
propositions, verifying the complete user utterance. In contrast, humans often use short elliptical constructions for clarification: Purver et al. (2001) show that 45% of the clarification requests in the British National Corpus (BNC) are
elliptical. A dialogue system using word level
confidence scores could use elliptical clarifications to focus on problematic fragments, making the dialogue more efficient (Gabsdil, 2003).
However, the interpretation of ellipses is often
dependent both on context and on prosody, and
the prosody of clarification requests has not
been greatly studied.
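The word-level strategy mentioned above can be sketched as follows; this is a hypothetical fragment chooser for illustration, not the HIGGINS implementation, and the threshold and input format are invented:

```python
# Sketch: pick the lowest-confidence word of a recognition hypothesis and
# turn it into an elliptical clarification request (a "reprise fragment").
# The ASR output format and confidence threshold are invented for illustration.

def clarification_fragment(hypothesis, threshold=0.7):
    """hypothesis: list of (word, confidence) pairs from the recognizer."""
    word, conf = min(hypothesis, key=lambda wc: wc[1])
    if conf < threshold:
        return word.capitalize() + "?"  # e.g. user: "a red building" -> "Red?"
    return None                         # confident enough: no clarification needed

print(clarification_fragment([("a", 0.9), ("red", 0.4), ("building", 0.8)]))  # → Red?
print(clarification_fragment([("a", 0.9), ("red", 0.9), ("building", 0.8)]))  # → None
```

Clarifying only the problematic fragment, rather than the whole proposition, is what makes the dialogue more efficient.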
We present an experiment in which subjects
were asked to listen to Swedish dialogue fragments where the computer makes elliptical
clarifications after user turns, and to judge what
was actually intended by the computer. The
study is connected to the HIGGINS spoken dialogue system (Edlund et al., 2004). The primary
Clarification
Clarification is part of a process called grounding (Clark, 1996) or interactive communication
management (Allwood et al., 1992). In this
process, speakers give positive and negative
evidence or feedback of their understanding of
what the interlocutor says. A clarification may
often give both positive and negative evidence
showing what has been understood as well as
what is needed for complete understanding.
Clarification requests may have both different forms and different readings (i.e. functions).
In a study of the BNC, Purver et al. (2001)
studied the form and function of clarification
requests. According to their scheme, the form
of clarification ellipses studied in this paper, as
exemplified in Table 2, is called reprise fragments.
We will use a distinction made by both
Clark (1996) and Allwood et al. (1992) in order
to classify possible readings of reprise fragments. They suggest four levels of action that
take place when speaker S is trying to say
something to hearer H:
Acceptance: H accepts what S says.
Understanding: H understands what S
means.
Perception: H hears what S says.
Contact: H hears that S speaks.
Method
Three test words comprising the three colors blue, red and yellow (blå, röd, gul) were synthesized using an experimental version of the LUKAS diphone Swedish male MBROLA voice (Filipsson & Bruce, 1997), implemented as a plug-in to the WaveSurfer speech tool (Sjölander & Beskow, 2000).
For each of the three test words the following prosodic parameters were manipulated: 1) peak POSITION, 2) peak HEIGHT, and 3) vowel DURATION. Three peak positions (early, mid and late) were obtained by time-shifting the focal accent peak in intervals of 100 ms. A low-peak and a high-peak set of stimuli were obtained by setting the accent peak at 130 Hz and 160 Hz respectively. Two sets of stimulus durations (normal and long) were obtained by lengthening the default vowel length by 100 ms. All combinations of the three test words and the three parameters gave a total of 36 different stimuli. Six additional stimuli, making a total of 42, were created by using both the early and late peaks in the long-duration stimuli, which created double-peaked stimuli. A possible …
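The stimulus inventory just described (3 test words × 3 peak positions × 2 peak heights × 2 durations = 36, plus 6 double-peak stimuli) can be enumerated as a quick sanity check:

```python
# Sketch: enumerate the stimulus combinations described in the text.
from itertools import product

words     = ["blå", "röd", "gul"]
positions = ["early", "mid", "late"]
heights   = [130, 160]          # accent peak in Hz (low, high)
durations = ["normal", "long"]  # long = default vowel length + 100 ms

stimuli = list(product(words, positions, heights, durations))
double_peak = [(w, "early+late", h, "long") for w, h in product(words, heights)]

print(len(stimuli), len(stimuli) + len(double_peak))  # → 36 42
```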
Table 3: Interpretations that were significantly overrepresented, given the values of the parameters POSITION and HEIGHT and their interactions, with the standardized residuals from the χ2 test.

POSITION          Interpretation           Std. resid.
Early             ACCEPT                   3.1
Mid               CLARIFY-UNDERSTANDING    4.6
Late              CLARIFY-PERCEPTION       3.6

HEIGHT            Interpretation           Std. resid.
High              CLARIFY-UNDERSTANDING    3.2
Low               ACCEPT                   4.0

POSITION*HEIGHT   Interpretation           Std. resid.
Early*Low         ACCEPT                   3.4
Mid*Low           ACCEPT                   3.4
Mid*High          CLARIFY-UNDERSTANDING    5.6
Late*High         CLARIFY-PERCEPTION       4.4
[Figure 2: number of votes for ACCEPT, CLARIFY-UNDERSTANDING and CLARIFY-PERCEPTION as a function of peak position (early, mid, late) for HIGH and LOW peaks; the plotted data are not recoverable from the text extraction.]
Results
There were no significant differences in the distribution of votes between the different colors (red, blue, and yellow) (χ2 = 3.65, df = 4, p > 0.05), nor were there any significant differences for any of the eight subjects (χ2 = 19.00, df = 14, p > 0.05). Nor did the DURATION parameter have any significant effect on the distribution of votes (χ2 = 5.72, df = 2, p > 0.05). Both POSITION and HEIGHT had significant effects on the distribution of votes, as shown in Table 3 (χ2 = 70.22, df = 4, p < 0.001 and χ2 = 59.40, df = 2, p < 0.001, respectively). The interaction of POSITION and HEIGHT also gave rise to significant effects (χ2 = 121.12, df = 10, p < 0.001), as shown at the bottom of Table 3. Figure 2 shows the distribution of votes for the three interpretations as a function of position for both high and low HEIGHT. Results from the double-peak stimuli were generally more complex and are not presented here.
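χ2 analyses with standardized residuals of the kind reported in Table 3 can be reproduced along these lines. The vote counts below are invented for illustration, and the residuals are computed as Pearson residuals (O − E)/√E; the paper does not specify its exact residual formula:

```python
# Sketch: chi-square statistic and standardized (Pearson) residuals for a
# votes table (rows: peak position; columns: interpretation). Counts invented.

table = {
    "early": {"ACCEPT": 30, "CLARIFY-UNDERSTANDING":  8, "CLARIFY-PERCEPTION":  6},
    "mid":   {"ACCEPT": 10, "CLARIFY-UNDERSTANDING": 25, "CLARIFY-PERCEPTION":  9},
    "late":  {"ACCEPT":  8, "CLARIFY-UNDERSTANDING": 10, "CLARIFY-PERCEPTION": 26},
}

rows = list(table)
cols = list(next(iter(table.values())))
n = sum(sum(r.values()) for r in table.values())
row_tot = {r: sum(table[r].values()) for r in rows}
col_tot = {c: sum(table[r][c] for r in rows) for c in cols}

chi2 = 0.0
resid = {}
for r in rows:
    for c in cols:
        expected = row_tot[r] * col_tot[c] / n
        resid[r, c] = (table[r][c] - expected) / expected ** 0.5  # standardized residual
        chi2 += resid[r, c] ** 2

print(f"chi2 = {chi2:.2f}")
print(f"early/ACCEPT residual = {resid['early', 'ACCEPT']:.1f}")
```

Large positive residuals mark the cells (e.g. early peak with ACCEPT) that are overrepresented relative to independence.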
Discussion
The most interesting result in this experiment
from both a spoken dialogue system perspective
and a prosody modeling framework concerns
the strong relationship between intonational
form and meaning. For these single-word utterances …
Acknowledgements
This research was carried out at the Centre for
Speech Technology, a competence centre at
KTH, supported by VINNOVA (The Swedish
Agency for Innovation Systems), KTH and participating Swedish companies and organizations, and was also supported by the EU project
CHIL (IP506909).
References
Allwood, J., Nivre, J., & Ahlsén, E. (1992). On
the semantics and pragmatics of linguistic
feedback. Journal of Semantics, 9, 1-26.
Bolinger, D. (1989). Intonation and its uses:
Melody in grammar and discourse. London:
Edward Arnold.
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Edlund, J., Skantze, G., & Carlson, R. (2004).
Higgins - a spoken dialogue system for investigating error handling techniques. In
Proceedings of ICSLP, 229-231.
Filipsson, M. & Bruce, G. (1997). LUKAS - a
preliminary report on a new Swedish speech
synthesis. Working Papers 46, Department
of Linguistics and Phonetics, Lund University.
Gabsdil, M. (2003). Clarification in spoken dialogue systems. In Proceedings of the AAAI
spring symposium on natural language generation in spoken and written dialogue.
Gårding, E. (1998). Intonation in Swedish. In
D. Hirst and A. Di Cristo (eds.) Intonation
Systems. Cambridge: Cambridge University
Press, 112-130.
Ginzburg, J. & Cooper, R. (2001). Resolving
ellipsis in clarification. In Proceedings of
the 39th meeting of the ACL.
House, D. (2003). Perceiving question intonation: the role of pre-focal pause and delayed
focal peak. In Proc 15th ICPhS, Barcelona,
755-758
Ladd, D. R. (1996). Intonational phonology.
Cambridge: Cambridge University Press.
Purver, M., Ginzburg, J., & Healey, P. (2001).
On the means for clarification in dialogue.
In Proceedings of SIGDial.
Rodriguez, K. J. & Schlangen, D. (2004). Form,
intonation and function of clarification requests in German task oriented spoken dialogues. In Proceedings of Catalog '04 (The
8th Workshop on the Semantics and Pragmatics of Dialogue, SemDial04), Barcelona,
Spain.
Schlangen, D. (2004). Causes and strategies for
requesting clarification in dialogue. In Proceedings of SIGDial.
Sjölander, K. & Beskow, J. (2000). WaveSurfer - a public domain speech tool. In Proceedings of ICSLP 2000, 4, 464-467, Beijing,
China.
Abstract
Three different scales which have been used to
measure perceived prominence are evaluated in
a perceptual experiment. Average scores of
raters using a multi-level (31-point) scale, a simple binary (2-point) scale and an intermediate
4-point scale are almost identical. The potentially finer gradation possible with the multi-level scale(s) is compensated for by having multiple listeners, which is also a requirement for
obtaining reliable data. In other words, a high
number of levels is neither a sufficient nor a necessary requirement. Overall the best results were
obtained using the 4-point scale, and there seems
to be little justification for using a 31-point scale.
Introduction
The purpose of this paper is to evaluate the use of different scales for measuring the perceived prominence of syllables and words. In this investigation only word-level prominence is considered.
Prominence, as perceived by groups of raters, has been measured on different types of scale. Some use a 31-point scale from 0 to 30, first described in Fant & Kruckenberg (1989). The strength of this scale is that it allows for very fine gradation of the perceived prominence, even for a single rater, but this also makes the task quite difficult. Others, e.g. Wightman (1993), have proposed to use instead a simple binary (2-point) scale (0 or 1) and to use the cumulative (or average) score of each word as an expression of its level of prominence, which results in a much simpler task for the raters. The disadvantage of this simple scale is that it may force raters to conflate items which they perceive as different, but within the same category, which could lead to a reduced or lost ability to distinguish variations in perceived prominence at either end of the prominence continuum, for example between accented words with or without special emphasis. In addition, the level of gradation achievable with this scale is directly proportional to the number of raters: to get the same gradation as is (potentially) possible with the scale from 0 to 30 you need 30 raters. As a possible compromise between these two scales one could use a 4-point scale (e.g. from 0 to 3). While this scale is much simpler …
Method
The speech material chosen to evaluate the scales was two short monologues from the Danish DanPASS project (http://www.cphling.dk/pers/ng/danpass.htm), both recordings of a map task activity. The two monologues, by two different male speakers, included a total of 123 words. The monologues were divided into shorter phrases which were presented via a web page (one phrase per page). The raters could hear the phrase as many times as they wanted by pressing a play button, and indicated their judgment by clicking the appropriate scale point. Time consumption and a count of sound file playbacks were recorded for each phrase.
A large group of raters participated in the experiment and were randomly assigned to a specific scale. Equally sized groups of 19 raters (the size of the smallest group) were selected for the analyses. The instructions to the raters were presented from the web page and were identical for all three groups, except for the details about the specific scale. The concept of prominence was explained and exemplified, and raters were advised that prominence might be a question of more or less. 0 represented no prominence, but no other scale points were defined. Prominent words could be assigned values up to the scale maximum. Raters using the 2-point scale were informed that they could not grade their ratings but were given a forced choice.
Results
Reliability
Note: the phrase "the 2/4/31-point scale" is used in the following as shorthand for the prominence ratings obtained from the group of listeners using the 2-, 4- or 31-point scale.

The reliability of the data was tested by calculating Cronbach's α coefficient, which expresses the extent to which the scores of the individual raters covary. The coefficients for all three groups are high (from 0.94 to 0.96) and the difference between them is nonsignificant (M = 1.02, p > 0.05).
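As a sketch of how this reliability coefficient is computed (the formula below is the standard Cronbach's α; the toy scores are invented for illustration, not the study's data):

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (items x raters) matrix of scores.

    alpha = (k / (k - 1)) * (1 - sum(rater variances) / variance of totals),
    where k is the number of raters and the totals are the per-item sums
    of all raters' scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                       # number of raters
    item_totals = ratings.sum(axis=1)          # summed score per word
    rater_vars = ratings.var(axis=0, ddof=1)   # variance of each rater's scores
    total_var = item_totals.var(ddof=1)
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Toy data: six words rated by three raters on a 0-3 scale (hypothetical).
scores = [[0, 0, 1],
          [3, 2, 3],
          [1, 0, 1],
          [2, 2, 3],
          [0, 1, 0],
          [3, 3, 2]]
print(round(cronbach_alpha(scores), 2))  # high agreement -> alpha near 1
```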
Comparison of prominence ratings
The first question to be addressed is whether the prominence ratings on the three scales express the same relations between words. In order to make direct comparisons, all scores were normalised by dividing each value by the scale maximum (1, 3 or 30, respectively), which fits all data to a normalised scale of 0 to 1 without affecting the relations between scores. These values were then plotted on a line chart for simple
visual inspection. An example diagram of one
phrase is shown in Fig. 1.
[Figure 1: Normalised perceived prominence on the 2-, 4- and 31-point scales for the phrase "til du kommer til det næste kryds" ('till you come to the next intersection'); graphic not reproduced.]
The diagrams showed a high level of agreement across the three scales, which was further tested in a correlation analysis (Spearman's ρ). The result can be seen in Table 1.
Table 1: Correlation coefficients (Spearman's ρ) across all three scales.

Correlation   4-pt    31-pt
2-pt          0.933   0.926
4-pt                  0.964
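The rank-correlation computation can be sketched with SciPy (the values below are invented for illustration; Spearman's ρ depends only on rank order, not absolute values):

```python
from scipy.stats import spearmanr

# Hypothetical normalised mean prominence for six words on two scales.
two_pt = [0.95, 0.60, 0.30, 0.85, 0.10, 0.45]
four_pt = [0.90, 0.55, 0.35, 0.80, 0.05, 0.40]

rho, p = spearmanr(two_pt, four_pt)
print(rho)  # identical rank order despite different absolute values
```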
All correlations are high; the strongest is between the 4-point and the 31-point scale. The preliminary conclusion is clear: raters arrive at approximately the same rank order of perceived prominence regardless of the scale used.
It appears from Fig. 1 that the 2-point scale displays somewhat larger variation in values between the scale minimum and maximum than the 4-point scale and especially the 31-point scale. This was in fact a general trend, demonstrating a certain compression of values on the 31-point scale (and to a lesser degree the 4-point scale), while the 2-point scale has more mean values near the scale extremes. Analyses of the distribution of scores (inter-quartile range for each rater and visual inspection of x-y plots) showed that many raters on the 31-point scale assigned most ratings to a restricted, sometimes very restricted, range of the scale, either at the lower, the middle or the higher end of the scale. There are therefore no mean values at the scale extremes, although there were many individual scores near the minimum and maximum values.
Obtaining significant differences
One very important aspect of choosing a scale is
whether it will affect the ability to obtain statistically significant differences between test items.
The hypothesis might be that scales with too few
points (most notably the 2-point scale) would
mask subtle perceptual differences which could
be brought out with more scale points.
The suitability of the three scales for quantitative analysis was tested by examining the association between perceived prominence and three linguistic phenomena: part-of-speech membership, information structure and a specific acoustic correlate, namely F0. The purpose was to see whether data obtained using the three different scales lead to different conclusions about linguistic structure.
Comment on the statistical procedures
Since it is not possible to compare results directly across scale types, we simply decided to use the statistical procedures felt to be most appropriate for each individual scale. This resembles the choice researchers themselves face when deciding on a scale type.
For all scales we have decided to use nonparametric methods. For significance testing on
the 2-point scale we use the Fisher exact test or
a chi-square test with corrections for continuity
(when n > 40), and for the other two scales we
use the Wilcoxon-Mann-Whitney test with correction for ties (WMW).
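This testing scheme can be sketched with SciPy (counts and scores below are invented; the n > 40 rule for choosing the chi-square test follows the text):

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency, mannwhitneyu

# Binary (2-point) scale: compare 2x2 counts of prominent vs. non-prominent
# votes for two word classes (counts invented for illustration).
table = np.array([[30, 8],     # class A: prominent, not prominent
                  [12, 26]])   # class B
if table.sum() <= 40:
    _, p_binary = fisher_exact(table)
else:
    # Chi-square test with Yates continuity correction, as in the text.
    _, p_binary, _, _ = chi2_contingency(table, correction=True)

# Graded (4- and 31-point) scales: compare scores directly with the
# Wilcoxon-Mann-Whitney test; SciPy corrects for ties automatically.
class_a = [0.9, 0.8, 0.7, 0.9, 0.6]   # invented normalised scores
class_b = [0.3, 0.2, 0.4, 0.3, 0.1]
_, p_graded = mannwhitneyu(class_a, class_b, alternative="two-sided")
print(p_binary < 0.05, p_graded < 0.05)
```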
Table 2: Mean normalised prominence (x̄) of nine parts of speech, ranked on each scale. Nonsignificant differences between adjacent classes: 2-point: int-adv, v-pron, conj-art, art-prep; 4-point: v-pron, conj-art; 31-point: n-int, pron-v, conj-art.

Rank  2-point        4-point        31-point
1     adj   0.92     adj   0.73     adj   0.67
2     n     0.78     n     0.66     n     0.63
3     int   0.60     int   0.50     int   0.58
4     adv   0.58     adv   0.38     adv   0.40
5     v     0.34     v     0.30     pron  0.35
6     pron  0.33     pron  0.30     v     0.35
7     conj  0.17     prep  0.21     prep  0.28
8     art   0.13     conj  0.13     conj  0.24
9     prep  0.10     art   0.12     art   0.22

Number of words per class (n): adjectives 9, nouns 28, interjections 3, adverbs 12, verbs 13, pronouns 16, conjunctions 10, articles 2, prepositions 30.

Parts of speech

The mean prominence ratings of nine parts of speech are listed in Table 2, ordered according to their ranking on each scale. These rankings are very similar for all three scales. The only difference which can be detected is the relegation of prepositions to ninth place on the 2-point scale, instead of the seventh place it holds on the other two scales. (The different ranking of pronouns and verbs on the 31-point scale is irrelevant.) Most of the differences between the classes are significant: except for two cases on the 31-point scale (see the table caption) all differences between classes which are not adjacent in the rankings are significant, and of the differences between adjacent classes four are nonsignificant on the 2-point scale, two are nonsignificant on the 4-point scale, and three are nonsignificant on the 31-point scale (giving a total of five differences which are not significant for this scale). These figures are quite similar, with a small bias in favour of the 4-point scale, where the highest number of significant differences was found.

Information structure

Chafe (1994) states that new information is more prominent than non-new information. To test the validity of this statement we compared the prominence ratings of all words carrying new information with the most prominent word carrying non-new information in the same phrase (20 cases), thus testing the hypothesis that new information is more prominent than other information (H1). H0 states that the perceived prominence of the new information is less than or equal to that of the given/accessible information.

In four cases (three on the 31-point scale) the new information is not more prominent than the non-new information, in which case H0 cannot be rejected.
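One simple way to test H1 against H0 here (not necessarily the authors' exact procedure) is a sign test on the number of phrases going in the predicted direction; the 16-of-20 count below follows the four exceptions reported in the text:

```python
from scipy.stats import binomtest

# Sign test sketch for H1 (new information more prominent): 16 of the 20
# phrase comparisons go in the predicted direction. The choice of a sign
# test here is illustrative, not taken from the paper.
result = binomtest(16, n=20, p=0.5, alternative="greater")
print(result.pvalue < 0.05)  # True: H0 rejected at the 5% level
```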
Scale   2-pt    4-pt    31-pt
        0.593   0.626   0.606
Two main questions were asked about the influence of scale type on ratings of perceived prominence: 1) do we get the same prominence relations in utterances, as expressed in mean values and rankings, and 2) does scale type affect our ability to make observations about statistically significant differences between words. The overall conclusion must be that the perceived prominence relations in the utterances are very similar whether expressed on a 2-point scale, a 4-point scale or a 31-point scale. The differences are small and are mostly caused by a tendency for some raters to prefer a restricted range within a multi-level scale. The differences are also relatively small when it comes to statistical testing of observations, but it does seem that raising the number of scale points from two to four yields slightly better results: there are more significant differences between the part-of-speech classes.

References

Fant, G. and Kruckenberg, A., Preliminaries to the study of Swedish prose reading and reading style, STL-QPSR 2/1989, 1-83, 1989.

Wightman, C., Perception of multiple levels of prominence in spontaneous speech, ASA 126th Meeting, Denver, 1993 (abstract).

Chafe, W., Discourse, Consciousness, and Time: The Flow and Displacement of Conscious Experience in Speaking and Writing, The University of Chicago Press, 1994.
Abstract
Is the duration of the post-vocalic consonant in stressed syllables an important property when teaching Swedish as an L2? Is it a cue to the discrimination of /V:C/ and /VC:/ words, a buffer for proper syllable duration, or both? Four Swedish words, providing two minimal pairs with respect to phonological quantity and containing the vowel phonemes /ɛ/ and /ʉ/, were gradually changed temporally from /V:C/ to /VC:/ and vice versa. Manipulations of durations were made in two series: one changing vowel duration only, and one changing vowel and consonant duration in combination. 30 native Swedish listeners decided whether they perceived the test words as the original quantity type or not. The results show that the duration of the post-vocalic consonant had substantial influence on how the listeners categorized the test words. The study also includes naturalness judgements of the test words, and here the proper post-vocalic consonant duration had a positive influence on the listeners' judgements of naturalness for /ɛ/ but not for /ʉ/.
Introduction

Teaching and learning the pronunciation of a second language comprises many considerations as to which phonetic features are more or less important in order, on the one hand, to make oneself understood and, on the other, not to disturb the listener. Bannert (1984) states: "… to improve pronunciation when learning a foreign language, linguistic correctness has been the guiding principle. It seems, however, that hardly any consideration has been given to the native listener's problems of understanding foreign accent."
In the past 20-25 years, a simplified description of Swedish prosody for pedagogical use has appeared in a number of teaching media (Kjellin 1978, Fasth & Kannermark 1989, Slagbrand & Thorén 1997). The description is …
Method
Stimuli
The test words in the present study are mäta [ˈmɛːta] 'to measure', mätta [ˈmɛtːa] 'to satisfy', skuta [ˈskʉːta] 'boat' and skutta [ˈskɵtːa] 'to scamper'. These words provide two minimal pairs with respect to phonological quantity. One pair contains the vowel phoneme /ɛ/, and the other pair contains the vowel phoneme /ʉ/. The words were recorded in a fairly damped room in the present author's home, using a RØDE NT3 condenser microphone and a Sony MZ-N710 mini-disc player. The speaker was a Swedish male speaking central standard Swedish (Stockholm variety). The test words were pronounced within a carrier phrase: Det var … jag menade 'It was … that I meant'.
Vowel and consonant durations in the test words were manipulated in Praat (Boersma & Weenink 2001). All stimuli were given stepwise vowel duration changes. Half of the stimuli kept a constant consonant duration, identical to that of the original quantity type, and the other half were given stepwise consonant duration changes based on the original values for the non-original quantity type. The manipulated durations are shown in Table 1.
Listeners
30 native speakers of Swedish listened to the 48 stimulus words, marking whether they perceived them as /V:C/ or /VC:/. The listeners were between 23 and 60 years of age and had different regional varieties of Swedish as their L1. None of them had any hearing deficiencies that affected their perception of normal speech.
Presentation
The 48 stimuli were presented in random order, in the carrier phrase, preceded by the reading of the stimulus number. The test was presented from …
Table 1. Manipulated vowel (V) and consonant (C) durations (ms). In the vowel-only series the consonant is held at the duration of the original quantity type; in the combined series V and C change complementarily.

Vowel-only series:
mäta [ˈmɛːta]:   V 188, 168, 148, 128, 108, 88;   C constant 153
mätta [ˈmɛtːa]:  V 136, 156, 176, 196, 216, 236;  C constant 334
skuta [ˈskʉːta]: V 141, 121, 101, 81, 61, 41;     C constant 166
skutta [ˈskɵtːa]: V 166, 186, 206, 226, 246, 266; C constant 312

Combined series (V, C):
mäta:   (188, 234), (168, 254), (148, 274), (128, 294), (108, 314), (88, 334)
mätta:  (136, 253), (156, 233), (176, 213), (196, 193), (216, 173), (236, 153)
skuta:  (141, 232), (121, 252), (101, 272), (81, 292), (61, 312), (41, 332)
skutta: (166, 246), (186, 226), (206, 206), (226, 186), (246, 166), (266, 146)
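The stepwise series in Table 1 can be generated programmatically; the sketch below is illustrative (the function is not the authors' tooling, and the word and values follow the table above):

```python
def duration_continuum(v_start, v_end, c_start, c_end, steps=6):
    """Return stepwise (V, C) duration pairs in ms, linearly interpolated.

    For the vowel-only series, pass c_start == c_end (constant consonant);
    for the combined series, V and C change complementarily.
    """
    pairs = []
    for i in range(steps):
        frac = i / (steps - 1)
        v = round(v_start + frac * (v_end - v_start))
        c = round(c_start + frac * (c_end - c_start))
        pairs.append((v, c))
    return pairs

# Combined series for mäta: vowel 188 -> 88 ms while C grows 234 -> 334 ms.
print(duration_continuum(188, 88, 234, 334))
# [(188, 234), (168, 254), (148, 274), (128, 294), (108, 314), (88, 334)]
```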
Result
In both the vowel lengthening series and the
vowel shortening series, the complementary
consonant manipulation seems to have distinct
influence on the listeners perception of /VC/ or
/VC/ (figure 1 and 2). Listeners start to perceive stimuli as non-original quantity type at
lower degree of vowel duration change when
the post-vocalic consonant duration follows the
complementary pattern.
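One way to quantify such a category shift (illustrative only: the response proportions below are invented, and this analysis is not taken from the paper) is to fit a logistic psychometric function to each series and compare the 50% crossover points:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical proportions of /VC:/ responses as the vowel is shortened.
vowel_dur = np.array([188.0, 168.0, 148.0, 128.0, 108.0, 88.0])
p_vcc = np.array([0.05, 0.10, 0.30, 0.70, 0.90, 0.97])

def psychometric(x, x0, k):
    """P(/VC:/ response): decreases as vowel duration x grows past x0."""
    return 1.0 / (1.0 + np.exp(k * (x - x0)))

(x0, k), _ = curve_fit(psychometric, vowel_dur, p_vcc, p0=[140.0, 0.1])
print(round(x0))  # estimated category boundary in ms
```

A shift of the fitted boundary between the vowel-only and combined series would express the consonant's contribution in milliseconds.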
[Figures 1 and 2: number of listeners perceiving the non-original quantity type as a function of vowel duration (ms); graphics not reproduced.]

For //, the complementary manipulation seems to make …
References

Bannert R. (1984) Prosody and intelligibility of Swedish spoken with a foreign accent. Nordic Prosody III. Acta Universitatis Umensis, Umeå Studies in the Humanities 59, 7-18.

Bannert R. (1986) From prominent syllables to a skeleton of meaning: a model of prosodically guided speech recognition. In Proceedings of the XIth ICPhS, Tallinn, 73-76.

Behne D., Czigler P. and Sullivan K. (1997) Swedish quantity and quality: a traditional issue revisited. In Phonum 4, Dept of Linguistics, Umeå University.

Boersma P. & Weenink D. (2001) Praat: a system for doing phonetics by computer. http://www.fon.hum.uva.nl/praat/

Fant G. and Kruckenberg A. (1994) Notes on stress and word accent in Swedish. KTH, Speech Transmission Laboratory, Quarterly Progress and Status Report 2-3, 125-144.

Fasth C. & Kannermark A. (1989) Goda grunder. Kursverksamhetens förlag, Lund.

Gårding E. (1974) Den efterhängsna prosodin. In: Teleman & Hultman (eds.) Språket i bruk. Liber, Lund.

Hadding-Koch K. & Abramson A. (1964) Duration versus spectrum in Swedish vowels: some perceptual experiments. Studia Linguistica 18, 94-107.

Kjellin O. (1978) Svensk prosodi i praktiken. Hallgren & Fallgrens studieförlag, Uppsala.

Slagbrand Y. & Thorén B. (1997) Övningar i svensk basprosodi. Semikolon, Boden.

Thorén B. (2001) Vem vinner på längden? Två experiment med manipulerad duration i betonade stavelser. D-uppsats i fonetik. Institutionen för filosofi och lingvistik, Umeå universitet.

Thorén B. (2003) Can V/C-ratio alone be sufficient for discrimination of V:C/VC: in Swedish? A perception test with manipulated durations. Phonum 9 (Department of Phonetics, Umeå University), 49-52.
Conclusion

The result shows that the duration of the post-vocalic consonant is more than a means to assign the proper length to stressed syllables. It obviously plays a distinctive role for the perception of quantity type in the present material. Since the vowels involved represent the maximal (/ʉ/) and the minimal (/ɛ/) spectral differences between long and short vowel allophones in the Swedish vowel inventory, the result indicates that the duration of the post-vocalic consonant functions as a general complementary cue to the perception of quantity type in Swedish.

The ambiguous contribution of correct consonant duration to naturalness for /ʉ/ can probably be accounted for by the naturalness already damaged by changing durations while leaving spectral properties intact. In the case of /ɛ/, the listeners were probably not disturbed by an incorrect vowel timbre, and could consequently appreciate the adjusted consonant duration more easily.

Since there is already ample evidence for the greater duration of stressed syllables in Swedish, it can be assumed that the duration of the post-vocalic consonant contributes to the perception of quantity and word stress, and in most cases to improved naturalness. This in turn makes it reasonable to regard both vowel and consonant duration as important properties when learning Swedish as a second language.
Abstract
In this paper, an experiment is reported which
was carried out to investigate gender differences in the ability to infer emotional content
from speech. Fourteen professional actors
(eight men, six women) produced simulated
emotional speech data representing the most
important basic emotions (three emotions in
addition to neutral). Each emotion was simulated when reading aloud a semantically neutral text. Fifty-one listeners (27 males, 24 females) were asked to listen to the speech samples and choose (among the four options) the
most appropriate emotional label describing
the simulated emotional state. The female listeners were consistently better at discriminating the emotional state from speech than the
male subjects. The results suggest that females
are emotionally more sensitive than males, as
far as emotion recognition from voice is concerned.
Introduction
Phoneticians, speech scientists and engineers
are taking increasing interest in the role of the
expression of emotion in speech communication. In addition to so-called basic emotions,
other global speaker-states are investigated, for
example, irritation and trouble in communication (Batliner et al. 2003). A major approach in
basic (phonetic) research has been to investigate the vocal parameters of specific emotions,
and these parameters are now understood relatively well. Nowadays, the role of the vocal expression of emotion is gaining increasing importance in the computer speech community,
for example, in the applied context of the automatic discrimination/classification of emotional
content from speech (ten Bosch 2003). It can be
argued that, after a long exploratory stage, the
study of the vocal expression of emotion is
reaching a level of maturity where the main focus is on important applications, particularly
those involving human-computer interaction.
In the study of the vocal communication of
emotion, an important taking-off point is the
base-line data, i.e. the human emotion discrimination performance level. There is now a
relatively large literature on the human
discrimination of emotions from speech:
reviewing over 30 studies of the subject
conducted up to the 1980s, Scherer (1989)
concludes that an average accuracy percentage
of about 60 % can be obtained in experiments
where listeners are to infer emotional content
from vocal cues only (without any help from
lexis etc.). In a recent large-scale cross-cultural
study (Scherer et al. 2001), an accuracy level
rate of 66 % was found, across emotions
(neutral, anger, fear, joy, sadness and surprise)
and cultural contexts (Europe, Asia and the
US). In a western cultural context, vocal
recognition of six emotions (neutral, anger,
fear, joy, sadness and disgust) was 62 %.
Typically, in investigations of the human
discrimination of emotions, a standard speech
sample (an utterance or a short passage) is
used: the same lexical content is produced (often by actors) with different simulated emotions
and test subjects are asked to choose the most
appropriate emotional label for each sample
(among the intended emotion categories). The
emotions investigated in these studies usually represent basic emotions: it is argued that certain emotions (at least fear, anger, happiness, sadness, surprise and disgust) are the most important or basic emotions, because they are seen to represent survival-related patterns of responses to events in the environment.
Although the vocal expression of emotions
has been investigated rather intensively, at least
as far as simulated data is concerned (and empirical evidence has cumulated indicating how
well basic emotions can be discriminated by
human listeners in different cultures), there has
been little research on inter-subject differences
(within a culture) in emotion discrimination
ability. Usually, the emotion recognition performance level of a group of test subjects is reported as a single numerical value, without
making any intra-group distinctions. Thus there
is very little reported empirical evidence concerning possible differences between female
Results
Tables 1-9 show the results of the experiment:
the emotion discrimination performance of the
subjects is first presented in toto (female and
male subjects listening to female and male
speakers), and then the results are broken down
into sub-categories (females listening to all
speakers, females listening to females only,
etc.). Each table is a confusion matrix, where
the column on the left indicates the intended
emotions and the rows indicate the recognized
emotions. The underlined percentages indicate
the average discrimination accuracy for each
specific emotion. The average emotion recognition performance level in each setting is given
as the TOTAL percentage.
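The per-emotion accuracies (the underlined diagonal cells) and an overall score can be read off such a matrix programmatically; the sketch below uses the three rows of Table 1 that survive in this copy and takes the unweighted mean as the overall figure (the paper's TOTAL may be computed differently):

```python
# Rows: intended emotion; columns: recognised emotion (row percentages
# from Table 1; the neutral row is incomplete in this copy and omitted).
cols = ["Neutral", "Sad", "Angry", "Happy"]
rows = {
    "Sad":   [12.9, 85.3, 1.0, 0.8],
    "Angry": [14.9, 2.9, 76.9, 5.3],
    "Happy": [24.3, 5.4, 3.3, 67.0],
}

# Diagonal cell = percentage of samples whose intended emotion was recognised.
accuracy = {emo: vals[cols.index(emo)] for emo, vals in rows.items()}
overall = sum(accuracy.values()) / len(accuracy)
print(accuracy, round(overall, 1))
```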
Speech data
For the purposes of the present study, simulated
emotional speech data was collected. Fourteen
professional actors (eight men, six women)
produced the speech data. The speakers were
aged between 26 and 50 (average age was 39);
all were speakers of the same northern variety
of Finnish. The speakers read out a phonetically rich Finnish passage of some 120 words
simulating three basic emotions, in addition to
neutral: sadness, anger and happiness/joy. The
text was emotionally completely neutral, representing matter-of-fact newspaper prose. The
recordings were made in an anechoic chamber
using a high quality condenser microphone and
a DAT recorder to obtain a 48 kHz, 16-bit recording. The data was stored in a PC as wav
format files. Each monologue was divided into
five consecutive segments of equal duration for
discrimination experiment purposes: thus there
were a total of 280 emotional speech samples
with an average duration of 13 seconds (five
samples for four emotions by fourteen speakers).
Human discrimination experiments

A performance test for human emotion discrimination was performed in the form of listening tests. The listeners were students in a junior high school, aged between 14 and 15. Fifty-one subjects (27 males, 24 females) participated as volunteers. All were speakers of the same northern variety of Finnish (as were the actors). The listening tests took place in a classroom where the subjects heard the speech data (280 samples) …

Table 1. Emotion discrimination from voice: females and males listening to females and males.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.6 %    2.1 %
Sad         12.9 %    85.3 %   1.0 %    0.8 %
Angry       14.9 %    2.9 %    76.9 %   5.3 %
Happy       24.3 %    5.4 %    3.3 %    67.0 %
Table 2. Emotion discrimination from voice: females and males listening to males.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.9 %    2.2 %
Sad         14.7 %    83.9 %   1.1 %    0.4 %
Angry       14.6 %    1.2 %    78.2 %   6.0 %
Happy       26.2 %    5.3 %    3.6 %    64.9 %
120
Table 3. Emotion discrimination from voice: females and males listening to females.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.2 %    2.0 %
Sad         10.5 %    87.1 %   0.9 %    1.4 %
Angry       15.3 %    5.3 %    75.2 %   4.2 %
Happy       21.7 %    5.5 %    3.0 %    69.8 %

Table 4. Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        3.7 %    2.5 %
Sad         15.9 %    81.8 %   1.8 %    0.5 %
Angry       16.4 %    1.8 %    74.9 %   6.9 %
Happy       28.3 %    5.6 %    4.7 %    61.4 %
Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        3.3 %    2.4 %
Sad         14.4 %    83.0 %   1.6 %    1.0 %
Angry       16.8 %    3.8 %    73.3 %   6.1 %
Happy       26.8 %    6.0 %    4.3 %    62.8 %
Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.0 %    1.9 %
Sad         13.2 %    86.3 %   0.2 %    0.2 %
Angry       12.5 %    0.5 %    82.0 %   5.1 %
Happy       23.9 %    4.9 %    2.4 %    68.8 %
Table 5. Emotion discrimination from voice: females listening to females and males.
Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        1.8 %    1.8 %
Sad         11.1 %    87.9 %   0.3 %    0.7 %
Angry       12.7 %    1.9 %    81.1 %   4.3 %
Happy       21.4 %    4.6 %    2.2 %    71.7 %

Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        2.7 %    2.3 %
Sad         12.5 %    84.6 %   1.3 %    1.6 %
Angry       17.3 %    6.5 %    71.2 %   4.9 %
Happy       24.9 %    6.6 %    3.9 %    64.7 %
Emotion discrimination from voice (original caption not preserved).

Intended    Neutral   Sad      Angry    Happy
Neutral     …         …        1.6 %    1.7 %
Sad         8.2 %     90.1 %   0.5 %    1.3 %
Angry       12.9 %    3.8 %    80.0 %   3.3 %
Happy       18.2 %    4.2 %    2.0 %    75.6 %
… than the male listeners/speakers is not surprising. A relevant concept in this context may in fact be empathy. Psychological research (see e.g. Tannen 1991) has shown that female superiority in empathizing is manifested in interaction by, for example, the following trends: females' speech involves much more direct talk about feelings and affective states than "guy talk", females are usually more co-operative and reciprocal in conversation than males, and females are much quicker to respond empathically/emotionally to the distress of other people. It has been shown that, from birth, females look longer at faces, and particularly at people's eyes, while males are more prone to look at inanimate objects (Connellan et al. 2001).
The results of this study support the consensus view that, emotionally, females are more
sensitive than males; this time concrete evidence is presented for the vocal (prosodic, nonlexical) communication of emotion. To draw
more far-reaching conclusions, however, we
need more speakers to produce the speech data,
so that we can exclude the possible effect of
speaker-specific idiosyncrasies on the results of
the listening tests.
References

Batliner A., Fischer K., Huber R., Spilker J. and Nöth E. (2003) How to find trouble in communication. Speech Communication 40, 117-143.

Connellan J., Baron-Cohen S., Wheelwright S., Batki A. and Ahluwalia J. (2001) Sex differences in human neonatal social perception. Infant Behavior and Development 23, 113-118.

Scherer K. R. (1989) Vocal correlates of emotion. In Wagner H. and Manstead A. (eds.) Handbook of Psychophysiology: Emotion and Social Behavior, 165-197. London: Wiley.

Scherer K. R., Banse R. and Wallbott H. G. (2001) Emotion inferences from vocal expression correlate across languages and cultures. Journal of Cross-Cultural Psychology 32, 76-92.

Tannen D. (1991) You Just Don't Understand: Women and Men in Conversation. London: Virago.

ten Bosch L. (2003) Emotions, speech and the ASR framework. Speech Communication 40, 213-225.
Abstract
Experimental methodology

The speech material under investigation consists of disyllabic nonsense words in the context of a meaningful carrier phrase. The words have a CVCV segmental structure where the first vowel (V) is one of the five Greek vowels, i.e. {i, e, a, o, u}, in the carrier phrase "to klab sVsa pezi kali musiki" ('The club sVsa plays good music'). The nonsense key words were produced with lexical stress either on the first or the second syllable, and the speech material was produced at normal tempo with no prosodic break, on an individual basis.
The speakers were six persons with cerebral
palsy dysfunction and six persons with no
known pathologies (henceforth called the mobility factor) with standard Athenian Greek
pronunciation. Each group was comprised of
three female and three male speakers.
Acoustic analysis was carried out with the
use of Wavesurfer and measurements were
made of the vowel durations from the waveform. The results were subjected to statistical
analysis with the StatView software package
and ANOVA tests were carried out.
In the remainder of this paper, the results
are presented next, followed by discussion and
conclusions.
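A minimal sketch of one such test (synthetic durations; the group means are invented, and the study itself used StatView rather than the library below):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Synthetic vowel durations (ms) for the two mobility groups.
cp = rng.normal(140.0, 25.0, 300)        # cerebral palsy group
control = rng.normal(115.0, 25.0, 300)   # no known pathologies

f, p = f_oneway(cp, control)  # one-way ANOVA, df = (1, 598)
print(p < 0.001)
```

The full design would add gender, stress and vowel category as factors, which calls for a multi-way ANOVA rather than this one-way comparison.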
Introduction
This is an experimental investigation of vowel
durations in Greek as a function of mobility,
gender, stress and vowel category. Two main
questions are addressed: (1) what are the effects
of the investigated factors? And (2) what are
the interactions between the factors?
Considerable research has been carried out
on Greek and contrastive prosody with regards
to temporal structures and vowel durations (see
e.g. Fourakis, 1986, Botinis, 1989). Thus, different vowels have different intrinsic durations, according to which low vowels are longer than high vowels, and back vowels tend to be longer than front vowels (Fourakis et al., 1999). Stress
has a temporal effect on vowels, according to
which stressed vowels are longer than unstressed vowels (Botinis, 1989, Botinis et al.,
2001, 2002). Gender has also a temporal effect
on vowels and thus vowels produced by female
speakers are longer than vowels produced by
male speakers. The effect of gender is a language-specific effect as it has been reported for
some languages, such as Greek and Albanian,
but not for others, such as English and Ukrainian (Botinis et al., 2003).
Our knowledge with regards to pathological
speech and cerebral palsy mobility dysfunction
is very limited and the main target thus of the
present investigation is to produce basic data
and initialize research on speech produced by
speakers with various pathologies.
Results
The results are presented in Figures 1-6, based
on the acoustic analysis and duration measurements of the total speech material in accordance
with the experimental methodology.
Figure 1 (next page) shows overall vowel
durations as a function of mobility and gender.
Vowels produced by speakers with cerebral
palsy were significantly longer than vowels
produced by speakers with no pathologies
(F(1,596)=40.08, p<.001). Vowels produced
by female speakers were longer than vowels
produced by male speakers (F(1,596)=14.18,
p<.001). The interaction was not significant.
[Figures 1-6: bar charts of vowel durations (ms) by mobility (cerebral palsy vs. normal), gender (female vs. male) and stress (stressed vs. unstressed); graphics not reproduced. Recoverable captions: Figure 3. Overall vowel durations (ms) as a function of stress. Figure 6. Individual vowel durations (ms) as a function of gender and stress.]
Discussion
In the present investigation, some old knowledge has been corroborated and some new
knowledge has been produced. The old knowledge concerns the vowel category durations as
well as the effects of stress and gender on
vowel durations and the new knowledge concerns the effects of cerebral palsy dysfunction
on vowel durations.
Our results indicate that the investigated factors of mobility, gender and stress have significant effects on vowel durations …
Conclusions

In accordance with the results of the present investigation, the following conclusions have been drawn: First, the mobility, gender and stress factors each have a significant effect on vowel durations. Second, there are significant interactions between mobility and lexical stress as well as between gender and lexical stress, but not between mobility and gender. Thus, both the mobility factor and the gender factor have considerably bigger effects on stressed syllables than on unstressed ones.
Acknowledgements
Our sincere thanks to the speakers with cerebral
palsy dysfunction as well as to our students at
the University of Athens for their participation
in the production experiments.
References

Beckman M. E. (1986) Stress and Non-stress Accent. Dordrecht: Foris.

Botinis A. (1989) Stress and Prosodic Structure in Greek. Lund University Press.

Botinis A., Bannert R., Fourakis M., and Dzimokas D. (2003) Multilingual focus production of female and male speakers. 6th International Congress of Greek Linguistics, University of Crete, Greece.

Botinis A., Bannert R., Fourakis M., and Pagoni-Tetlow S. (2002) Crosslinguistic segmental durations and prosodic typology. International Conference on Speech Prosody 2002, 183-186, Aix-en-Provence, France.

Botinis A., Fourakis M., Panagiotopoulou N., and Pouli K. (2001) Greek vowel durations and prosodic interactions. Glossologia 13, 101-123.

Botinis A., Fourakis M., and Prinou I. (1999) Prosodic effects on segmental durations in Greek. 6th European Conference on Speech Communication and Technology EUROSPEECH 1999, vol. 6, 2475-2478, Budapest, Hungary.

de Jong K. J. (2004) Stress, lexical focus, and segmental focus in English: patterns of variation in vowel duration. Journal of Phonetics 32, 493-516.

Di Cristo A., and Hirst D. J. (1986) Modelling French micromelody: analysis and synthesis. Phonetica 43, 11-30.

Fant G., Kruckenberg A., and Liljencrants J. (2000) Acoustic-phonetic analysis in Swedish. In Botinis A. (ed.) Intonation: Analysis, Modelling and Technology, 55-86. Dordrecht: Kluwer Academic Publishers.

Fant G., Kruckenberg A., and Nord L. (1991) Duration correlates of stress in Swedish, French and English. Journal of Phonetics 19, 351-365.

Fourakis M. (1986) An acoustic study of the effects of tempo and stress on segmental intervals in Modern Greek. Phonetica 43, 172-188.

Fourakis M., Botinis A., and Katsaiti M. (1999) Acoustic characteristics of Greek vowels. Phonetica 56, 28-43.

Sluijter A. (1995) Phonetic Correlates of Stress and Accent. The Hague: Holland Academic Graphics.
Abstract.
An acoustic study was carried out to determine whether the phenomenon of pharyngealization and/or velarization is confined to the emphatic consonant and the adjacent vowels, or whether it extends over the whole word in Arabic. F1 and F2 (Hz) of front unrounded vowels were measured in monosyllabic, bisyllabic and trisyllabic words of ISA (Iraqi Spoken Arabic) containing emphatic vs. non-emphatic consonants. The measurements showed significantly greater narrowing between F1 and F2 for vowels in the vicinity of emphatic consonants than for those in the vicinity of non-emphatic consonants. This is attributed to the secondary coarticulatory configuration formed in the pharyngeal region by the projection of the root of the tongue toward the back wall of the pharynx, and a possible lowering of the velum toward a rising tongue dorsum, which prevails, though at different levels of significance, over the other syllables of the word.
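The significance values in the tables below come from two-tailed Mann-Whitney U-tests on the F2-F1 measurements. As a sketch of how such an exact test can be computed for small samples, here is a minimal Python implementation; the token values and the group size of six are hypothetical (the paper reports only the resulting statistics). Note that with six tokens per group, the smallest attainable exact two-tailed p is 2/C(12,6) = 2/924, which rounds to the 0.002 recurring in the tables if six tokens per condition were indeed used (an assumption).

```python
from itertools import combinations
from math import comb

def u_statistic(x, y):
    """Mann-Whitney U for sample x: number of pairs (xi, yj) with
    xi > yj, counting ties as 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def mann_whitney_exact(x, y):
    """Exact two-tailed Mann-Whitney U test: enumerate every split of
    the pooled values into groups of the original sizes and count the
    splits whose U deviates from its mean at least as much as the
    observed U does."""
    pooled = list(x) + list(y)
    n1, n = len(x), len(x) + len(y)
    mean_u = n1 * len(y) / 2.0
    dev = abs(u_statistic(x, y) - mean_u)
    extreme = 0
    for idx in combinations(range(n), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(n) if i not in chosen]
        if abs(u_statistic(g1, g2) - mean_u) >= dev - 1e-9:
            extreme += 1
    return u_statistic(x, y), extreme / comb(n, n1)

# Hypothetical F2-F1 (Hz) tokens for one word pair: vowels next to an
# emphatic consonant show a much smaller F2-F1 than the plain tokens.
emphatic = [560, 575, 580, 590, 598, 602]
plain = [880, 885, 890, 896, 902, 910]
u, p = mann_whitney_exact(emphatic, plain)
print(u, round(p, 3))  # 0.0 0.002  (p = 2/924, the exact floor)
```

The enumeration is feasible only for small groups (C(12,6) = 924 splits here); for larger samples one would use the normal approximation instead.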
Monosyllabic words, F2-F1 (Hz) of V, emphatic vs. non-emphatic (U-test, 2-tailed):
1. /seef/ vs. /seef/: 1042 vs. 1571 (0.002, P<0.01)
2. /faad/ vs. /faad/: 582 vs. 896 (0.002, P<0.01)
Bisyllabic words, F2-F1 (Hz), emphatic vs. non-emphatic (U-test, 2-tailed):
1. /'saaib/ vs. /'saaib/: V1 665 vs. 1492 (0.002, P<0.01); V2 1492 vs. 1661 (0.04, P<0.05)
2. /'raakid/ vs. /'raakid/: V1 602 vs. 840 (0.002, P<0.01); V2 690 vs. 1161 (0.002, P<0.01)
3. /'faatir/ vs. /'faatir/: V1 747 vs. 898 (0.03, P<0.05); V2 824 vs. 1829 (0.001, P<0.01)

[Figure 1. F1 and F2 (Hz) of V1[a], V2[aa] and V3[i] in bisyllabic and trisyllabic words, compared with neutral F1/F2/F3 values, plotted against distance (mm).]
Trisyllabic words, F2-F1 (Hz), emphatic vs. non-emphatic (U-test, 2-tailed):
1. /ta'baaiir/ vs. /ta'baaiir/: V1 610 vs. 988 (0.002, P<0.01); V2 720 vs. 850 (0.009, P<0.01); V3 1818 vs. 1824 (0.8, P>0.05)
2. /fa'saail/ vs. /fa'saail/: V1 558 vs. 1008 (0.002, P<0.01); V2 627 vs. 931 (0.002, P<0.01); V3 1643 vs. 1776 (0.02, P<0.05)
3. /fa'raaid/ vs. /fa'raaid/: V1 536 vs. 695 (0.002, P<0.01); V2 500 vs. 683 (0.002, P<0.01); V3 806 vs. 1245 (0.002, P<0.01)
In Table 3, however, the picture is slightly different for trisyllabic words. The first word, /ta'baaiir/, for example, where the emphatic con-
Acknowledgements
I would like to thank Mohammed and Day Majeed for acting as informants for the speech material and to Raghad Majeed for her help in
bringing figure 1 to its final shape.
References
Abercrombie, D. (1967) Elements of General Phonetics. Edinburgh University Press.
Ali, L. H. and Daniloff, R. E. (1972) A cinefluorographic phonological investigation of emphatic sounds assimilation in Arabic. Proceedings of the 7th International Congress of Phonetic Sciences, Montreal 1971. Mouton, 639-648.
Fant G. (1958) Modern instruments and methods for acoustic studies of speech. Proceedings of the 8th International Congress of Linguistics, 282-358. Oslo University Press.
Fant G. (1960) Acoustic Theory of Speech Production. The Hague: Mouton.
Firth J. R. (1957) Sounds and prosodies. In Papers in Linguistics, 121-138. London: Oxford University Press.
Hassan Z. M. (1981) An experimental study of vowel duration in Iraqi Spoken Arabic. Unpublished Ph.D. thesis, Dept. of Linguistics & Phonetics, University of Leeds, U.K.
Hassan Z. M. (2002) Gemination in Swedish and Arabic with a particular reference to the preceding vowel duration: an instrumental and comparative approach. Proceedings of Fonetik 2002, TMH-QPSR 44, 81-85.
Hyman L. M. (1975) Phonology: Theory and Analysis. Holt, Rinehart and Winston.
Odisho, E. Y. (1975) The phonology and phonetics of Neo-Aramaic as spoken by the Assyrians in Iraq. Unpublished Ph.D. thesis, Dept. of Phonetics, University of Leeds.
Peterson, G. E. and Lehiste, I. (1960) Duration of syllable nuclei in English. Journal of the Acoustical Society of America 32 (6), 693-703.
Watson, Janet C. E. (2002) The Phonology and Morphology of Arabic. Oxford University Press.
Abstract
This paper refers to the forthcoming ISCA (International Speech Communication Association) Workshop "Experimental Linguistics" in Athens, Greece, in 2006. The major objectives
of the Workshop are (1) the development of experimental methodologies in linguistic research
with new perspectives for the study of language, (2) the unification of linguistic knowledge
in relation to linguistic theory based on experimental evidence, (3) the design of multifactor linguistic models and (4) the integration of
interdisciplinary expertise in linguistic applications. Key knowledge areas ranging from cognition and neurophysiology to perception and
psychology are organised in invited lectures as
well as in oral and poster presentations along
with interdisciplinary panel discussions.
Background
The present paper refers to the forthcoming workshop "Experimental Linguistics", which is
an ISCA (International Speech Communication
Association) interdisciplinary workshop, to be
held in Athens, Greece, in 2006 (workshop details along with paper submission procedures
will be announced later this year). The workshop is organised under the joint aegis of the
University of Athens, Department of Linguistics, Greece, The Ohio State University, Department of Speech and Hearing Science, USA,
and the University of Skövde, School of Humanities and Informatics, Sweden.
The scientific study of language is the
backbone of a variety of established disciplines,
among them theoretical linguistics, experimental phonetics, computational linguistics and
language technology. Language is a complex
code system, the study of which is related to a
variety of knowledge areas ranging from cognition and neurophysiology to perception and
psychology. The code system of language consists of functional categories in variable combinations and relations with multiple interactions,
which determine the linguistic structure and the
communicative function of language.
of motor commands from the brain to, and coordinated control of, the speech production
mechanism. The acoustic signal is processed
and decoded by the auditory system and the
perceptual mechanism, which ultimately extract
the intended meaning. Consequently, the mutual relation between acoustic signal and intended meaning may be considered the very
basic functional anchoring of linguistic structure and language communication.
There are however several discrepancies between acoustic signal and intended meaning,
the most characteristic of which are the continuous vs. discrete, as well as the one-to-many,
relationships between the two. Thus, the acoustic signal is basically a continuous process
whereas the intended meaning is a structural
unit which consists of discrete functional categories. On the other hand, some variations of
the acoustic signal, even large ones, may have
no functional effect, whereas other variations,
even small ones, may have critical effects on
the transmission of intended meanings. Also,
the same parameter of the acoustic signal may
have different effects, whereas different acoustic parameters may have the same effect on intended meanings in different contexts.
A typical example of the case above is segmental categories where duration or aspiration
parameters may independently or in combination determine variable distinctions such as in
stop consonants, and may have critical effects
on produced words and thus meanings in a variety of languages. Duration may determine
several distinctions, such as lexical stress and
other prosodic categories, which also have
critical effects on produced meanings. Sentence
types and intonation forms are also typical examples in which, independently or in combination with other linguistic markers, dissimilar
intonation forms may define the same sentence
type, and, inversely, different sentence types
may be defined by similar intonation forms in
different contexts.
In relation to the question above, "where is meaning derived from?", if we hypothesise that meaning is basically derived from the acoustic signal, we are consequently led to the question "how is meaning derived?". Linguistic theories
have historically posited a variety of functional
linguistic units such as phonemes, morphemes,
phrases and so forth, which the acoustic signal
is presumably organised into. However, how
much are these units determined by the acoustic
signal and how much by psycholinguistic proc-
Organisation
The Workshop is organized into invited speaker
lectures, original research presented in oral and
poster
presentations, and interdisciplinary
panel discussions. Both oral and poster papers
will undergo standard peer review by independent reviewers of the International Scientific Committee, and a post-workshop volume with representative papers from key knowledge areas is planned, in addition to the published proceedings that are common practice for ISCA workshops.
In accordance with the objectives and perspectives of the Workshop, key knowledge areas and language applications are shown in Table 1.
Table 1. Key linguistic areas and language applications of the Athens 2006 ISCA Workshop on Experimental linguistics.
1. Theory of language
2. Cognitive linguistics
3. Neurolinguistics
4. Speech production
5. Speech acoustics
6. Phonology
7. Morphology
8. Syntax
9. Prosody
10. Speech perception
11. Psycholinguistics
12. Pragmatics
13. Semantics
14. Discourse linguistics
15. Computational linguistics
16. Language technology
Key linguistic areas and language applications are organized into linguistic domains, rather than by theoretical premises and analysis methodologies, in order to allow for interdisciplinary approaches and interaction perspectives. Special attention is paid to the way we see and study language with reference to general linguistic theory, and to the relation and interaction between phonetic production and produced meaning. Extensive discussions of the theoretical assumptions of key linguistic areas as related to models and experimental methodol-
Outlook
Language has been studied throughout human
history with various, sometimes overlapping,
perspectives and objectives, and with various
applications in mind. The pursuit of linguistic
knowledge has been the driving force behind
the study of language as one of the most defining characteristics of human beings. Language
has been the primary means of communication
in human societies and there have been several
applications critical to the development of human societies as we know them.
Among these applications, three have
marked the route of human societies. First,
early insights in basic forms and functions of
phonetic systems led to the development of
writing systems and the acquisition of reading
and writing skills as part of educational systems, probably the most fundamental language
application. Second, basic knowledge of voice
characteristics and voice signal transmission led
to the development of telephone systems and
distant voice communication. Third, the growth
of information technology and the advent of the
Internet paved the way for a variety of language
applications. Thus, today, linguistic research and language technology set common goals and make joint efforts toward multifunctional language applications. However, the main precondition for achieving these goals and bringing these efforts to fruition is that solid theoretical knowledge meets basic scientific criteria and reflects linguistic reality.
As every era sets its conditions, our era sets
additional requirements and alternative methodologies for the study of language. A new
generation of linguists is to be educated and
equipped with a rich arsenal of experimental
methodologies and new perspectives in linguistic research, and the present Workshop is organised in order to discuss and set the groundwork for this in an interdisciplinary context.
Abstract
The most favoured solution to the problem of
quantity complementarity in Swedish has been
to claim that vowel length is phonemic and
consonant length is predictable (Linell, 1978).
Evidence from listeners' perceptual behaviour
supports this over the reverse claim that it is
only consonant length which is distinctive
(cited in Czigler, 1998), a position that has
nevertheless been argued for (Eliasson & La
Pelle, 1973). However, there is a phonological
cost: the vowel inventory must be doubled. We
present an analysis based on positional criteria
to account for the phonetic facts reported in
instrumental studies such as Czigler (1998),
Hassan (2002), Strangert & Wretling (2003),
without the cost of additional phonemes. It
takes Trubetzkoy's (1969) correlation of syllable contact and develops it according to more
recent functionalist principles of phonotactic
analysis (Mulder, 1989; Dickins, 1998). Vowel
and consonant length are predicted by whether
there is a consonant in the phonotactic position
immediately following the syllable nucleus.
Quantity complementarity in Swedish is compared to vowel and consonant length in Arabic
and shown to bear out Hassan's (2003: 48) assertion that "the phenomenon of length constitutes a systematic difference between the phonological systems of both languages".
lam [lɑːm] 'lame' vs. lamm [lamː] 'lamb'
väg [vɛːɡ] 'road' vs. vägg [vɛɡː] 'wall'
[...] 'buy' vs. köpt [...] 'bought'
[Phonotactic positions: onset | nucleus | postnuclear1 | postnuclear2]
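The positional analysis can be sketched as a toy rule (a hypothetical illustration under our reading of the abstract, not the authors' formalism): in a stressed syllable, the nucleus vowel is long exactly when nothing fills the position immediately following it (postnuclear1), and a consonant occupying postnuclear1 surfaces long. The `Syllable` data model below is invented for the sketch.

```python
from typing import Optional, NamedTuple

class Syllable(NamedTuple):
    """A stressed-syllable parse into the four phonotactic positions
    above (a simplified, hypothetical data model)."""
    onset: Optional[str]
    nucleus: str
    postnuclear1: Optional[str]
    postnuclear2: Optional[str]

def predict_quantity(syl: Syllable) -> dict:
    """Positional prediction of Swedish quantity in stressed syllables:
    the nucleus vowel is long exactly when postnuclear1 is empty; a
    consonant occupying postnuclear1 is realised long."""
    if syl.postnuclear1 is None:
        return {"vowel": "long", "long_consonant": None}
    return {"vowel": "short", "long_consonant": syl.postnuclear1}

# lamm 'lamb': /m/ in postnuclear1 -> short vowel, long consonant
lamm = Syllable("l", "a", "m", None)
# lam 'lame': postnuclear1 empty, /m/ in postnuclear2 -> long vowel
lam = Syllable("l", "a", None, "m")
print(predict_quantity(lamm))  # {'vowel': 'short', 'long_consonant': 'm'}
print(predict_quantity(lam))   # {'vowel': 'long', 'long_consonant': None}
```

Under this rule no extra long-vowel phonemes are needed: quantity falls out of which position the postnuclear consonant occupies, which is the point of the analysis.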
Notes
1. By "stressed syllable" we mean one that is not unstressed, i.e. it includes secondary stress (or what has been called "reduced main stress") as well as primary stress.
2. Witting (1977) cites moln 'cloud' as an example of CV:CC to argue that vowel length is not predictable when followed by a cluster, but the pronunciation [mo:ln] is described as a regional exception by Czigler (1998: 23) and marked as an exception by Linell (1978: 123); we therefore discount it.
References
Behne, D.M., Czigler, P.E. & Sullivan, K.P.H. (1996) Acoustic characteristics of perceived quantity in Swedish vowels. Speech Science and Technology 96 (Adelaide), 49-54.
Behne, D.M., Czigler, P.E. & Sullivan, K.P.H. (1998) Perceived vowel quantity in Swedish: effects of postvocalic voicing. Proceedings of the 16th International Congress of Acoustics and the 135th Meeting of the Acoustical Society of America (Seattle), 2963-64.
Czigler, P.E. (1998) Timing in Swedish VC(C) sequences. PHONUM 5, Dept of Phonetics, Umeå University.
Dickins, J. (1998) Extended Axiomatic Linguistics. Berlin: Mouton de Gruyter.
Elert, C.-C. (1964) Phonologic Studies of Quantity in Swedish. Uppsala: Almqvist & Wiksell.
Eliasson, S. (1978) Swedish quantity revisited. In Gårding, E., Bruce, G. & Bannert, R. (eds) Nordic Prosody. Dept of Linguistics, Lund University, 111-122.
Eliasson, S. & La Pelle, N. (1973) Generativa regler för svenskans kvantitet. Arkiv för nordisk filologi 88, 133-148.
Giegerich, H.J. (1992) English Phonology. Cambridge: Cambridge University Press.
Ghalib, G.B.M. (1984) An experimental study of consonant gemination in Iraqi Spoken Arabic. Unpublished PhD thesis, University of Leeds.
Author index

Ayusawa, Takako 103
Asu, Eva Liina 29
Bannert, Robert 75
Bjursäter, Ulla 55
Blomberg, Mats 51
Bodén, Petra 37
Bonsdroff, Lina 59
Botinis, Antonis 95, 99, 123, 131
Cerrato, Loredana 41
Charalabakis, Christoforos 131
Edlund, Jens 107
Eklund, Petra 63
Elenius, Daniel 51
Engstrand, Olle 59, 63, 67
Fourakis, Marios 123, 131
Fransson, Linna 79
Ganetsou, Stella 95
Gawronska, Barbara 131
Griva, Magda 95
Gunnarsdotter Grönberg, Anna 5
Gustafsson, Kerstin 63
Gustavsson, Lisa 83
Hassan, Zeki Majeed 127, 135
Heselwood, Barry 135
Hincks, Rebecca 45
House, David 107
Huber, Dieter 49
Ivachova, Ekaterina 63
Jande, Per-Anders 25
Jensen, Christian 111
Kadin, Germund 67
Karlsson, Fredrik 71
Karlsson, Åsa 63
Kim, Yuni 9
Klintfors, Eeva 83
Kostopoulos, Yannis 99
Krull, Diana 33
Lacerda, Francisco 55, 83
Lindh, Jonas 17, 21
Nagano-Madsen, Yasuko 103
Nikolaenkova, Olga 99
Nolan, Francis 29
Oppelstrup, Linda 51
Orfanidou, Ioanna 123
Schaeffler, Felix 1
Schötz, Susanne 87
Segerup, My 13
Seppänen, Tapio 119
Skantze, Gabriel 107
Strangert, Eva 79
Stölten, Katrin 91
Sundberg, Ulla 55
Themistocleous 99
Thorén, Bosse 115
Toivanen, Juhani 119
Tøndering, John 111
Väyrynen, Eero 119