RESEARCH ON SPOKEN LANGUAGE PROCESSING
Progress Report No. 21 (1996-1997)
Indiana University

Speech Perception

Richard Wright, Stefan Frisch, and David B. Pisoni
Speech Research Laboratory
Department of Psychology
Indiana University
Bloomington, Indiana 47405

This work was supported by NIH/NIDCD Training Grant DC00012 to Indiana University.


Speech Perception
Introduction
The study of speech perception is concerned with the process by which the human listener, as a participant in a communicative act, derives meaning from spoken utterances. Modern speech research began in the late 1940s, and the problems that researchers in speech perception have focused on have remained relatively unchanged since. They are: 1) variability in the physical signal and the search for acoustic invariants, 2) human perceptual constancy in the face of diverse physical stimulation, and 3) the neural representation of the speech signal. The goal of this chapter is to examine how these problems have been addressed by various theories of speech perception and to describe how basic assumptions about the nature of the problem have shaped the course of research. Due to the breadth of information to be covered, this chapter will not examine the specifics of experimental methodology or survey the empirical literature in the field. There have been many detailed reviews of speech perception which supply further background on these topics (e.g., Studdert-Kennedy, 1974, 1976; Darwin, 1976; Pisoni, 1978; Lively, Pisoni, & Goldinger, 1994; Klatt, 1989; Miller, 1990; Goldinger, Pisoni, & Luce, 1996; Nearey, 1997; Nygaard & Pisoni, 1995).

The process of speech perception may be limited to the auditory channel alone, as in the case of a telephone conversation. However, in everyday spoken language the visual channel is involved as well, and the study of multi-modal speech perception and spoken language processing is one of the central areas of current research. While stimulus variability, perceptual constancy, and neural representation are core problems in all areas of perception research, speech perception is unlike other perceptual processes because the perceiver also produces spoken language and therefore has intimate knowledge of the signal source. This relationship, combined with the high communicative load of speech, constrains the signal significantly and affects both perception and production strategies (Lieberman, 1963; Fowler & Housum, 1987; Lindblom, 1990). Speech perception is also unique in its remarkable robustness in the face of a wide range of environmental and communicative conditions. The listener's percept remains remarkably constant in the face of a significant amount of production-related variation in the signal. Furthermore, even in the worst of environmental conditions, in which large portions of the signal are distorted or masked, the spoken message is recovered with little or no error. As we shall see, part of this perceptual robustness derives from the richness and redundancy of information in the signal, part of it lies in the highly structured nature of language, and part comes from the context-dependent nature of spoken language.

Extracting meaning from the acoustic signal may at first glance seem like a relatively straightforward task. It would seem to be simply a matter of identifying the acoustically invariant characteristics in the frequency and time domains of the signal that correspond to the appropriate serially ordered linguistic units (i.e., reversing the encoding of those mental units by the production process). From those units the hearer can then retrieve the appropriate lexical entries from memory. Although stated rather simply here, this approach is based on an assumption about the process of speech perception that has been at the core of most symbolic processing approaches (Studdert-Kennedy, 1976).
That is, the process involves the segmentation of the signal into discrete and abstract linguistic units such as features, phonemes, or syllables. Before or during segmentation, the extra-linguistic information is segregated from the intended message and is processed separately or discarded. For this process to succeed, the spoken signal must meet two conditions. The first, known as the invariance condition, is that there is invariant information in the signal that is present in all instances that correspond to the perceived linguistic unit. The second, known as the linearity condition, is that the information in the signal is serially ordered, so that information about the first linguistic unit precedes and does not completely overlap or follow information about the next linguistic unit, and so forth.


It has become apparent to speech researchers over the last 40 years that the invariance and linearity conditions are almost never met in the actual speech signal (Liberman, 1957; Chomsky & Miller, 1963; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). This has led to several innovations that have achieved varying degrees of success in accommodating some of the variability and much of the nonlinearity inherent in the speech signal (Liberman, Cooper, Harris, & MacNeilage, 1963; Liberman & Mattingly, 1985; Blumstein & Stevens, 1980; Stevens & Blumstein, 1981). However, inter- and intra-talker variability remains an intractable problem within these conceptual/theoretical frameworks. Recent approaches that treat the signal holistically have proven to be promising alternatives. Much of the variability that researchers sought to strip away in traditional approaches contains important information about the talker and about the intended message. Recent approaches, while differing significantly in their view of perception, treat the signal as information rich. The information in the speech signal is both linguistic, the traditional message of the signal, and non-linguistic or indexical (Abercrombie, 1967; Ladefoged & Broadbent, 1957): information about the talker's immediate physical and emotional state, about the talker's relationship to the environment, the social context, and so on (Pisoni, 1996). Much of the variability and redundancy in the signal can be used to enhance the perceptual process rather than being discarded as noise (Klatt, 1976, 1989; Fowler, 1986; Goldinger, 1990; Johnson, 1997).

The Abstractionist/Symbolic Approach to Speech Perception


Traditional approaches to speech perception are based on ideas that originated in information theory and have treated the process of speech perception as distinct from word recognition, sentence understanding, and speaker recognition. In this view, the decoding of the speech signal into abstract symbolic units (i.e., features, phonemes, syllables) is the goal of speech perception, and the discrete units are then passed along to be used by higher level parsers that identify lexical items such as morphemes or words. Listeners are hypothesized to extract abstract, invariant properties of the acoustic signal to be matched to prototypical representations stored in long-term memory (Forster, 1976; McClelland & Elman, 1986; Oden & Massaro, 1978; Samuel, 1982; Kuhl, 1991; Nearey, 1992). In fact, most models of word recognition use either the phoneme or the syllable as the fundamental unit of processing (Marslen-Wilson & Welsh, 1978; Cutler & Norris, 1988; Norris & Cutler, 1995; Luce, 1986). These models implicitly assume some type of low-level recoding process.

The assumption that speech is perceived in abstract idealized units has led researchers to search for simple first-order physical invariants and to ignore the problem of stimulus variability in the listener's environment (e.g., Blumstein & Stevens, 1980; Sussman, McCaffrey, & Matthews, 1991). In this view, variability is treated as noise. This means that much of the talker-specific characteristics, or indexical information, that a listener uses to identify a particular talker, or a talker's state, is removed through a process of normalization, leaving behind the intended linguistic message (Studdert-Kennedy, 1974; a toy illustration of such a normalization operation follows below). In this view, normalization converts the physical signal to a set of abstract units that represent the linguistic message symbolically. The dissociation of form from content in speech perception has persisted in large part despite the fact that both sources of information are carried simultaneously and in parallel in the acoustic signal, and despite the potential gain that a listener may get from simultaneously receiving contextual information such as the rate of an utterance, or the gender, socioeconomic status, and mood of the talker. Following models of concept learning and memory (Jacoby & Brooks, 1984), this view of speech perception has been termed the abstractionist approach (Nygaard & Pisoni, 1995; Pisoni, 1997). As the abstractionist approach relies on a set of idealized linguistic units, it is useful to review the types of perceptual units that are commonly used and the motivations for abstract units in the first place.
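To give a concrete sense of what a normalization operation of this kind might look like, the sketch below z-scores formant values within a talker, a simple and well-known scheme often attributed to Lobanov (1971). It is offered only as an illustration of how talker-specific scaling could be factored out of the signal description, not as the procedure assumed by any of the models reviewed here; the function name and the sample formant values are invented for the example.

```python
import statistics

def lobanov_normalize(tokens):
    """Z-score F1 and F2 within one talker's vowel tokens.

    tokens: list of dicts with keys "vowel", "F1", "F2" (Hz), all from one talker.
    Returns the same tokens with each formant re-expressed in talker-relative
    standard-deviation units, so talkers with long and short vocal tracts
    become more directly comparable.
    """
    f1_values = [t["F1"] for t in tokens]
    f2_values = [t["F2"] for t in tokens]
    f1_mean, f1_sd = statistics.mean(f1_values), statistics.stdev(f1_values)
    f2_mean, f2_sd = statistics.mean(f2_values), statistics.stdev(f2_values)
    return [
        {
            "vowel": t["vowel"],
            "F1_norm": (t["F1"] - f1_mean) / f1_sd,
            "F2_norm": (t["F2"] - f2_mean) / f2_sd,
        }
        for t in tokens
    ]

# Hypothetical tokens from one talker (placeholder values, not measurements).
talker_a = [
    {"vowel": "i", "F1": 300, "F2": 2300},
    {"vowel": "a", "F1": 750, "F2": 1200},
    {"vowel": "u", "F1": 320, "F2": 900},
]
print(lobanov_normalize(talker_a))
```

Whether listeners actually discard the talker-specific detail removed by such a transformation is precisely what the episodic approaches discussed later in this chapter call into question.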


The use of abstract symbolic units in almost all traditional models of speech perception came about for several reasons, one being that linguistic theory has had a great impact on speech research. The abstract units that had been proposed as tools for describing patterns of spoken language, themselves a reflection of the influence of information theory on linguistics (Jakobson, Fant, & Halle, 1952), were adopted by many speech researchers (e.g., Liberman et al., 1957; Pisoni, 1978). This view can be summed up by a quote from Halle (1985):

when we learn a new word we practically never remember most of the salient acoustic properties that must have been present when the acoustic signal struck our ears; for example, we do not remember the voice quality, the speed of utterance, and other properties directly linked to the unique circumstances directly surrounding every utterance. (p. 101)

While linguistic theory has moved away from the phoneme as a unit of linguistic description to a temporally distributed featural or gestural array (Browman & Goldstein, 1990; Goldsmith, 1990; Steriade, 1993; Fowler, 1995; Frazier, 1995), many researchers in speech perception continue to use the phoneme as a unit of perception.

Another reason for the use of abstract units lies in the nature of the speech signal. Because of the way speech is produced in the vocal tract, the resulting acoustic signal is continuously changing, making all information in the signal highly variable and transient. This variability, combined with the constraints on auditory memory, led many researchers to assume that the analog signal must be rapidly recoded into discrete and progressively more abstract units (Broadbent, 1965; Liberman et al., 1967). This process achieves a large reduction of data that was thought to be redundant or extraneous into a few predefined and timeless dimensions. However, while reduction of redundancy potentially reduces the memory load, it increases the processing load and greatly increases the potential for an unrecoverable error on the part of the hearer (Miller, 1962; Klatt, 1979). Furthermore, there is evidence that much of the information in the signal that was deemed extraneous is encoded and stored by the memory system and subsequently used by the hearer in extracting meaning from the spoken signal (Peters, 1955; Creelman, 1957; Pisoni, 1990; Palmeri, Goldinger, & Pisoni, 1993).

An additional motivation for postulating abstract units comes from the phenomenon of perceptual constancy. Although there is substantial contextual variation in the acoustic signal, the hearer appears to perceive a single unit of sound. For example, a voiceless stop consonant such as /t/ at the beginning of a word, as in the word top, is accompanied by a brief puff of air at its release and a period of voicelessness in the following vowel, which together are generally referred to as aspiration. When that same stop is preceded by the fricative /s/, as in the word stop, the aspiration is largely absent. Yet the hearer perceives the two very different acoustic signals as being the same sound category /t/. This particular example of perceptual constancy may be explained in terms of the possible lexical contrasts of English. While there are lexical distinctions that are based on the voicing contrast, cat versus cad for example, there is no lexical distinction in English that is based on an aspiration contrast.
It should be remembered that contextual variation that is non-contrastive in one language is often the foundation of a lexical contrast in another language (Ladefoged & Maddieson, 1996). Rather than being hardwired into the brain at birth or being imposed on the hearer by transformations of the peripheral auditory system, these contrastive characteristics of a particular language must be learned. Thus, the complex process by which perceptual normalization takes place in a particular language is due almost entirely to perceptual learning and categorization.

Finally, segmenting the speech signal into units that are hierarchically organized permits a duality of patterning of sound and meaning (Hockett, 1960) that is thought to give language its communicative power. That is, smaller units such as phonemes may be combined according to language-specific phonotactic constraints into morphemes and words, and words may be organized according to grammatical constraints into sentences.


This means that with a small set of canonical sound units, and the possibility of recursion, the talker may produce and the hearer may decode and parse a virtually unbounded number of utterances in the language (see the worked example at the end of this section). There are many types of proposed abstract linguistic units that are related in a nested structure, with features at the terminal nodes and other types of units that may include phonemes, syllables, morphemes, words, syntactic phrases, and intonation phrases as branching nodes that dominate them (Jakobson, Fant, & Halle, 1952; Chomsky & Halle, 1968; Pierrehumbert & Beckman, 1988). Different approaches to speech perception employ different units and different assumptions about levels of processing. Yet, there is no evidence for the primacy of any particular unit in perception. In fact, the perceptual task itself may determine the units that hearers use to analyze the speech signal (Miller, 1962; Cutler, 1997).

There are many behavioral studies bearing on the units listeners use to segment the signal. For example, it has been found that reaction times of English listeners to phonotactically permissible CVC syllables (where all the sounds in isolation are permissible) were no faster than reaction times to phonotactically impermissible CV syllables, indicating that the syllable plays no role in English spoken language processing (Cutler, Mehler, Norris, & Segui, 1986; Norris & Cutler, 1988). However, the same experiments conducted with native speakers of French have found that listeners' response times are significantly more rapid to the phonotactically permissible CVC syllables than to the CV syllables, while responses of Japanese listeners to the CVC were significantly slower than to the CV syllables (Mehler, 1981; Mehler, Dommergues, Frauenfelder, & Segui, 1981; Segui, 1984). Taken together, findings of task-specific and language-specific biases in the preferred units of segmentation indicate that a particular processing unit is contingent on a number of factors. Moreover, there is a great deal of evidence that smaller units like the phoneme or syllable are perceptually contingent on larger units such as the word or phrase (Miller, 1962; Bever, Lackner, & Kirk, 1969; Ganong, 1980). This interdependence argues against the strict hierarchical view of speech perception in which the smallest units are extracted as a precursor to the next higher level of processing (Remez, 1987). Rather, listeners' responses appear to be sensitive to attentional demands, processing contingencies, and available information (Eimas & Nygaard, 1992; Remez, Rubin, Berns, Pardo, & Lang, 1994).
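To make the combinatorial point above concrete, consider a purely illustrative inventory (the figures are assumptions for the example, not counts taken from this chapter): a language with 20 consonants (C) and 10 vowels (V) that allows only CV and CVC syllables would have, before any phonotactic restrictions are applied,

$$
N_{\mathrm{CV}} = 20 \times 10 = 200, \qquad N_{\mathrm{CVC}} = 20 \times 10 \times 20 = 4000,
$$

or 4,200 possible syllables, and already about $4200^2 \approx 1.8 \times 10^7$ possible two-syllable sequences. Real inventories are smaller because phonotactic constraints rule out many of these combinations, but the number of possible forms still grows exponentially with length, which is the force of Hockett's observation.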

Basic Stimulus Properties


Understanding the nature of the stimulus is an important step in approaching the basic problems in speech perception. This section reviews some of the crucial findings that are relevant to models of speech perception. A large portion of the research on speech perception has been devoted to the investigation of speech cues, and some of the better known findings are discussed in four sub-sections: vowels, consonant place, consonant manner, and consonant voicing. In addition to the auditory channel, the visual channel is known to affect speech perception, and we discuss some of the key findings below. Because the speech signal is produced by largely overlapping articulatory gestures, information in the signal is distributed in an overlapping fashion. Non-linearity of information in the speech signal is reviewed and its implications for speech perception are touched upon. Finally, although it was largely ignored in the past, variability is arguably the most important issue in speech perception research. It is a problem that arises from many sources; some of the most important sources of variability in speech and their perceptual consequences are reviewed in the last part of this section.

Speech Cues

Since the advent of modern speech research at the end of the Second World War, much of the work on speech perception has focused on identifying aspects of the speech signal which contain the minimal information that is necessary to convey a speech contrast. These components of the signal are referred to as speech cues.

The assumption that a small set of acoustic features or attributes in the acoustic signal provides cues to linguistic contrasts was the motivation for the search for invariant cues, which was central to speech perception research from the mid-1950s up until the present time. The more the signal was explored for invariant acoustic cues, the more the problems of variability and non-linearity became evident. One solution to this problem was to study speech using highly controlled stimuli to minimize the variability. Stimuli for perception experiments were usually constructed using a single synthetic voice producing words in isolation, with a single contrast in one segmental context and syllable position. This approach produced empirical evidence about a very limited set of circumstances, thereby missing much of the systematic variability and redundancy of information that plays a part in speech perception. It also artificially removed talker-specific information and other extra-linguistic contextual information that was later shown to be used by listeners during speech perception. Despite these shortcomings, early work on speech perception has provided valuable empirical data that must be taken into consideration when evaluating the relative merits of current speech perception models.

The acoustic signal is produced by articulatory gestures that are continuous and overlapping to various degrees; thus, the resulting acoustic cues vary greatly with context, speaking rate, and talker. Contextual variation is a factor that contributes to redundancy in the signal. Although the early study of speech cues sought to identify a single primary cue to a particular linguistic contrast, it is improbable that the human perceptual system fails to take advantage of the redundancy of information in the signal. It should also be noted that many factors may contribute to the salience of a particular acoustic cue in the identification of words, syllables, or speech sounds. These include factors that are part of the signal, such as coarticulation or positional allophony, semantic factors such as the predictability of the word that the listener is trying to recover, and whether or not the listener has access to visual information generated by the talker. The extent to which a listener attends to particular information in the speech signal is also dependent on the particular system of sound contrasts in his or her language. For example, while speakers of English can use a lateral/rhotic contrast (carried primarily in F3) to distinguish words like light from words like right, many languages lack this subtle contrast. Speakers of those languages (e.g., Japanese) have great difficulty attending to the relevant speech cues when trying to distinguish /l/ from /r/. Thus, the speech cues that are discussed in this section should not be considered invariant or universal; instead, they are context sensitive and highly interactive.

Vocalic Contrasts

The vocal tract acts as a time-varying filter with resonant properties that transform the frequency spectra of the sound sources generated in the vocal tract (Fant, 1960; Flanagan, 1972; Stevens & House, 1955). Movements of the tongue, lips, and jaw cause changes in the resonating characteristics of the vocal tract that are important in distinguishing one vowel from another.
Vowel distinctions are generally thought to be based in part on the relative spacing of the fundamental frequency (f0) and the first three vocal tract resonances, or formants (F1, F2, F3) (Syrdal & Gopal, 1986). In general, there is an inverse relationship between the degree of constriction (vowel height) and the height of the first formant (Fant, 1960). That is, as the degree of constriction in the vocal tract increases, increasing the vowel height, F1 lowers in frequency. The second formant is generally correlated with the backness of the vowel: the further back in the vocal tract the vowel constriction is made, the lower the second formant (F2). F2 is also lowered by lip rounding and protrusion. Thus, the formant frequencies of vowels and vowel-like sounds are produced by changes in the length and shape of the resonant cavities of the vocal tract above the laryngeal sound source. In very clear speech, vowels contain steady-state portions where the relative spacing between the formants remains fixed and the f0 remains relatively constant. Words spoken with care and speech samples from read sentences may contain steady-state vowels.
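As a point of reference for where formant values come from, a standard textbook idealization (not an analysis from this chapter) models the vocal tract as a uniform tube of length L, closed at the glottis and open at the lips; such a tube resonates at odd quarter-wavelength frequencies:

$$
F_n = \frac{(2n - 1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots
$$

With a speed of sound c of roughly 350 m/s and L of roughly 17.5 cm, this gives F1 ≈ 500 Hz, F2 ≈ 1500 Hz, and F3 ≈ 2500 Hz, close to the formants of a neutral, schwa-like vowel. Constrictions, lip rounding, and changes in tract length perturb these resonances, which is the source of the height and backness effects just described.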


Early work on vowel perception that was modeled on this form of carefully articulated speech found that the primary cue for vowel perception was the steady-state formant values (Gerstman, 1968; Skinner, 1977). Figure 1 is a spectrogram illustrating the formant structure of five carefully pronounced nonsense words with representative vowels in dVd contexts: /did/ (sounds like deed), /ded/ (sounds like dayed), /dad/ (sounds like dodd), /dod/ (sounds like doad), and /dud/ (sounds like dood). The first two formants, the lowest two dark bars, are clearly present and have relatively steady-state portions near the center of each word. The words /did/ and /dad/ have the clearest steady-state portions.

-------------------------------
Insert Figure 1 about here
-------------------------------

In naturally spoken language, however, formants rarely achieve a steady state, and vowels are usually flanked by other speech sounds which shape the formant structure into a dynamic, time-varying pattern. For the same reasons, vowels often fall short of the formant values observed in careful speech, resulting in undershoot (Fant, 1960; Stevens & House, 1963). The degree of undershoot is a complex function of the flanking articulations, speaking rate, prosody, sentence structure, dialect, and individual speaking style (Lindblom, 1963; Gay, 1978). These observations have led researchers to question the assumption that vowel perception relies on the perception of steady-state formant relationships, and subsequent experimentation has revealed that dynamic spectral information in the formant transitions into and out of the vowel is sufficient to identify vowels even in the absence of any steady-state information (Lindblom & Studdert-Kennedy, 1967; Strange, Jenkins, & Johnson, 1983).

Although they are not found in English, and are not well studied in the perception literature, there are many secondary vocalic contrasts in the world's languages in addition to vowel height and backness contrasts. A few of the more common ones are briefly described here; for a complete review see Ladefoged and Maddieson (1996). A secondary contrast may effectively double or, in concert with other secondary contrasts, triple or quadruple the vowel inventory of a language. The most common type of secondary contrast, found in 20% of the world's languages, is nasalization (Maddieson, 1984). Nasalization in speech is marked by a partial attenuation of energy in the higher frequencies, a broadening of formant bandwidths, and an additional weak nasal formant around 300 Hz (Fant, 1960; Fujimura, 1962). Vowel length contrasts are also commonly observed and are found in such diverse languages as Estonian (Lehiste, 1970), Thai (Hudak, 1987), and Japanese (McCawley, 1968). Source characteristics, changes in the vibration characteristics of the vocal folds, may also serve as secondary contrasts. These include creaky and breathy vowels. In a creaky vowel, the vocal fold vibration is characterized by a smaller open-to-closed ratio, resulting in more energy in the harmonics of the first and second formants, narrower formant bandwidths, and often more jitter (irregular vocal cord pulse rate) (Ladefoged, Maddieson, & Jackson, 1988). In breathy vowels, the vocal fold vibration is characterized by a greater open-to-closed ratio, resulting in more energy in the fundamental frequency, broader formant bandwidths, and often more random energy (noise component) (Ladefoged et al., 1988).
Although they are poorly understood, secondary contrasts appear to interact in complex ways with perceived vowel formant structure and with the perception of voice pitch (Silverman, 1997).
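The sketch below is a deliberately minimal caricature of the steady-state view discussed above: it labels a vowel by finding the nearest reference (F1, F2) pair. The reference values are rough, invented approximations for a handful of English-like vowels, used for illustration only, and the approach ignores the dynamic, talker-dependent, and secondary-contrast information emphasized in this section.

```python
# Illustrative steady-state vowel classifier; reference values are rough
# placeholders, not measured data.
REFERENCE_FORMANTS = {
    "i": (300, 2300),   # high front
    "e": (450, 2000),   # mid front
    "a": (750, 1200),   # low
    "o": (500, 900),    # mid back, rounded
    "u": (320, 850),    # high back, rounded
}

def classify_vowel(f1, f2):
    """Return the reference vowel whose (F1, F2) pair is closest in Hz."""
    def distance(ref):
        ref_f1, ref_f2 = REFERENCE_FORMANTS[ref]
        return (f1 - ref_f1) ** 2 + (f2 - ref_f2) ** 2
    return min(REFERENCE_FORMANTS, key=distance)

print(classify_vowel(700, 1250))  # -> "a" under these placeholder values
```

On the evidence reviewed above, a real listener (or a useful recognizer) would also need formant dynamics, duration, f0, and talker information, not just a single steady-state slice.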


[Figure 1 appears here: a spectrogram of /did/ "deed", /ded/ "dayed", /dad/ "dod", /dod/ "doed", and /dud/ "dood".]

Figure 1. A spectrogram illustrating the formant structure of five representative vowels in dVd contexts. The lowest two dark bands are the first and second formants.


Consonant Place of Articulation

There are several potential sources of cues to the place of articulation of a consonant, including second formant transitions, stop release bursts, nasal pole-zero patterns, and the generation of fricative noise. The strongest place of articulation cues are found in the brief transitional period between a consonant and an adjacent vowel. Some speech cues are internal, as is the case in fricatives such as /s/ or nasals such as /n/. Other speech cues are distributed over an entire syllable, as in vowel coloring by laterals such as /l/, rhotics such as /r/, and retroflex consonants such as those found in Malayalam and Hindi. We briefly review several of the most important cues to place of articulation here.

Formant Transitions

The second formant (F2), and to a lesser degree the third formant (F3), provide the listener with perceptual cues to the place of articulation of consonants with oral constrictions, particularly the stops, affricates, nasals, and fricatives (Delattre, Liberman, & Cooper, 1955). Transitions are the deformation of the vowel's formants resulting from the closure or aperture phase of a consonant's articulation, i.e., a rapid change in the resonating cavity, overlapping with the relatively open articulation of a flanking vowel. Because they are the result of very fast movements of the articulators from one position to another, formant transitions are transient and dynamic, with the speed of the transitions depending on the manner, the place of articulation (to a lesser degree), and such factors as the individual talker's motor coordination, the speaking rate and style, and the novelty of the utterance. Unlike other consonants, glides and liquids have clear formant structure throughout their durations. Glides are distinguished from each other by the distance between the first and second formant values at the peak of constriction, whereas the English /l/ is distinguished from /r/ by the relative frequency of the third formant (O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957). English /l/ and /r/ cause vowel coloring, a change in the formant structure of the adjacent vowels, particularly the preceding one, that may last for much of the vowel's duration.

Both the transitions into and out of the period of consonant constriction provide place cues for consonants that are between vowels. In other positions, there is at most only a single set of formant transitions to cue place: the formant transitions out of the consonant constriction (C to V) in word onset and post-consonantal positions, and the formant transitions into the consonant constriction (V to C) in word final and pre-consonantal positions. For stops in the VC position with no audible release, formant transitions may provide the only place cues. Following the release of voiceless stops, there is a brief period of voicelessness during which energy in the formants is weakened. Following the release of aspirated stops, a longer portion or all of the transition may be present in a much weaker form in the aspiration noise. It is widely thought that CV formant transitions provide more salient information about place than VC transitions (see Wright, 1996, Ch. 2 for a discussion). When formant transitions into a stop closure (V to C) conflict with the transitions out of the closure (C to V), listeners identify the stop as having the place of articulation that corresponds with the C to V transitions (Fujimura, Macchi, & Streeter, 1978).
The relative prominence of CV transitions over VC transitions is also influenced by the language of the listener. For example, native Japanese-speaking listeners have been shown to be very poor at distinguishing place from VC transitions alone, while native Dutch and English speakers are good at distinguishing place with VC transitions (van Wieringen, 1995). In this case, the difference in performance can be attributed to differences in syllable structure between the languages. While English and Dutch allow post-vocalic stops with contrasting place of articulation (e.g., actor or bad), Japanese does not; experience with Japanese syllable structure has biased Japanese speakers towards relying more on the CV transitions than on the VC transitions.


Fricative Noise

Fricatives are characterized by a narrow constriction in the vocal tract that results in turbulent noise either at the place of the constriction or at an obstruction downstream from the constriction (Shadle, 1985). Frication noise is aperiodic with a relatively long duration. Its spectrum is shaped primarily by the cavity in front of the noise source (Heinz & Stevens, 1961). The spectrum of the frication noise is sufficient for listeners to reliably recover the place of articulation in sibilant fricatives such as /s/ and /z/. However, in other fricatives with lower amplitude and more diffuse spectra, such as /f/ and /v/, the F2 transition has been found to be necessary for listeners to reliably distinguish place of articulation (Harris, 1958). Of these, the voiced fricatives, as in the words that and vat, are the least reliably distinguished (Miller & Nicely, 1955). It should be noted that this labio-dental versus inter-dental contrast in fricatives in English is very rare in the world's languages (Maddieson, 1984). The intensity of frication noise and the degree of front cavity shaping is expected to affect the relative importance of the fricative noise as a source of information for other fricatives as well. Because fricatives have continuous noise that is shaped by the cavity in front of the constriction, they also convey information about adjacent consonants in a fashion that is similar to vowels. Overlap with other consonant constrictions results in changes in the spectral shape of a portion of the frication noise, most markedly when the constriction is in front of the noise source. The offset frequency of the fricative spectrum in fricative-stop clusters serves as a cue to the place of articulation of the stop (Bailey & Summerfield, 1980; Repp & Mann, 1981).

Stop Release Bursts

In oral stop articulations there is complete occlusion of the vocal tract and a resulting buildup of pressure behind the closure. The sudden movement away from complete stricture results in brief, high-amplitude noise known as the release burst or release transient. Release bursts are aperiodic, with a duration of approximately 5-10 ms. The burst's duration depends on both the place of articulation of the stop and the quality of the following vowel: velar stop releases (/k/ and /g/) are longer and noisier than those of labial and dental stops, and both dental and velar stops show increased noisiness and duration of release before high vowels. Release bursts have been shown to play an important role in the perception of place of articulation of stop consonants (e.g., Liberman et al., 1955; Blumstein, 1981; Dorman, Studdert-Kennedy, & Raphael, 1977; Kewley-Port, 1983a). Although the release burst or the formant transitions alone are sufficient cues to place, the formant transitions have been shown to dominate place perception; i.e., if the release burst spectrum and the F2 transition provide conflicting place cues, listeners perceive place according to the F2 transition (Walley & Carrell, 1983). Listeners show the greatest reliance on the transition in identifying velar place in stops (Kewley-Port, Pisoni, & Studdert-Kennedy, 1983). Although less studied as a source of cues, there are many other subtler place-dependent differences among stops that are a potential source of information to the listener. For example, velar stops (/k/ and /g/) tend to have shorter closure durations than labial stops (/p/ and /b/), and amplitude differences may help in distinguishing among fricatives.
An additional class of sounds known as affricates are similar in some respects to stops and in others to fricatives; they have a stop portion followed by a release into a fricative portion. In their stop portion, they have a complete closure, a buildup of pressure, and a resulting burst at release. The release is followed by a period of frication longer than stop aspiration but shorter than a full fricative. Both the burst and the frication provide place cues. In English, all affricates are palato-alveolar, but there is a voicing contrast (chug versus jug). The palato-alveolar affricate found in English is the most common; 45 percent of the world's languages have it (Maddieson, 1984). However, many other places of articulation are common, and many languages have an affricate place contrast (e.g., /pf/ versus /ts/ in German).

Nasal Cues

Like the oral stops, nasal consonants have an oral constriction that results in formant transitions in the adjacent vowels. In addition, nasals show a marked weakening in the upper formants due to an antiresonance (zero) and a low-frequency resonance (pole) below 500 Hz. The nasal pole-zero pattern serves as a place cue (Kurowski & Blumstein, 1984). This cue is most reliable in distinguishing /n/ and /m/, and less so for other nasals (House, 1957). Listeners identify the place of articulation more reliably from external formant transitions than from the internal nasal portion of the signal (Malécot, 1956).

Figure 2 schematically illustrates some of the most frequently cited cues to consonant place of articulation for three types of consonants: a voiceless alveolar stop /t/, a voiceless alveolar fricative /s/, and an alveolar nasal /n/. The horizontal bars represent the first three formants of the vowel /a/, the deformation of the bars represents formant transitions, and the hatched areas represent fricative and stop release noises. Table 1 summarizes the consonant place cues discussed above. Although it lists some of the more frequently discussed cues, the table should not be seen as exhaustive, as there are many secondary and contextual cues that contribute to a consonant percept that are not listed here.

-------------------------------
Insert Figure 2 about here
-------------------------------

Table 1
Summary of Commonly Cited Place Cues

Cue                           Applies to                                   Distribution
F2 transition                 all                                          VC, CV transitions
burst spectrum                stops                                        C-release
frication spectrum            fricatives, affricates (esp. sibilants)      internal
frication amplitude           fricatives                                   internal
nasal pole, zero              nasals                                       internal
fricative noise transition    stops                                        fricative edge
F3 height                     liquids and glides                           internal

Consonants of all types have much narrower constrictions than vowels. They can be viewed as the layering of a series of rapid constricting movements onto a series of slower-moving transitions from one vowel to the next (Browman & Goldstein, 1990). For all types of consonants, the changes in the vowel's formants that result from the influence of the consonant's narrower constriction are the most robust types of cues. However, we have seen that there are a number of other sources of information about the place of articulation of a consonant that the listener may use in identifying consonants. This chapter has touched on a few of the better known, such as stop release bursts, nasal pole-zero patterns, and fricative noise. These are often referred to as secondary cues because perceptual tests have shown that when they are paired with formant transitions that provide conflicting information about the consonant place of articulation, the perceived place is that appropriate for the formant transitions.


[Figure 2 appears here: schematic panels labeled stop release burst, fricative noise, F2 transitions, and nasal pole and zero, over bars marking F1, F2, and F3.]

Figure 2. Schematic illustration of place cues in three VCV sequences where V is the vowel /a/ and C is a voiceless alveolar stop, a voiceless alveolar fricative, and an alveolar nasal.


However, depending on the listening conditions, the linguistic context, and the perceptual task, these so-called secondary cues may serve as the primary source of information about a consonant's place of articulation. For example, in word initial fricative-stop clusters, the fricative noise may provide the sole source of information about the fricative's place of articulation. While in English only /s/ appears in word initial fricative-stop clusters, there are many languages which contrast fricative place in such clusters, and many which have stop-stop clusters or nasal-stop clusters as well (see Wright, 1996 for a description of a language that has all three types of clusters). Thus, it is likely that phonotactic constraints, position within the sentence, position within the word, position within the syllable, background noise, and so on must be taken into consideration before a relative prominence or salience is assigned to any particular acoustic cue in the signal.

Consonant Manner Contrasts

All oral constrictions result in an attenuation of the signal, particularly in the higher frequencies. The relative degree of attenuation is a strong cue to the manner of a consonant. An abrupt attenuation of the signal in all frequencies is a cue to the presence of a stop. Insertion of a period of silence in a signal, either between vowels or between a fricative and a vowel, can result in the listener perceiving a stop (Bailey & Summerfield, 1980). A complete attenuation of the harmonic signal together with fricative noise provides the listener with cues to the presence of a fricative. A less severe drop in amplitude accompanied by nasal murmur and a nasal pole and zero are cues to the presence of a nasal (Hawkins & Stevens, 1985). Nasalization of the preceding vowel provides look-ahead cues to post-vocalic nasal consonants (Ali, Gallagher, Goldstein, & Daniloff, 1971; Hawkins & Stevens, 1985). Glides and liquids maintain formant structure throughout their peak of stricture, but both attenuate the signal more than vowels. Glides are additionally differentiated from other consonants by the relative gradualness of the transitions into and out of the peak of stricture. Lengthening the duration of synthesized formant transitions has been shown to change the listener's percept of manner from stop to glide (Liberman, Delattre, Gerstman, & Cooper, 1956). A similar cue is found in the amplitude envelope at the point of transition between consonant and vowel: stops have the most abrupt and glides the most gradual amplitude rise time (Shinn & Blumstein, 1984). Manner cues in general tend to be more robust than place cues because they result in more salient changes in the signal, although distinguishing stop from fricative manner is less reliable with the weaker fricatives (Miller & Nicely, 1955).

Figure 3 schematically illustrates some of the most frequently cited cues to consonant manner of articulation for three types of consonants: a voiceless alveolar stop /t/, a voiceless alveolar fricative /s/, and an alveolar nasal /n/. The horizontal bars represent the first three formants of the vowel /a/, the deformation of the bars represents formant transitions, and the hatched areas represent fricative and stop release noises. Table 2 summarizes the consonant manner cues discussed above. Again, the table should not be seen as exhaustive, as there are many secondary and contextual cues that contribute to a consonant percept that are not listed here.
-------------------------------
Insert Figure 3 about here
-------------------------------


[Figure 3 appears here: schematic panels labeled stop release burst, slope of formant transitions, nasalization of vowel, abruptness and degree of attenuation, fricative noise, and nasal pole and zero, over bars marking F1, F2, and F3.]

Figure 3. Schematic illustration of manner cues in three VCV sequences where V is the vowel /a/ and C is a voiceless alveolar stop, a voiceless alveolar fricative, and an alveolar nasal.


Table 2
Summary of Commonly Cited Manner Cues

Cue                      Applies to                      Distribution
silence/near silence     stops, affricates               internal
frication noise          fricatives, affricates          internal
nasal pole & zero        nasals                          internal
vowel nasalization       nasals                          adjacent vowel
formant structure        liquids, glides (vowels)        internal
release burst            stops                           C-release
noise duration           stop, affricate, fricative      internal
noise onset rise-time    stop/affricate, fricative       internal
transition duration      stop, glide                     VC, CV transitions

Cues to Voicing Contrasts

Vocal fold vibration, resulting in periodicity in the signal, is the primary cue to voicing; however, tight oral constriction inhibits the airflow necessary for vocal fold vibration. In English and many other languages, voiced obstruents, especially stops, may have little or no vocal fold activity. This is more common for stops in syllable final position. In this situation, the listener must rely on other cues to voicing. There are several other important cues, such as Voice Onset Time (VOT), the presence and amplitude of aspiration noise, and durational cues.

For syllable initial stops in word onset position, the primary cue appears to be VOT. This is not really a single cue in the traditional sense but a dynamic complex that includes the time between the release burst and the onset of vocal fold vibration together with aspiration noise, i.e., low-amplitude noise with spectral peaks in the regions of the following vowel's formants (Lisker & Abramson, 1964). VOT appears to be important even in languages like French that maintain voicing during stop closure (van Dommelen, 1983). The relationship between VOT and voicing is, in part, dependent on how contrasts are realized in a particular language. For example, for the same synthetic VOT continuum, Spanish and English speakers have different category boundaries despite the fact that both languages have a single voiced-voiceless contrast (Lisker & Abramson, 1970). In Thai, there are two boundaries, one similar to English and one similar to Spanish, because there is a three-way voiced-voiceless-aspirated contrast in the language (Lisker & Abramson, 1970). Generally, a short or negative VOT is a cue to voicing, a long VOT is a cue to voicelessness, and a very long VOT is a cue to aspiration (in languages with an aspiration contrast). For English, and presumably other languages, the relative amplitude and the presence or absence of aspiration noise is a contributing cue to voicing for word initial stops (Repp, 1979). An additional cue to voicing in syllable onset stops is the relative amplitude of the release burst: a low amplitude burst cues voiced stops while a high amplitude burst cues voiceless stops (Repp, 1977). The duration and spectral properties of the preceding vowel also provide cues to voicing in post-vocalic stops and fricatives (Soli, 1982). When the vowel is short, with a shorter steady state relative to its offset transitions, voicelessness is perceived. The duration of the consonant stricture is also a cue to both fricative and stop voicing: a longer duration cues voicelessness (Massaro & Cohen, 1983).
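The sketch below illustrates the language dependence of the VOT-to-category mapping described above. The boundary values are rough placeholders in the spirit of Lisker and Abramson's cross-language comparisons, not values reported in this chapter, and real perception integrates VOT with the other cues listed here rather than applying a hard threshold.

```python
# Illustrative, language-dependent mapping from VOT (ms) to a voicing category.
# Boundary values are rough placeholders for demonstration purposes only.
VOT_BOUNDARIES = {
    # language: list of (upper_bound_ms, category), checked in order
    "english": [(25, "voiced"), (float("inf"), "voiceless")],
    "spanish": [(0, "voiced"), (float("inf"), "voiceless")],
    "thai":    [(0, "voiced"), (35, "voiceless unaspirated"),
                (float("inf"), "voiceless aspirated")],
}

def classify_vot(vot_ms, language):
    """Assign a voicing category to a stop token from its VOT, per language."""
    for upper_bound, category in VOT_BOUNDARIES[language]:
        if vot_ms <= upper_bound:
            return category
    raise ValueError("no category found")

# The same 15 ms token is labeled differently by English- and Spanish-like systems.
print(classify_vot(15, "english"))  # -> "voiced"
print(classify_vot(15, "spanish"))  # -> "voiceless"
```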


Figure 4 schematically illustrates some of the most frequently cited cues to consonant voicing for two types of consonants: a voiceless alveolar stop /t/ and a voiced alveolar fricative /z/. The horizontal bars represent the first three formants of the vowel /a/, the deformation of the bars represents formant transitions, the hatched areas represent fricative and stop release noises, and the dark bar at the base of the /z/ represents the voicing bar. Table 3 summarizes the consonant voicing cues discussed above. Again, the table should not be seen as exhaustive, as there are many secondary and contextual cues that contribute to a consonant percept that are not listed here.

-------------------------------
Insert Figure 4 about here
-------------------------------

Table 3
Summary of Commonly Cited Voicing Cues

Cue                     Applies to                       Distribution
periodicity             stops, affricates, fricatives    internal
VOT                     stops                            CV transition
consonant duration      stops, fricatives                internal
release amplitude       stops                            C-release
preceding V duration    obstruents                       preceding vowel

Visual Information: Multi-modal Speech Perception

Much of the research on speech perception focuses on the acoustic channel alone. In part, the concentration on auditory perception is related to the fact that the acoustic signal is richer in information about spoken language than the visual signal. However, the visual signal may have a large impact on the perception of the auditory signal under degraded conditions. When a hearer can see a talker's face, the gain in speech intelligibility in a noisy environment is equivalent to a 15 dB gain in the acoustic signal alone (Sumby & Pollack, 1954). This is a dramatic difference, superior to that of even the most sophisticated hearing aids. The relative importance of the visual signal increases as the auditory channel is degraded through noise, distortion, filtering, hearing loss, and potentially through unfamiliarity with a particular talker, stimulus set, or listening condition.

When information in the visual channel is in disagreement with the information in the auditory channel, the visual channel may change or even override the percept of the auditory channel alone. McGurk and MacDonald (1976) produced stunning evidence, now known as the McGurk effect, of the strength of the visual signal in the perception of speech in an experiment that has since been replicated under a variety of conditions. They prepared a videotape of a talker producing two-syllable utterances with the same vowel but varying in the onset consonants, such as baba, mama, or tata. The audio and video channels were separated, and the audio tracks of one utterance were dubbed onto the video tracks of different utterances. With their eyes open, subjects' perceptions were strongly influenced by the video channel. For example, when presented with a video of a talker saying tata together with the audio of the utterance mama, the subjects perceived nana. But with their eyes closed, subjects perceived mama. This effect of cross-modal integration is strong and immediate; there is no hesitation or contemplation on the part of the subjects, who are completely unaware of the conflict between the two channels.


[Figure 4 appears here: schematic panels labeled release burst amplitude, vowel duration, aspiration noise, VOT, stricture duration, and periodicity, over bars marking F1, F2, and F3.]

Figure 4. Schematic illustration of the voicing cues in two VCV sequences where V is the vowel /a/ and C is a voiceless alveolar stop /t/ and a voiced alveolar fricative /z/.


The McGurk effect is considered by many theorists as evidence that auditory and visual integration occurs at a low level because of its automatic nature. It also reflects limitations of the information which can be obtained through the visual channel. Many aspects of the speech production process are hidden from view. These include voicing, nasalization, and many vowel and consonant contrasts.

Non-Linearity of the Speech Signal

As a result of the way in which speech is produced, much of the information in the signal is distributed, overlapping, and contextually varying. In producing speech, the articulatory organs of the human vocal tract move continuously with sets of complex gestures that are partially or wholly coextensive and covarying (see the chapter on speech production, this volume). The resulting acoustic and visual signals are continuous, and the information that can be identified with a particular linguistic unit shows a high degree of overlap and covariance with information about adjacent units (Delattre, Liberman, & Cooper, 1955; Liberman, Delattre, Cooper, & Gerstman, 1954). This is not to say that segmentation of the signal is impossible; acoustic analysis reveals portions of the signal that can act as reliable acoustic markers for points at which the influence of one segment ends or begins (Fant, 1962). However, the number of segments determined in this way and their acoustic characteristics are themselves highly dependent on the context (Fant, 1962, 1986).

As noted earlier, because of the distributed and overlapping nature of phonetic/linguistic information, the speech signal fails to meet the linearity condition (Chomsky & Miller, 1963). This poses great problems for phoneme-based speech recognizers and, if discrete units do play a part in perception, these problems should also arise for the human listener. Yet, the listener appears to segment the signal into discrete and linear units such as words, syllables, and phonemes with little effort. In the act of writing, much of the world's population can translate a heard or internally generated signal into word-like units, syllable-like units, or phoneme-like units. Although this is often cited as an argument for a segmentation process in speech perception, the relation between the discrete representation of speech seen in the world's writing systems and the continuous signal is complex and may play little role in the perceptual process (Pierrehumbert & Pierrehumbert, 1993). Segmentation may be imposed on an utterance after the perceptual process has been completed. It is not clear that a signal of discrete units would be preferable; the distributed nature of information in the signal contributes to robustness by providing redundant look-ahead and look-back information. Reducing speech to discrete segments could result in a system that, unlike human speech perception, cannot recover gracefully from errorful labeling (Fowler & Smith, 1986; Klatt, 1979, 1990; Pisoni, 1997).

Informational Burden of Consonants and Vowels

From an abstract phonemic point of view, consonant phonemes bear a much greater informational burden than vowel phonemes. That is, there are far more consonant phonemes in English than there are vowel phonemes, and English syllables permit more consonant phonemes per syllable than vowel phonemes. Thus, many more lexical contrasts depend on differences in consonant phonemes than on differences in vowel phonemes.
However, the complex, overlapping, and redundant nature of the speech signal means that this simple information-theoretic analysis fails in its predictions about the relative importance of the consonant and vowel portions of the signal in speech perception. The importance of vowels is due to the gross differences in the ways consonants and vowels are produced by the vocal tract. Consonants are produced with a complete or partial occlusion of the vocal tract, causing a rapid attenuation of the signal, particularly in the higher frequencies. In the case of oral stops, all but the lowest frequencies (which can emanate through the fleshy walls of the vocal tract) are absent from the signal. In contrast, vowels are produced with a relatively open vocal tract; therefore, there is little overall attenuation, and formant transitions are saliently present in the signal (Fant, 1960).


This dichotomy means that the vowels are more robust in noise and that the vowel portions of the signal carry more information about the identity of the consonant phonemes than the consonant portions of the signal carry about the vowel phonemes. Although they are partially the result of articulator movement associated with consonants, the formant transitions are considered part of the vowel because of their acoustic characteristics. Generally speaking, the transitions have a relatively high intensity and long duration compared to other types of consonantal cues in the signal. The intensity, duration, and periodic structure of the transitions make them more resistant to many types of environmental masking than release bursts, nasal pole-zero patterns, or frication noise. Formant transitions bear a dual burden of simultaneously carrying information about both consonant and vowel phonemes. In addition, information about whether or not a consonant phoneme is a nasal, a lateral, or a rhotic is carried in the vowel more effectively than during the consonant portion of the signal. Figure 5 is a speech spectrogram of the word formant illustrating the informational burden of the vowel. What little information consonants carry about the flanking vowel phonemes is found in portions of the signal that are low intensity, aperiodic, or transient. Therefore, it is more easily masked by environmental noise.

-------------------------------
Insert Figure 5 about here
-------------------------------

It is well known that spoken utterances are made up of more than the segmental distinctions represented by consonant and vowel phonemes. Languages like English rely on lexical stress to distinguish words. The majority of the world's languages have some form of tone contrast, whether fixed on a single syllable as in the Chinese languages, or mobile across several syllables as in Kikuyu (Clements, 1984). Pitch-accent, like that seen in Japanese, is another form of tone-based lexical distinction. Tone and pitch-accent are characterized by changes in voice pitch (fundamental frequency) and in some cases by changes in voice quality such as creakiness or breathiness (as in Vietnamese, Nguyen, 1987). These types of changes in the source function are carried most saliently during the vowel portions of the signal. Although stress affects both consonants and vowels, it is marked most clearly by changes in vowel length, vowel formants, and fundamental frequency excursions. Prosodic information is carried by both consonants and vowels; however, much of it takes the form of pitch, vowel length, and quality changes. Thus, despite the relatively large informational burden that consonant phonemes bear, the portions of the physical signal that are identified with vowels carry much more of the acoustic information, in a more robust fashion, than the portions of the signal associated with the consonants.

Invariance and Variability

In addition to violating the linearity condition, the speech signal is characterized by a high degree of variability, violating the invariance condition. There are many sources of variability that may be interrelated or independent. Variability can be broken into two broad categories: 1) production related and 2) production independent.
It is worth noting that while production related variability is complex, it is lawful and is a potentially rich source of information both about the intended meaning of an utterance and about the talker. Production independent variability derives from such factors as environmental noise or reverberation and may provide the listener with information about the environmental conditions surrounding the conversation; it can be seen as random in its relation to the linguistic meaning of the utterance and to the talker. Understanding how the perceptual process deals with these different types of variability is one of the most important issues in speech perception research.



Figure 5. A spectrogram of the word formant illustrating the information that adjacent vowels carry about rhotics (/r/-coloring) and nasals (nasalization).


In traditional symbol-processing approaches that treat variation as noise, listeners are thought to compensate for differences through a process of perceptual normalization in which linguistic units are perceived relative to the context, e.g., the prevailing rate of speech (e.g., Miller, 1981, 1987; Summerfield, 1981) or the dimensions of the talker's vocal tract (e.g., Joos, 1948; Ladefoged & Broadbent, 1957; Summerfield & Haggard, 1973). Alternative non-analytic approaches to speech perception that are based on episodic memory (Goldinger, 1997; Johnson, 1997; Pisoni, 1997) propose that speech is encoded in a way that preserves the fine details of production related variability. While these approaches may use some types of variability in the speech perception process, little has been said about production-independent variability. The following section, while not exhaustive, is a sampling of some well known sources of variability and their impact on speech perception (for a more detailed discussion see Klatt, 1975, 1976, 1979). Production related variability in speech applies both across talkers, as a result of physiological, dialectal, and socioeconomic factors, and within a talker from one utterance to the next, as a result of factors such as coarticulation, rate, prosody, emotional state, level of background noise, distance between talker and hearer, and semantic properties of the utterance. What follows is a review of some of the most important sources of variability in speech and their effects on perceptual processes.

Coarticulation

The most studied source of within-talker variability, coarticulation, is one source of nonlinearity in speech. In the production of speech, the gestures in the vocal tract are partially or wholly overlapping in time, resulting in an acoustic signal in which there is considerable contextual variation (Delattre, Liberman, & Cooper, 1955; Liberman, 1957; Liberman, Delattre, Cooper, & Gerstman, 1954). The degree to which any one speech gesture affects or is affected by other gestures depends on the movements of the articulators and the degree of its constriction, as well as factors such as rate of speech and prosodic position. Although coarticulation is often described as a universal physiological aspect of speech, there is evidence for talker specific variation in the production and timing of speech gestures and in the resulting characteristics of coarticulation (Johnson, Ladefoged, & Lindau, 1993; Kuehn & Moll, 1976; Stevens, 1972). The perceptual problems introduced by coarticulatory variation became apparent early in the search for invariant speech cues. Because of coarticulation, there is a complex relationship between acoustic information and phonetic distinctions. In one context, an acoustic pattern may give rise to one percept, while in another context the same acoustic pattern may give rise to a different percept (Liberman et al., 1954). At the same time, many different acoustic patterns may cue a single percept (Hagiwara, 1995).

Speaking Rate

Changes in speech rate are reflected in changes in the number and duration of pauses, in durational changes of vowels and some consonants, and in deletions and reductions of some of the acoustic properties that are associated with particular linguistic units (J.L. Miller, Grosjean, & Lomanto, 1984). For example, changes in VOT and the relative duration of transitions and vowel steady-states occur with changes in speaking rate (J.L. Miller & Baer, 1983; J.L. Miller, Green, & Reeves, 1986; Summerfield, 1975).
There is now a large body of research on the consequences of rate based variability on the perception of phonemes. These findings demonstrate that listeners are sensitive to rate based changes that are internal or external to the target word. The importance of token-internal rate sensitivity was demonstrated by J.L. Miller and Liberman (1979). Listeners were presented with a synthetic /ba/ - /wa/ continuum that varied the duration of the formant transitions and the duration of the vowel. The results showed that the crossover point between /b/ and /w/ was dependent on the ratio of the formant transition duration to the vowel duration: the longer the vowel, the longer the


formant transitions had to be to produce the /wa/ percept. The importance of token-external rate sensitivity was demonstrated in an experiment on the identification of voiced and voiceless stops. Summerfield (1981) presented listeners with a precursor phrase that varied in speaking rate followed by a stimulus token. As the rate of the precursor phrase increased, the voiced-voiceless boundary shifted to shorter voice onset time (VOT) values. Sommers, Nygaard, and Pisoni (1992) found that the intelligibility of isolated words presented in noise was affected by the number of speaking rates that were used to generate the test stimulus ensemble: stimuli drawn from three rates (fast, medium, and slow) were identified more poorly than stimuli from only a single speaking rate.

Prosody

Rate based durational variation is compounded by many other factors, including the location of syntactic boundaries, prosody, and the characteristics of adjacent segments (Klatt, 1976; Lehiste, 1970; Pierrehumbert & Beckman, 1988; Beckman & Edwards, 1994). It is well known that lexical stress has a dramatic effect on the articulations that produce the acoustic signal. However, lexical stress is only one level of a prosodic hierarchy spanning the utterance. Prosody is defined by Beckman and Edwards (1994, p. 8) as "the organizational framework that measures off chunks of speech into countable constituents of various sizes." Different positions within a prosodic structure lead to differences in articulations, which in turn lead to differences in the acoustic signal (Lehiste, 1970; Beckman & Edwards, 1994; Fujimura, 1990; Fougeron & Keating, 1997). For example, vowels that are in the nuclear accented syllable of a sentence (primary sentential stress) have a longer duration, a higher amplitude, and a more extreme articulator displacement than vowels in syllables that do not bear nuclear accent (Beckman & Edwards, 1994; de Jong, 1995). Articulations that are at the edges of prosodic domains also undergo systematic variation which results in changes in the acoustic signal such as lengthened stop closures, greater release burst amplitude, lengthened VOT, and less vowel reduction. These effects have been measured for word-initial versus non-initial positions (e.g., Browman & Goldstein, 1992; Byrd, 1996; Cooper, 1991; Fromkin, 1965; Vaissière, 1988; Krakow, 1989) and at phrase and sentence edges (e.g., Fougeron & Keating, 1996). Finally, the magnitude of a local effect of a prosodic boundary on an articulation interacts in complex ways with global trends that apply across the utterance. One such trend, commonly referred to as declination, is for articulations to become less extreme and for fundamental frequency to fall as the utterance progresses (e.g., Vaissière, 1986; Vayra & Fowler, 1992). Another global trend is for domain edge effects to apply with progressively more force as the edges of progressively larger domains are reached (Klatt, 1975; Wightman et al., 1992; Jun, 1993). These factors interact with local domain edge effects in a way that indicates a nested hierarchical prosodic structure (Jun, 1993; Maeda, 1976). However, the number of levels and the relative strength of the effect may be a talker dependent factor (Fougeron & Keating, 1997).

Semantics and Syntax

In addition to prosodic structure, syntactic and semantic structure have substantial effects on the fundamental frequency, patterns of duration, and relative intensities of vowels (Klatt, 1976; Lehiste, 1967; Lieberman, 1963).
For example, when a word is uttered in a highly predictable semantic and syntactic position, it will show a greater degree of vowel reduction (centralization), with lower amplitude and a shorter duration than the identical word in a position with low contextual predictability (Fowler & Housum, 1987; Lieberman, 1963). These production differences are correlated with speech intelligibility; if the two words are isolated from their relative contexts, the word from the low-predictability context is more intelligible than the word from the high-predictability context. This type of effect is hypothesized to be the result of the talker adapting to the listener's perceptual needs (Lindblom, 1990; Fowler, 1986; Fowler & Housum, 1987): the more information the listener can derive from the conversational context, the less effort a talker needs to spend maintaining the intelligibility of the utterance. The reduced speech is referred to as hypo-articulated and the non-reduced speech is referred to as hyper-articulated (Lindblom, 1990).


Similar patterns of variability can be seen in many other production related phenomena such as the Lombard reflex (described below). This variability interacts with other factors like speaking rate and prosody in a complex fashion, making subsequent normalization extremely difficult. Yet, listeners are able to extract and use syntactic, semantic, and prosodic information from the lawful variability in the signal (Klatt, 1976; McClelland & Elman, 1986).

Environmental Conditions

There are many factors that can cause changes in a talker's source characteristics and in the patterns of duration and intensity in the speech signal that are not directly related to the talker or to the linguistic content of the message. These include the relative distance between the talker and the hearer, the type and level of background noise, and transmission line characteristics. For example, in a communicative situation in which there is noise in the environment or transmission line, there is a marked rise in amplitude of the produced signal that is accompanied by changes in the source characteristics and changes in the dynamics of articulatory movements, which together are known as the Lombard reflex (Cummings, 1995; Gay, 1977; Lombard, 1911; Lane & Tranel, 1971; Schulman, 1989). The Lombard reflex is thought to result from the need of the talker to maintain a sufficiently high signal-to-noise ratio to maintain intelligibility. As the environmental conditions and the distance between talker and hearer are never identical across instances of any linguistic unit, it is guaranteed that no two utterances of the same word in the same syntactic, semantic, and prosodic context will be identical. Furthermore, in natural settings the level of environmental noise tends to vary continuously so that even within a sentence or word, the signal may exhibit changes. Similarly, if the talker and listener have the ability to communicate using both the visual and auditory channels, the resulting speech signal exhibits selective reductions such as those seen for high semantic context or good signal to noise ratios; but when there is no visual channel available, the resultant speech is marked by hyper-articulation that is similar to that seen for the Lombard reflex or for low semantic predictability contexts (Anderson, Sotillo, & Doherty-Sneddon, 1997). Like other types of hypo- and hyper-articulation, the variation based on access to visual information is highly correlated with speech intelligibility.

Physiological Factors

Among the most commonly cited sources of between talker variation are differences in the acoustic signal based on a talker's anatomy and physiology. The overall length of the vocal tract and the relative size of the mouth cavity versus the pharyngeal cavity determine the relative spacing of the formants in vowels (Chiba & Kajiyama, 1941; Fant, 1960, 1973). These differences underlie some of the male-female and adult-child differences in vowels and resonant consonants (Fant, 1973; Bladon, Henton, & Pickering, 1984; Hagiwara, 1995). Moreover, vocal tract length may also contribute to observed differences in obstruent voicing (Flege & Massey, 1980) and fricative spectra (Schwartz, 1968; Ingemann, 1968).
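The dependence of formant spacing on overall vocal tract length can be illustrated with the uniform-tube (quarter-wavelength) approximation that underlies the source-filter account cited above. The sketch below is only a rough illustration under that simplifying assumption; the tract lengths used are illustrative values, and real vowel formants depend on the full shape of the tract, not just its length.

```python
# Quarter-wavelength resonances of a uniform tube closed at the glottis and open at
# the lips: F_n = (2n - 1) * c / (4 * L). This shows how overall tract length alone
# shifts formant spacing; it is not a model of any particular vowel.

SPEED_OF_SOUND = 35000.0  # cm/s, an approximate value for warm, moist air

def neutral_formants(tract_length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube of the given length (a schwa-like vowel)."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * tract_length_cm)
            for n in range(1, n_formants + 1)]

# Illustrative (hypothetical) lengths: roughly 17.5 cm and 14.5 cm vocal tracts.
for label, length in [("longer tract (17.5 cm)", 17.5), ("shorter tract (14.5 cm)", 14.5)]:
    formants = ", ".join(f"{f:.0f} Hz" for f in neutral_formants(length))
    print(f"{label}: {formants}")
```

With these assumed lengths the first three resonances fall near 500, 1500, and 2500 Hz for the longer tract and near 600, 1800, and 3000 Hz for the shorter one, which is the kind of overall rescaling of the formant pattern that distinguishes talkers with different tract lengths.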
Physiological differences among male, female, and child laryngeal structures also contribute to observed differences in source characteristics such as fundamental frequency, spectral tilt, and noisiness (Henton & Bladon, 1985; Klatt & Klatt, 1990). Other types of physiologically based differences among talkers that are expected to have an effect on the acoustic signal include dentition, size and doming of the hard palate, and neurological factors such as paralysis or motor impairments. The importance of talker specific variability in the perception of linguistic contrasts was first reported by Ladefoged and Broadbent (1957). Listeners were presented with a precursor phrase in which the relative spacing of the formants was manipulated to simulate established differences in vocal tract length. The stimulus was one of a group of target words in which the formant spacing remained fixed. The listeners' percepts were shifted by the precursor sentence.


In a follow-up experiment, a different group of listeners was presented with the same set of stimuli under a variety of conditions and instructions (Ladefoged, 1967). Even when listeners were told to ignore the precursor sentence or when the sentence and the target word were presented from different loudspeakers, the vowel in the target word was reliably shifted by the formant manipulation of the precursor sentence. This effect was only successfully countered by placing the target word before the sentence or by having the listeners count aloud for 10 seconds between the precursor and hearing the target word.

Dialectal and Idiolectal Differences

In addition to physiologically based differences, there are a number of socioeconomic and regional variables that affect the production of speech. Perception of speech from a different dialect can be a challenging task. Peterson and Barney (1952) found differences in dialect to be one of the most important sources of confusions in the perception of vowel contrasts. Research on improving communications reliability found that training talkers to avoid dialectal pronunciations in favor of Standard English was much easier than training listeners to adapt to a variety of dialects (Black & Mason, 1946). Differences between individual speakers' styles, or idiolects, also require a certain amount of adaptation on the part of the listener. Dialectal and idiolectal variability have received relatively little attention in the speech perception literature and are generally treated as additional sources of noise which are discarded in the process of normalization.

Robustness of Speech Perception

In everyday conversational settings there are many different sources of masking noise and distortions of the speech signal, yet only under the most extreme conditions is perceptual accuracy affected. Much of the robustness of speech comes from the redundant information which is available to the listener. Since the main goal of speech is the communication of ideas from the talker to the hearer (normally under less than optimal conditions), it is not surprising that spoken language is a highly redundant system of information transmission. While redundancy of information in a transmission system implies inefficient encoding, it facilitates error correction and recovery of the intended signal in a noisy environment and ensures that the listener recovers the talker's intended message. The redundancy of speech resides, in part, in the highly structured and constrained nature of human language. Syntactic and semantic context play a large role in modulating the intelligibility of speech. Words in a sentence are more predictable than words spoken in isolation (Fletcher, 1929; French & Steinberg, 1947; Miller, 1962; Miller, Heise, & Lichten, 1951; Pollack & Pickett, 1964). The sentence structure (syntax) of a particular language restricts the set of possible words that can appear at any particular point in the sentence to members of appropriate grammatical categories. The semantic relationships between words also aid in perception by further narrowing the set of words that are likely to appear in a sentence. It has been shown experimentally that limiting the set of possible words aids in identification. For example, Miller et al. (1951) found that limiting the vocabulary to digits alone results in an increase in speech intelligibility.
More generally, lexical factors like a word's frequency of usage and the number of acoustically similar words have been shown to have a dramatic impact on a word's intelligibility (Anderson, 1962; Luce, 1986; Treisman, 1978). Phonological structure also constrains the speech signal and facilitates the listener's perception of the intended message. Prosodic structure and intonation patterns provide auditory cues to syntactic structure, which reduces the number of possible parses of an utterance. The syllable structure and stress patterns of a language limit the number of possible speech sounds at any particular point in an utterance, which aids in identifying words (see for example Cutler, 1997; Norris & Cutler, 1995).
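The "number of acoustically similar words" mentioned above is commonly operationalized as a word's one-phoneme neighborhood: the set of words formed by a single phoneme substitution, deletion, or addition. The following minimal sketch computes such a neighborhood over a toy lexicon; the lexicon and the space-separated transcriptions are hypothetical and are included only to make the notion concrete.

```python
# Minimal sketch of a similarity "neighborhood": entries within one phoneme edit of
# a target transcription. The toy lexicon and transcriptions are illustrative only.

def neighbors(target, lexicon):
    """Return lexicon entries within one phoneme edit of the target transcription."""
    t = target.split()
    found = []
    for word, transcription in lexicon.items():
        w = transcription.split()
        if transcription == target or abs(len(w) - len(t)) > 1:
            continue
        if edit_distance_is_one(t, w):
            found.append(word)
    return found

def edit_distance_is_one(a, b):
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # True if deleting one symbol from the longer sequence yields the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

toy_lexicon = {"cat": "k ae t", "bat": "b ae t", "cast": "k ae s t",
               "at": "ae t", "cut": "k ah t", "dog": "d ao g"}
print(neighbors("k ae t", toy_lexicon))  # ['bat', 'cast', 'at', 'cut']
```

Counting the entries returned gives the neighborhood density of the target, and averaging their usage frequencies gives a neighborhood frequency, the two quantities that figure prominently in the word recognition models discussed later in this chapter.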


Much of the top down information in language is contextual in nature and resides in the structural constraints on a given language and not in the speech signal itself (Jelinek, 1998). However, because prosodic, syntactic, and semantic factors create systematic variability in production (discussed in the variability section), the signal contains a significant amount of information about linguistic structures larger than the segment, syllable, and word. While the role of suprasegmental information (above the level of the phoneme) has traditionally received less attention in the perception literature, there have been a few studies that reveal the richness of suprasegmental information in the speech signal. In spectrogram reading experiments, Cole, Rudnicky, Reddy, and Zue (1978) demonstrated that the acoustic signal is rich in information about the segmental, lexical, and prosodic content of an utterance. An expert spectrogram reader who was given the task of transcribing an utterance of unknown content using speech spectrograms alone achieved an 80-90 percent accuracy rate. This finding demonstrates that not only are features that cue segmental contrasts present in the signal, but prosodic and word boundary information is also available. However, it is not clear from these spectrogram reading experiments whether the features that the transcriber used are those that listeners use. A study that tests the ability of listeners to draw on the richness of the signal was conducted by Liberman and Nakatani (cited in Klatt, 1979). Listeners who were given the task of transcribing pseudo-words embedded in normal English sentences achieved better than 90 percent accuracy. There are numerous other studies that demonstrate the importance of prosodic melody (e.g., Collier & 't Hart, 1975; Klatt & Cooper, 1975) in sentence and word parsing. For example, Lindblom and Svensson (1973), using stimuli in which the segmental information in the signal was removed, found that listeners could reliably parse sentences based on the prosodic melody alone. Prosody has been found to play a role in perceptual coherence (Darwin, 1975; Studdert-Kennedy, 1980) and to play a central role in predicting words of primary semantic importance (e.g., Cutler, 1976). A second source of the redundancy in speech comes from the fact that the physical signal is generated by the vocal tract. As we have already noted, speech sounds are overlapped, or coarticulated, when they are produced, providing redundant encoding of the signal. The ability to coarticulate, and thereby provide redundant information about the stream of speech sounds, serves both to increase transmission rate (Liberman, 1996) and to provide robustness to the signal (Wright, 1996). Redundancy in the acoustic signal has been tested experimentally by distorting, masking, or removing aspects of the signal and exploring the effect these manipulations have on intelligibility. For example, connected speech remains highly intelligible when the speech power is attenuated below 1800 Hz or when it is attenuated above 1800 Hz (French & Steinberg, 1947). This finding indicates that speech information is distributed redundantly across lower and higher frequencies. However, not all speech sounds are affected equally by frequency attenuation. Higher frequency attenuation causes greater degradation for stop and fricative consonants, while lower frequency attenuation results in greater degradation of vowels, liquids (/r/ and /l/ in English), and nasal consonants (Fletcher, 1929), a manipulation sketched below.
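The band-attenuation manipulation just described can be approximated with standard filtering tools. The sketch below low-pass or high-pass filters a recording at 1800 Hz before it would be presented for identification; the file name and filter order are illustrative assumptions, not details of the original experiments.

```python
# A sketch of the band-attenuation manipulation: keep only energy below ("low") or
# above ("high") a cutoff before presenting the speech for identification.
# Requires NumPy and SciPy; "sentence.wav" is a hypothetical recording.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def band_attenuate(samples, sample_rate, cutoff_hz=1800.0, keep="low"):
    """Return the signal with energy retained only below or above the cutoff."""
    sos = butter(8, cutoff_hz, btype="lowpass" if keep == "low" else "highpass",
                 fs=sample_rate, output="sos")
    return sosfiltfilt(sos, samples)

sample_rate, speech = wavfile.read("sentence.wav")
speech = speech.astype(np.float64)
low_only = band_attenuate(speech, sample_rate, keep="low")    # vowels, nasals, liquids survive best
high_only = band_attenuate(speech, sample_rate, keep="high")  # stop bursts and frication survive best
```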
For example, the place of articulation distinctions among fricatives are carried in large part by the fricative noise, which tends to be concentrated in higher frequencies. Attenuating these particular frequencies results in an increase in fricative confusions and a decrease in intelligibility. Speech can be distorted in a natural environment by reverberation. Experiments on the perception of nonsense syllables found that intelligibility was relatively unaffected by reverberation with a delay of less than 1.5 seconds. Reverberation with a greater delay caused a marked drop off in intelligibility (e.g., Knudsen, 1929; Steinberg, 1929). Under extremely reverberatory conditions, individual speech sounds blend together as echoes overlap in a way that causes frequency and phase distortions. Again, not all speech sounds are equally affected by reverberation. Long vowels and fricatives, which have an approximately steady state component, are much less susceptible to


degradation than short vowels and nasal and stop consonants, which are distinguished from each other by relatively short and dynamic portions of the signal (e.g., Steinberg, 1929). Overall, the intelligibility of individual speech sounds in running speech is in part a function of their intensity (Fletcher, 1929). In general, vowels are more intelligible than consonants. More specifically, those consonants with the lowest intensity have the poorest intelligibility. Of these, the least reliably identified in English are the non-sibilant fricatives, such as those found in fat, vat, thin, and this. These fricatives achieve 80 percent correct identification only when words are presented at relatively high signal to noise ratios (Fletcher, 1929). These fricative noises are also spectrally similar, adding to their confusability with each other. English is one of the few languages that contrasts non-sibilant fricatives (Maddieson, 1984); their cross-linguistic rarity is presumably due to their confusability and low intelligibility. By contrast, the sibilant fricatives (for example, those found in the words sap, zap, Confucian, and confusion) have a much greater intensity and are more reliably identified in utterances presented at low signal to noise ratios. The next most intelligible sounds are the stop consonants, including /p/, /t/, and /k/ in English, followed by the vocalic consonants such as the nasals and liquids. Vowels are the most identifiable. The low vowels, such as those found in the words cot and caught, are more easily identified than the high vowels, such as those found in peat and pit.

Models and Theories


Theories of human speech perception can be divided into two broad categories: those that attempt to model segmentation of the spoken signal into linguistic units (which we refer to as models of speech perception) and those which take as input a phonetic transcription and model the access of the mental lexicon (which we refer to as models of spoken word recognition). Almost all models of speech perception try to identify phonemes in the signal. A few models go straight to the word level, and thus encompass the process of word recognition as well. These models are discussed in the section on word recognition below.

Models of Human Speech Perception

Most current models of speech perception have as their goal the segmentation of the spoken signal into discrete phonemes. All of the models discussed in this section have this property. In addition, these models assume that at some level of perceptual processing there are invariant features which can be extracted from the acoustic signal, though which aspects are taken to be invariant depends on the model.

Invariance Approaches

The most extensively pursued approach to solving the variability problem is the search for invariant cues in the speech signal. This line of research, which dates back to the beginning of modern speech research in the late 1940s, has revealed a great deal of coarticulatory variability. It has resulted in a series of careful and systematic searches for invariance in the acoustic signal that have revealed a wealth of empirical data. Although researchers investigating acoustic-phonetic invariance differ in their approaches, they have in common the fundamental assumption that the variability problem can be resolved by studying more sophisticated cues than were originally considered (e.g., Blumstein & Stevens, 1980; Fant, 1967; Mack & Blumstein, 1983; Kewley-Port, 1983). Early experiments on speech cues in speech perception used copy-synthesized stimuli in which much of the redundant information in the signal had been stripped away. In addition, acoustic analysis of speech using spectrograms focused only on gross characteristics of the signal.


One approach, termed static (Nygaard & Pisoni, 1995), is based on the acoustic analysis of simple CV syllables. This approach focused on complex integrated acoustic attributes of consonants that are hypothesized to be invariant in different vowel contexts (e.g., Blumstein & Stevens, 1979). Based on Fant's (1960) acoustic theory of speech production, Stevens and Blumstein (1978, 1981; also Blumstein & Stevens, 1979) hypothesized invariant relationships between the articulatory gestures and acoustic features that are associated with a particular segment. They proposed that the gross spectral shape at the onset of the consonant release burst is an invariant cue for place of articulation. In labial stops (/p/ and /b/), the spectral energy is weak and diffuse, with a concentration of energy in the lower frequencies. For the alveolar stops (/t/ and /d/) the spectral energy is strong but diffuse, with a concentration of energy in the higher frequencies (around 1800 Hz). Velar stops (/k/ and /g/) are characterized by strong spectral energy that is compact and concentrated in the mid frequencies (around 1000 Hz). A different approach, termed dynamic (Nygaard & Pisoni, 1995), has been proposed by Kewley-Port (1983). She employed auditory transformations of the signal, looking for invariant dynamic patterns in running spectra of those transformations. The dynamic approach is promising because it can capture an essential element of the speech signal: its continuous nature. More recent static approaches have incorporated an element of dynamic invariance (Mack & Blumstein, 1983; Lahiri, Gewirth, & Blumstein, 1984). As is noted by Nygaard and Pisoni (1995), any assumption of invariance necessarily constrains the types of processes that underlie speech perception. Speech perception will proceed in a bottom up fashion, with the extraction of invariant features or cues being the first step in the process. Invariance explicitly assumes abstract canonical units and an elimination of all forms of variability and noise from the stored representation (see Pisoni, 1997). This includes many sources of variation that are potentially useful to the listener in understanding an utterance. For example, indexical and prosodic information is discarded in the reduction of the information in the signal to a sequence of idealized symbolic linguistic invariants. While these approaches to speech sound perception have provided some promising candidates for extraction of invariant features from the signal (Sussman, 1989) and have produced invaluable empirical data on the acoustic structure of the speech signal and its auditory transforms, they have done so for only a very limited set of consonants in a very limited set of contexts. For example, the three places of articulation treated by Stevens and Blumstein represent slightly less than one quarter of the known consonant places of articulation in the world's languages (Ladefoged & Maddieson, 1996). Even for the same places of articulation, the features found in English may not invariantly classify segments of other languages (Lahiri, Gewirth, & Blumstein, 1984). Furthermore, much of the contextual variability that is non-contrastive in English, and therefore removed in the invariance approach, forms the basis for a linguistic contrast in at least one other language. Therefore, the type of processing that produces invariant percepts must be language specific.
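To make the static onset-spectrum idea described above concrete, the following rough sketch takes a short window at the release burst, computes its gross spectral shape, and asks whether the energy is diffuse-falling (labial-like), diffuse-rising (alveolar-like), or compact in the mid frequencies (velar-like). The frequency bands, thresholds, window length, and burst onset time are hypothetical stand-ins for illustration; they are not Stevens and Blumstein's actual templates.

```python
# Rough sketch of classifying gross onset spectral shape for stop place of articulation.
import numpy as np

def gross_onset_shape(samples, sample_rate, burst_onset_s, window_s=0.026):
    """Label the gross spectral shape of a short window starting at the release burst."""
    n = int(window_s * sample_rate)
    start = int(burst_onset_s * sample_rate)
    frame = samples[start:start + n] * np.hanning(n)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    low = spectrum[(freqs >= 100) & (freqs < 1200)].mean()
    mid = spectrum[(freqs >= 1200) & (freqs < 2400)].mean()
    high = spectrum[(freqs >= 2400) & (freqs < 4000)].mean()
    if mid > 1.5 * low and mid > 1.5 * high:
        return "compact (velar-like)"
    return "diffuse-rising (alveolar-like)" if high > low else "diffuse-falling (labial-like)"
```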
Motor Theory

One of the ways in which the perception of speech differs from many other types of perception is that the perceiver has intimate experience in the production of the speech signal. Every listener is also a talker. The motor theory (Liberman et al., 1967) and the revised motor theory (Liberman & Mattingly, 1986, 1989) take advantage of this link by proposing that perception and production are related by a common set of neural representations. Rather than looking for invariance in the acoustic signal, the perceiver is hypothesized to recover the underlying intended phonetic gestures from an impoverished and highly encoded speech signal. The intended gestures of the talker are therefore assumed to be perceived directly via an innate phonetic module conforming to the specifications of modularity proposed by Fodor (1983). The phonetic module is proposed to have evolved for the special purpose of extracting intended gestures preemptively


(Liberman, 1982; Mattingly & Liberman, 1989; Whalen & Liberman, 1987). That is, the phonetic module gets the first pass at the incoming acoustic signal and extracts the relevant phonetic gestures, passing the residue on for general auditory processing (see also Gaver, 1993). A variety of experiments showing that speech is processed differently from non-speech provide evidence for a neural specialization for speech perception. Some of these findings have subsequently been shown to apply equally well to non-speech stimuli (see, for example, Pisoni, 1977; Jusczyk, 1980; Eimas & Miller, 1980; Repp, 1983a, b; for a review see Goldinger, Pisoni, & Luce, 1996). While some evidence for the specialness of speech still stands, it is uncertain whether appropriate non-speech controls to compare to speech have been considered. A number of ways of creating complex signals which are more or less acoustically equivalent to speech have been considered; however, these experiments do not explore whether there are controls which are communicatively or informationally equivalent to speech. A good example of the importance of testing the evidence with informationally equivalent stimuli can be found in a phenomenon known as duplex perception (Rand, 1974), which has been cited frequently as strong evidence for a speech specific module (Liberman, 1982; Repp, 1982; Studdert-Kennedy, 1982; Liberman & Mattingly, 1989). To elicit duplex perception, two stimuli are presented dichotically to a listener wearing headphones. An isolated third formant transition, which sounds like a chirp, is presented in one ear while the base syllable, which is ambiguous because it has had the third formant transition removed, is presented in the other ear. The isolated formant transition fuses with the base syllable, which is then heard as an unambiguous syllable in the base ear. Additionally, the chirp is perceived separately in the other ear. Duplex perception was found to occur with speech stimuli but not with acoustically equivalent stimuli. However, the informational equivalence of the stimuli was brought into question by Fowler and Rosenblum (1990), who found that a natural sound, the sound of a door slamming, patterned more like speech in a duplex perception task, and differently from laboratory generated non-speech controls (which are complex artificial sound patterns). A door slam is ecologically relevant, as it gives the hearer information about an action which has occurred in the world (see also Pastore, Schmuckler, Rosenblum, & Szczesiul, 1983; Nusbaum, Schwab, & Sawusch, 1983; Gaver, 1993). Speech has tremendous social significance and is probably the most highly practiced complex perceptual task performed by humans. These factors have not been adequately considered when explaining differences between speech and non-speech perception. A claim of the original formulation of the motor theory (Liberman et al., 1967) was that the percepts of speech are not the acoustic signals which impinge directly upon the ear, but rather the articulations made by the speaker. One of the striking findings from early experiments was the discontinuity of the acoustic-to-phonemic mapping for stop onset consonants (Liberman, Delattre, & Cooper, 1952). These discontinuities were taken as crucial evidence against an acoustic basis for phoneme categories. However, researchers have found that for some phonemic categories the acoustic mapping is simple while the articulatory mapping is complex.
For example, American English /r/ can be produced with one or more of three distinct gestures, and there is intraspeaker variation in which different gestures are used (Delattre & Freeman, 1968; Hagiwara, 1995; see also Johnson, Ladefoged, & Lindau, 1993). The search for first-order acoustic invariance in speech has been largely unsuccessful, and it is now well known that the articulatory gestures and even their motor commands are not invariant either (e.g., MacNeilage, 1970). In the revised motor theory, the articulatory percepts are assumed to be the speaker's intended gestures, before contextual adjustments and other sources of speaker independent variability in production (Liberman & Mattingly, 1986, 1989). Thus, in terms of the nature of neural representations, the motor theory's proposed linguistic representations are extremely abstract, canonical symbolic entities that can be treated as formally equivalent to abstract phonetic segments. Since neither acoustic nor articulatory categories provide simple dimensions upon


which to base perceptual categories in speech, the coherence of these abstractions as categories can be based on either articulatory or acoustic properties, or both. There are several appealing aspects of the motor theory of speech perception. It places the study of speech perception in an ecological context by linking production and perception aspects of spoken language. It also accounts for a wide variety of empirical findings in a principled and consistent manner. For example, the McGurk effect can be nicely accommodated by a model that is based on perception of gestures, although direct perception (Fowler, 1986; Fowler & Rosenblum, 1991) and FLMP (Oden & Massaro, 1978; Massaro & Cohen, 1993; Massaro, 1989) also incorporate visual information, though in very different ways. Despite the appeal of the motor theory, there remain several serious shortcomings. The proposed perceptual mechanisms remain highly abstract, making effective empirical tests of the model difficult to design. A more explicit model of how listeners extract the intended gestures of other talkers would go far to remedy this problem. In addition, the abstract nature of the intended gestures involves a great deal of reduction of information and therefore suffers from the same shortcomings that traditional phonemic reduction does: it throws away much of the inter- and intra-talker variability which is a rich source of information to the listener.

Direct-Realist Approach

The direct-realist approach to speech perception (Fowler, 1990; Fowler & Rosenblum, 1991) draws on Gibson's (1966) ecological approach to visual perception. Its basic assumption is that speech perception, like all other types of perception, acts directly on events in the perceiver's environment rather than on the sensory stimuli and takes place without the mediation of cognitive processes. An event may be described in many ways, but those that are ecologically relevant to the perceiver are termed distal events. The sets of possibilities for interaction with them are referred to as affordances. The distal event imparts structure to an informational medium, the acoustic signal and reflected light in the case of visible speech, which in turn provides information about the event to the perceiver by imparting some of its structure to the sense organs through stimulation. The perceiver actively seeks out information about events in the environment, selectively attending to aspects of the environmental structure (Fowler, 1990). In speech, the phonetically determined coordinated set of movements of the vocal tract that produce the speech signal are the events that the perceiver is attending to. In this way, the direct realist approach is like the motor theory. However, rather than assuming a speech specific module retrieving intended gestures from an impoverished acoustic signal, the direct realist approach assumes an information rich signal in which the phonetic events are fully and uniquely specified. Because the perception is direct, the direct realist approach views variability and nonlinearity in a different light than most other approaches to speech perception, which are abstractionist in nature. The vocal tract cannot produce a string of static and non-overlapping shapes, so the gestures of speech cannot take place in isolation of each other. Direct perception of gestures gives the listener detailed information about both the gestural and environmental context.
This implies that the perceiver is highly experienced with the signal and, so long as that variation is meaningful, it provides information about the event. Rather than removing noise through a process of normalization, variation provides the perceiver with detailed information about the event, including the talker's size, gender, dialect region, and emotional state, as well as prosodic and syntactic information. Therefore, according to this view, stimulus variation ceases to be a problem of perception and becomes a problem of perceptual organization. While direct perception focuses on the perceived events as gestural constellations roughly equivalent to the phoneme, it is also compatible with the theory to assume the perceived events are words. Thus, we might also consider direct perception as a model of spoken word recognition. Direct perception shares with the exemplar


models (discussed in the word recognition section) the assumption that the variability in the signal is rich in information which is critical to perception. Direct perception is appealing because of its ability to incorporate and use stimulus variability in the signal, and because it makes the link between production and perception transparent. However, there are several important theoretical issues that remain unresolved. One potential problem for a model that permits no mediation of cognitive processes is top down influences on speech perception. As was noted previously, these effects are extremely robust and include phoneme restoration (Samuel, 1981; Warren, 1970), correction of errors in shadowing (Marslen-Wilson & Welsh, 1978), mishearings (Browman, 1978; Garnes & Bond, 1980), lexical bias (Ganong, 1980), syntactic and semantic bias (Salasoo & Pisoni, 1985), and lexical frequency and density bias. Fowler (1986) acknowledges this problem and suggests that there may be special mechanisms for highly learned or automatic behavior and for perceivers hypothesizing information that is not detected in the signal. She suggests that while perception itself must be direct, behavior may often not be directed by perceived affordances. In this way, the direct-realist perspective departs dramatically from other versions of event perception (Gibson, 1966), which have nothing to say about cognitive mediation. Finally, Remez (1986) notes that it is not clear that the perceptual objects in linguistic communication are the gestures which create the acoustic signal. While visual perception of most objects is unambiguous, speech gestures are very different in terms of their perceptual availability (Diehl, 1986; Porter, 1986; Remez, 1986). Fowler proposes that the perception of the articulatory gestural complex is the object of perception, but the articulations themselves are a medium that is shaped by the intended linguistic message. As she notes herself, while visually identified objects are perceived as such, listeners' intuitions are that they perceive spoken language not as a series of sound producing actions (i.e., gestures) but as a sequence of words and ideas. This difference is not necessarily a problem for the model itself, but rather a problem for the way this approach has thus far been employed.

FLMP

A radically different approach to speech perception is represented by informational models which are built around general cognitive and perceptual processes. Most of these models have been developed to explain phonemic perception, and they typically involve multiple processing stages. One example of this approach is the fuzzy logical model of perception, or FLMP (Oden & Massaro, 1978; Massaro & Cohen, 1993; Massaro, 1989). FLMP was developed to address the problem of integrating information from multiple sources, such as visual and auditory input, in making segmental decisions. The criterion for perception of a particular set of features as a particular perceptual unit such as the phoneme is the goodness of the percept's match to a subjectively derived prototype description in memory, arrived at through experience with the language of the listener. In acoustic processing, the speech signal undergoes an acoustic analysis by the peripheral auditory system. Evidence for phonemic features in the signal is evaluated by feature detectors using continuous truth values between 0 and 1 (Zadeh, 1965). Then feature values are integrated and matched against the possible candidate prototypes (a minimal numerical sketch of this evaluation, integration, and decision sequence appears below).
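The sketch assumes the standard description of FLMP's later stages: truth values from each information source are multiplied together, and the response probability is each candidate's goodness relative to the summed goodness of all candidates. The stimulus values below are invented for illustration; they are not fitted parameters from any published experiment.

```python
# Minimal numerical sketch of FLMP-style evaluation, integration, and decision.

def prod(values):
    result = 1.0
    for v in values:
        result *= v
    return result

def flmp_response_probabilities(evidence):
    """evidence: {candidate: [truth values from each information source, each 0..1]}"""
    goodness = {cand: prod(values) for cand, values in evidence.items()}  # integration
    total = sum(goodness.values())
    return {cand: g / total for cand, g in goodness.items()}             # relative goodness

# An ambiguous auditory /ba/-/da/ token paired with clearly /da/-like visual information.
evidence = {"ba": [0.55, 0.10],   # [auditory support, visual support]
            "da": [0.45, 0.90]}
print(flmp_response_probabilities(evidence))
```

With these invented values the clearly /da/-like visual information dominates the ambiguous auditory information (roughly .88 versus .12), which is the kind of cross-modal integration FLMP was designed to capture.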
Because fuzzy algorithms are used, an absolute match is not needed for the process to achieve a phonemic percept. There are several aspects of FLMP which make it an appealing model. The first is that it provides an explicit mechanism for incorporating multiple sources of information from different modalities. This is particularly important considering the role that visual input can play in the speech perception process. Second, it provides a good fit to data from a wide variety of perceptual experiments (see Massaro, 1987, for a review). Third, it is one of the only models of speech perception that is mathematically explicit, because it is based on a precise mathematical framework (Townsend, 1989). However, there are several serious shortcomings to the model. The most severe, noted by Klatt (1989), Repp (1987), and others, is that it is unclear that the fuzzy values are flexible


enough to account for the variation that is observed in the speech signal. Because the model works with features to be matched to stored prototypes in memory, there is still a reliance on invariant features and a dependence on some degree of normalization across the many sources of variability observed in conversational speech. Moreover, the model has no connection to the perception-production link. Finally, FLMP employs a large number of free parameters that are deduced from the data of specific experimental paradigms but which do not transfer well across paradigms.

Models of Spoken Word Recognition

Models of spoken word recognition can be broken down into two types: those that act on a phonemic or broad phonetic representation, and those that work directly on the acoustic input. Models based on a phonemic level are inspired by, or transparently derived from, models of alphabetic reading. Because these models use a unitized input, they explicitly or implicitly assume access to a phonemic or featural representation. These models require either an additional preprocessor which recognizes phonemes, segments, or features, or they assume direct perception of these units from information in the acoustic signal. Models which work on segmental or featural input are by far the most numerous and best known, and only a few that are representative of the diversity of proposals will be discussed here. These are TRACE, NAM, and SHORTLIST. Models that act on the speech signal, or an auditory transformation thereof, necessarily incorporate the speech perception process into the word recognition process. Of the few models of this type, two examples will be discussed: LAFS and exemplar-covering models.

TRACE

The TRACE model (Elman, 1989; Elman & McClelland, 1986; McClelland & Elman, 1986) is an example of an interactive activation/competition connectionist model. The most widely discussed version of TRACE takes allophonic level features as its input. An early form of the model (Elman & McClelland, 1986a) takes the speech signal as its input and relies on feature detectors to extract relevant information; however, this version was quite limited, being built around only nine CV syllables produced by a single talker. TRACE is constructed of three levels representing features, phonemes, and words. The featural level passes activation to the phonemic level, which in turn passes activation to the word level. Within each level, the functional units are highly interconnected nodes, each with a current activation level, a resting level, and an activation threshold. There are bi-directional connections between units of different levels and between nodes within a level. Connections are excitatory between units at different levels that share common properties (e.g., between voice, place, and manner features and a particular consonantal phoneme). Connections between units within a level may be inhibitory; for example, as one place feature at one time slice is activated, it will inhibit the activation of other place features. Connections between units within a level may also be excitatory; for example, a stop consonant at one time slice will facilitate segments that can precede or follow it, such as /s/ or a vowel (depending on the phonotactic constraints of the language). The excitatory and inhibitory links in TRACE have important implications for the types of processing operations within the model. Because of the inhibitory links within a level, TRACE acts in a winner-take-all fashion.
Moreover, the excitatory links provide a mechanism for the contribution of top-down information to the perception of speech sounds. TRACE contrasts with traditional symbolic invariance approaches because it treats coarticulatory variation as a source of information rather than a source of noise; the inhibitory and facilitatory links between one time slice and the next allow adjacent segments to adjust the weighting given to a particular feature or phoneme in a given context (Elman & McClelland, 1986).
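To illustrate the activation dynamics just described, the following schematic interactive-activation sketch runs a tiny phoneme and word network with between-level excitation and within-level inhibition. The two-word lexicon, connection weights, and parameter values are invented for illustration; they are not the published TRACE architecture or parameters.

```python
# Schematic interactive-activation update in the spirit of TRACE (illustrative only).
import numpy as np

phonemes = ["b", "d", "a"]
words = ["ba", "da"]
word_spec = {"ba": ["b", "a"], "da": ["d", "a"]}

phon_act = np.zeros(len(phonemes))
word_act = np.zeros(len(words))
REST, MIN_A, MAX_A = 0.0, -0.2, 1.0
EXCITE, INHIBIT, DECAY = 0.10, 0.20, 0.05

def update(act, net):
    # Activation grows toward MAX_A for positive net input, shrinks toward MIN_A otherwise,
    # and decays toward the resting level.
    grow = np.where(net > 0, (MAX_A - act) * net, (act - MIN_A) * net)
    return np.clip(act + grow - DECAY * (act - REST), MIN_A, MAX_A)

def step(bottom_up):
    global phon_act, word_act
    # Phoneme level: bottom-up input plus top-down feedback from words, minus competition.
    feedback = np.array([sum(word_act[j] for j, w in enumerate(words) if p in word_spec[w])
                         for p in phonemes])
    net_p = bottom_up + EXCITE * feedback - INHIBIT * (phon_act.sum() - phon_act)
    # Word level: excitation from constituent phonemes, minus competition between words.
    support = np.array([sum(phon_act[phonemes.index(p)] for p in word_spec[w]) for w in words])
    net_w = EXCITE * support - INHIBIT * (word_act.sum() - word_act)
    phon_act, word_act = update(phon_act, net_p), update(word_act, net_w)

# Slightly /b/-biased, ambiguous consonant information plus clear /a/ information.
for _ in range(30):
    step(np.array([0.06, 0.04, 0.10]))
print(dict(zip(words, word_act.round(3))))  # "ba" ends up more active than "da"
```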


Despite these advantages, there are two major problems with TRACE. The first is that although it can use the coarticulatory variation in segmental contexts as information, it is unclear how the model would incorporate other sources of lawful variation such as prosody, rate, or differences among talkers. The second is that TRACE's multiple instantiations of the network across time are considered to be neurally and cognitively implausible (see Cutler, 1995). More recent connectionist models have proposed recurrent neural networks as a way of representing the temporal nature of speech (Elman, 1990; Norris, 1990). Connectionist models such as TRACE are similar to FLMP because they rely on continuous rather than discrete representations. Continuous activation levels allow for varying degrees of support for competing perceptual hypotheses. Connectionist models also allow for the evaluation and integration of multiple sources of input and rely on general purpose pattern matching schemes with a best fit algorithm. But the connectionist models and the FLMP differ in the degree to which top down influences can affect low level processes. Massaro (1989) claims that connectionist models that have top down and bottom up connections are too powerful, predicting both attested and unattested results. Massaro argues that FLMP allows top down bias in the perception process while TRACE's two way connections result in top down induced changes in perceptual sensitivity. This is an open issue in need of further research.

The Neighborhood Activation Model

The neighborhood activation model, or NAM (Luce, 1986; Luce, Pisoni, & Goldinger, 1990), shares with TRACE the notion that words are recognized in the context of other words. A pool of word candidates is activated by acoustic/phonetic input. However, the pool of activated candidates is drawn from the similarity neighborhood of the word (Landauer & Streeter, 1973; Coltheart, Davelaar, Jonasson, & Besner, 1977; Luce, 1986; Andrews, 1989; Luce & Pisoni, 1998). A similarity neighborhood is the set of words that is phonetically similar to the target word. Relevant characteristics of the similarity neighborhood are its density and its neighborhood frequency. The density of a word is the number of words in its neighborhood. The neighborhood frequency of a word is the average frequency of the words in its neighborhood. There is a strong frequency bias in the model, which allows it to deal with apparent top down word frequency effects without positing explicit bidirectional links. Rather than unfolding over time, similarity in NAM is a static property of the entire word. As a general model of word recognition, NAM is less fully developed, as it assumes not only a phonemic level but word segmentation as well (see Auer & Luce, 1997, for a revised version called PARSYN which resolves some of these problems). Moreover, NAM has been implemented only for monosyllabic words. NAM can account for a specific set of lexical similarity effects not treated in other models, and it is attractive because it is grounded in a more general categorization model based on the Probability Choice Rule (R.D. Luce, 1959).

SHORTLIST

SHORTLIST (Norris, 1991, 1994) parses a phonemic string into a set of lexical candidates, which compete for recognition. SHORTLIST can be seen as an evolutionary combination of both TRACE and Marslen-Wilson's Cohort model (Marslen-Wilson & Welsh, 1978). A small set (the shortlist) of lexical candidates competes in a TRACE style activation/competition network.
The phonemic string is presented gradually to the model, but candidates with early matches to the string have an advantage, due to their early activation, much like the original cohort model. However, as Cutler (1996) notes, SHORTLIST avoids the cognitive implausibility of TRACE's temporal architecture, which effectively duplicates the network at each time slice. SHORTLIST also avoids the cohort model's over-dependence on word initial information. The model takes phonemic information as its input, and strictly bottom-up information determines the initial candidate set. The candidate set is determined by comparing whole words, but with each strong (i.e., stressed) syllable acting as a potential word onset. This use of prosodic information sets this model apart from others


and gives SHORTLIST the ability to parse words from a phrase or utterance represented as a string of phonemes and allophones.

LAFS

Lexical Access from Spectra (Klatt, 1989), or LAFS, is a purely bottom up model of word recognition which compares the frequency spectrum of the incoming signal to stored templates of the frequency spectra of words. The stored templates are context-sensitive spectral prototypes derived from subjective experience with the language and consist of all possible diphone (CV and VC) sequences and all cross-word boundaries in the language, resulting in a very large decoding network. Thus, LAFS addresses the problems of contextual variability by precompiling the coarticulatory and word boundary variations into stored representations in an integrated memory system. The model attempts to address interspeaker and rate based variability by using a best fit algorithm to match incoming spectra with stored spectral templates. LAFS fully bypasses the intermediary featural and segmental stages of processing; the perceptual process consists of finding the best match between the incoming spectra and paths through the network. The advantages of such a strategy are numerous and have been discussed in detail by Klatt (1989). Rather than discarding allophonic and speaker specific detail through reduction to an abstract symbolic representation such as features or segments, the input spectra are retained in full detail. This frees the model from dealing with problems of acoustic invariance across contexts. Since LAFS does not make segmental phonemic level decisions but instead performs recognition at the word level, there is less data reduction than in traditional phonemic based models. More information can be brought to bear on the lexical decision, thereby reducing the probability of error and increasing the ability of the system to recover from error (Miller, 1962). Because the stored prototypes are based on subjective learning, there can be local tuning and there is less chance of over-generalization. The perceptual process is explicit, being based on a distance/probability metric (Jelinek, 1985), and the scoring strategy is uniform throughout the network. Despite the power of the approach, there are several problems with the LAFS strategy that Klatt (1989) acknowledges and some that have been raised since then. The most serious is that while LAFS is constructed to accommodate coarticulatory and word-edge variability, it is unlikely that the distance metric is powerful enough to accommodate the full range of variability seen in spoken language. Furthermore, it is nearly impossible to preprocess and store in memory all the sources of variability cited in the variability section above. Finally, much of the stimulus variability in speech comes not in spectra alone but in timing differences as well (Klatt cites the example of variable onset of prenasalization), and LAFS is not built to accommodate much of the temporal nature of speech variation. LAFS is obviously constructed to model a fully developed adult's perception process and contains some developmentally implausible assumptions (but see Jusczyk, 1997, for a developmentally oriented adaptation of LAFS called WRAPSA).
Its structure involves a priori knowledge about all possible diphones in the language and all cross-word boundary combinations; different languages have varying inventories of speech sounds and different constraints on how these sounds can combine within and across words, yet the model depends on these being precompiled for speech perception and word identification to proceed. Cutler (1995) notes that the redundancy inherent in precompiling all word boundaries for every possible word pair separately is psychologically implausible. In addition, recent phonetic research has found different boundary effects at multiple levels of the prosodic hierarchy. Requiring precompiled boundaries at the foot, the intonation phrase, and possibly other levels adds to the psychological implausibility of the model. Finally, because the model is explicitly bottom up, it cannot properly model top down factors like lexical, prosodic, and semantic bias in lexical decisions.
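As a concrete, if drastically simplified, illustration of the path-matching idea described above, the following sketch scores an incoming sequence of spectra against stored word templates with a basic dynamic-programming alignment and picks the lowest-cost word. The whole-word templates and toy two-dimensional "spectra" are stand-ins for illustration; a real LAFS network stores diphone spectra and shares paths across words.

```python
# Toy sketch of spectral template matching by best-path (DTW-style) alignment.
import numpy as np

def alignment_cost(input_spectra, template_spectra):
    """Accumulated Euclidean distance of the best monotonic alignment (basic DTW)."""
    n, m = len(input_spectra), len(template_spectra)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(input_spectra[i - 1] - template_spectra[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(input_spectra, word_templates):
    scores = {w: alignment_cost(input_spectra, t) for w, t in word_templates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical usage with 2-D vectors standing in for full spectral frames.
templates = {"ba": np.array([[1.0, 0.2], [0.8, 0.5], [0.2, 0.9]]),
             "da": np.array([[0.3, 0.9], [0.5, 0.6], [0.2, 0.9]])}
frames = np.array([[0.9, 0.3], [0.7, 0.5], [0.3, 0.9], [0.2, 0.9]])
print(recognize(frames, templates))
```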


Exemplar Based Models of Word Recognition

Like LAFS, exemplar based models bypass the reduction of the speech signal to featural and segmental units in the word identification process. However, unlike LAFS, exemplar models are instance based rather than relying on precompiled prototypes stored in memory. In exemplar models, there are no abstract categories (whether learned prototypes or innate features and phonemes). Instead, the set of all experienced instances of a category forms the basis for the category. The process of categorization therefore involves computing the similarity of the stimulus to every stored instance of every category (e.g., Hintzman, 1986; Nosofsky, 1988; Nosofsky, Kruschke, & McKinley, 1992). Although this type of model behaves as if it works on idealized prototype categories (Hintzman, 1986), categorization is a result of computations and the decision process rather than stored prototypes of the stimulus. Exemplar models of perception and memory are fairly widespread in cognitive psychology, but they have only rarely been applied to speech perception and spoken word recognition (for further background on exemplar models in speech perception see Goldinger, 1997). As discussed at the beginning of this chapter, one of the motivations for proposing that the speech signal is reduced to abstract categories such as phonemes and features has been the widespread belief that memory and processing limitations necessitate data reduction (Haber, 1967). However, more recent empirical data suggest that earlier theorists largely overestimated memory and processing limitations. There is now ample evidence in the speech perception and word recognition literature of long term preservation of instance specific details about the acoustic signal. Goldinger (1997) discusses in detail the motivation for exemplar models and cites evidence of episodic memory for such language relevant cases as faces (e.g., Bahrick, Bahrick, & Wittlinger, 1975), physical dynamics (e.g., Cutting & Kozlowski, 1977), modality of presentation (e.g., Stansbury, Rubin, & Linde, 1973), exact wording of sentences (Keenan, MacWhinney, & Mayhew, 1977), and talker specific information in spoken words (e.g., Carterette & Barneby, 1975; Hollien, Majewski, & Docherty, 1982; Papcun, Kreiman, & Davis, 1989; Palmeri, Goldinger, & Pisoni, 1993). Taken together, this recent evidence has inspired some psycholinguists and speech researchers to reconsider exemplar based approaches to speech perception and spoken word recognition. Although limited, one of the more explicitly implemented models is that proposed by Johnson (1997), which is based on Kruschke's (1992) connectionist ALCOVE model. In a description of his model, Johnson (1997) discusses several potential problems that an exemplar approach to speech perception must address to be a realistic model of human spoken word recognition. Because most exemplar models have been developed for the perception of static images, these models must be revised and elaborated to take into account the time-varying nature of spoken language. This problem is addressed by considering the role of short term auditory memory in the processing of speech (Baddeley et al., 1998). The incoming signal is sliced into auditory vectors in both the frequency and time domains.
As the signal is processed and encoded, the spectral vectors in the short-term auditory buffer are matched against all stored vectors, and matching vectors are activated, adding to their previous activation levels and thereby representing a veridical short-term memory of the signal (Crowder, 1981). Introducing time in this way permits the modeling of temporal selective attention and of segmentation strategies. For example, language-specific segmentation strategies such as those that inspired the SHORTLIST model might be modeled by probing the matrix cyclically for boundary-associated acoustic events.

A second problem that must be addressed if exemplar models are to be considered cognitively plausible is that of memory limitations. Although there is now a great deal of evidence that much fine detail about specific instances of spoken language is encoded and stored in memory, it is implausible that each experienced auditory pattern is stored at a separate location in the brain or that every such instance could be retrieved individually. Johnson uses ALCOVE (Kruschke, 1992) as the foundation for his implementation of an exemplar model because it uses a covering (vector) map to store exemplars. Locations on the map represent vectors of possible auditory properties (based on known auditory sensitivity), and Johnson suggests that vector quantization might be a useful approach to constructing such a map. While the storage and matching mechanisms differ from those of a purely exemplar-based model in which each instance is stored as a fully separate trace, Johnson's model preserves much of the instance-specific information. Only where two instances are identical (from an auditory point of view) at a particular vector does the model collapse the information into one representation.

Top-down influences on speech perception pose problems for fully bottom-up processing models, and it might seem that an exemplar model would leave no room for lexical or semantic bias in the decision process. However, usage frequency, recency, and contextual factors can be modeled with base activation levels and attention weights. For example, a high-frequency lexical item would have a high base activation level that is directly tied to its frequency of occurrence, with a time decay factor (Nosofsky et al., 1992). As syntactic and semantic conditions increase the predictability of a set of words, their base activation rises. Attention weights can be adjusted to model selective attention, the shrinking and expanding of the perceptual space (Nosofsky, 1986) that has frequently been observed. One example of such perceptual distortion is the perceptual magnet effect, in which the perceptual space appears to be warped by prototypes (Kuhl, 1991), resulting in decreased sensitivity to changes along a dimension within the range of variation of a particular category but increased sensitivity across a category boundary (see, however, Lively & Pisoni, 1998).

Finally, Johnson suggests that exemplar models are also capable of incorporating the production-perception link. As talkers produce an utterance, they also hear the utterance. Therefore, the set of auditory memory traces that are specific to words produced by the talker can be linked to a set of equivalent sensory-motor exemplars or articulatory plans.

Like the direct perception approach, exemplar-based models of perception represent a radical change from past assumptions. Instead of treating stimulus variation in speech as noise to be removed in order to aid the perception of abstract units, variability is treated as inherent to the way experiences with speech are stored in memory. Variation is therefore a source of information that may be used by the perceiver depending on the demands of the listening situation. The appeal of this approach lies in its ability to account for a very wide variety of speech perception phenomena in a consistent and principled manner. Moreover, the extensive work on exemplar modeling in other areas of perception means that, unlike much of the traditional work in speech perception, the fundamentals of the approach have been worked out explicitly. However, because the approach is still in its infancy in the field of speech perception, it remains to be seen how well it performs when tested more rigorously across many different environments.
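The following sketch illustrates, in a deliberately simplified form, the kind of computation described above: similarity of a stimulus to every stored instance of every category, modulated by attention weights and by base activation levels. All of the feature dimensions, exemplar values, weights, and word categories are invented for illustration; this is not Johnson's (1997) implementation or ALCOVE itself, only a toy version of the shared logic.

```python
import math

def similarity(stimulus, exemplar, attention_weights, sensitivity=1.0):
    """Exponential-decay similarity over an attention-weighted distance."""
    distance = sum(w * abs(s - e)
                   for w, s, e in zip(attention_weights, stimulus, exemplar))
    return math.exp(-sensitivity * distance)

def categorize(stimulus, exemplars_by_category, attention_weights,
               base_activation):
    """Sum similarity to every stored exemplar of every category, scale it by
    a base activation level (e.g., reflecting frequency or contextual
    predictability), and convert the totals to choice probabilities
    (Luce, 1959)."""
    activation = {}
    for category, exemplars in exemplars_by_category.items():
        evidence = sum(similarity(stimulus, ex, attention_weights)
                       for ex in exemplars)
        activation[category] = base_activation[category] * evidence
    total = sum(activation.values())
    return {c: a / total for c, a in activation.items()}

if __name__ == "__main__":
    # Two hypothetical word categories, each represented only by stored
    # instances along two acoustic dimensions (e.g., a formant value and a
    # duration), with no abstract prototype anywhere in the model.
    exemplars = {
        "bead": [(0.80, 0.30), (0.75, 0.35), (0.82, 0.28)],
        "bid":  [(0.55, 0.25), (0.50, 0.22)],
    }
    attention = (1.0, 0.4)             # selective attention to dimension 1
    base = {"bead": 1.0, "bid": 2.0}   # "bid" assumed to be more frequent
    print(categorize((0.70, 0.30), exemplars, attention, base))
```

Raising the base activation of a category or reweighting a dimension shifts the choice probabilities without any change to the stored instances, which is how frequency, context, and selective attention effects are accommodated in this class of models.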

Conclusion
Research on human speech perception has shown that the perceptual process is highly complex, in ways that go beyond our current understanding and theoretical tools. Speech perception relies on both visual and auditory information, which are integrated as part of the perceptual process. Near-perfect performance is achieved despite an enormous amount of variability both within and across talkers and across a wide variety of environmental conditions. In the early days of speech research, it was believed that the perceptual process relied on a few invariant characteristics of the segments which differentiated larger linguistic units like words and utterances. While we may yet find higher-order relational invariants that are important for defining the linguistic categories of language, it has already been demonstrated that listeners make use of the lawful variability in the acoustic signal when perceiving speech. Variability cannot be removed, discarded, or normalized away in any psychologically plausible model of speech perception. Some of the approaches discussed above, which rely on more elaborate notions of perceptual categories and long-term encoding, incorporate the variability and non-linearity inherent in the speech signal directly into the perceptual process. These new approaches treat the speech signal as information-rich and exploit lawful variability and redundant information rather than treating these properties of speech as extraneous noise to be discarded. We believe that these new approaches to the traditional problems of invariance and non-linearity provide a solution to the previously intractable problem of perceptual constancy despite variability in the signal.

References
Abercrombie, D. (1967). Elements of general phonetics. Chicago: Aldine.
Ali, L., Gallagher, T., Goldstein, J., & Daniloff, R. (1971). Perception of coarticulated nasality. Journal of the Acoustical Society of America, 49, 538-540.
Anderson, D. (1962). The number and nature of alternatives as an index of intelligibility. The Ohio State University.
Andrews, S. (1989). Frequency and neighborhood effects on lexical access: Activation or search? Journal of Experimental Psychology: Learning, Memory & Cognition, 15, 802-814.
Bahrick, H., Bahrick, P., & Wittlinger, R. (1975). Fifty years of memory for names and faces: A cross-sectional approach. Journal of Experimental Psychology: General, 104, 54-75.
Bailey, P. J., & Summerfield, Q. (1980). Information in speech: Observations on the perception of [s]-stop clusters. Journal of Experimental Psychology, 6, 536-563.
Beckman, M., & Edwards, J. (1994). Articulatory evidence for differentiating stress categories. In P. A. Keating (Ed.), Phonological structure and phonetic form: Papers in laboratory phonology III (pp. 7-33). Cambridge, UK: Cambridge University Press.
Bever, T. G., Lackner, J., & Kirk, R. (1969). The underlying structures of sentences are the primary units of immediate speech processing. Perception & Psychophysics, 5, 191-211.
Black, J. W., & Mason, H. M. (1946). Training for voice communication. Journal of the Acoustical Society of America, 18, 441-445.
Bladon, R. A. W., Henton, C. G., & Pickering, J. B. (1984). Towards an auditory theory of speaker normalization. Language and Communication, 4, 59-69.
Blumstein, S. E., & Stevens, K. N. (1980). Perceptual invariance and onset spectra for stop consonants in different vowel environments. Journal of the Acoustical Society of America, 66, 1001-1017.
Broadbent, D. E. (1965). Information processing in the nervous system. Science, 150, 457-462.
Browman, C., & Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49, 155-180.
Browman, C. P. (1978). Tip of the tongue and slip of the ear: Implications for language processing. Ph.D. dissertation, University of California at Los Angeles, Los Angeles, CA.
Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units. Phonology, 6, 201-251.
Browman, C. P., & Goldstein, L. (1990). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 299-320.
Byrd, D. (1994). Articulatory timing in English consonant sequences. Ph.D. dissertation, University of California at Los Angeles, Los Angeles, CA.
Byrd, D. (1996). Influences on articulatory timing in consonant sequences. Journal of Phonetics, 24, 209-224.
Carterette, E., & Barneby, A. (1975). Recognition memory for voices. In A. Cohen & S. Nooteboom (Eds.), Structure and process in speech perception (pp. 246-265). New York: Springer-Verlag.
Chiba, T., & Kajiyama, M. (1941). The vowel: Its nature and structure. Tokyo: Kaiseikan.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York, NY: Harper & Row.
Chomsky, N., & Miller, G. A. (1963). Introduction to the formal analysis of natural language. In R. D. Luce, R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (pp. 269-321). New York, NY: Wiley.
Clements, G. N. (1984). Principles of tone assignment in Kikuyu. In G. N. Clements & J. Goldsmith (Eds.), Autosegmental studies in Bantu tone. Dordrecht: Foris.
Cole, R., & Cooper, W. (1975). The perception of voicing in English affricates and fricatives. Journal of the Acoustical Society of America, 58, 1280-1287.
Cole, R., Rudnicky, A., Zue, V., & Reddy, D. R. (1978). Speech patterns on paper. In R. Cole (Ed.), Perception and production of fluent speech. Hillsdale, NJ: Erlbaum.
Collier, R., & 't Hart, J. (1971). The role of intonation in speech perception. In A. Cohen & S. Nooteboom (Eds.), Structure and process in speech perception (pp. 107-123). New York: Springer-Verlag.
Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance VI (pp. 535-555). Hillsdale, NJ: Erlbaum.
Cooper, A. (1991). Laryngeal and oral gestures in English /p, t, k/. In Proceedings of the XIIth International Congress of Phonetic Sciences, Vol. 2 (pp. 50-53). Aix-en-Provence, France: University of Provence.
Creelman, C. D. (1957). The case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
Crowder, R. (1981). The role of auditory memory in speech perception and discrimination. In T. Meyers, J. Laver, & J. Anderson (Eds.), The cognitive representation of speech (pp. 167-179). New York: North-Holland.
Cummings, K. E., & Clements, M. A. (1995). Analysis of the glottal excitation of emotionally styled and stressed speech. Journal of the Acoustical Society of America, 98, 88-98.
Cutler, A. (1976). Phoneme-monitoring reaction time as a function of preceding intonation contour. Perception and Psychophysics, 20, 55-60.
Cutler, A. (1995). Spoken word recognition and production. In J. L. Miller & P. D. Eimas (Eds.), Speech, language, and communication (pp. 97-137). San Diego: Academic Press.
Cutler, A. (1997). The comparative perspective on spoken-language processing. Speech Communication, 21, 3-15.
Cutler, A., Mehler, J., Norris, D. G., & Segui, J. (1986). The syllable's differing role in the segmentation of French and English. Journal of Memory and Language, 25, 385-400.
Cutler, A., Mehler, J., Norris, D. G., & Segui, J. (1992). The monolingual nature of speech segmentation by bilinguals. Cognitive Psychology, 24, 381-410.
Cutler, A., & Norris, D. (1988). The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human Perception and Performance, 14, 113-121.
Cutting, J., & Kozlowski, L. (1977). Recognizing friends by their walk: Gait perception without familiarity cues. Bulletin of the Psychonomic Society, 9, 353-356.
Darwin, C. J. (1975). The dynamic use of prosody in speech perception. In A. Cohen & S. Nooteboom (Eds.), Structure and process in speech perception. New York: Springer-Verlag.
Darwin, C. J. (1976). The perception of speech. In E. C. Carterette & M. P. Friedman (Eds.), Handbook of perception (pp. 175-216). Heidelberg: Springer-Verlag.
de Jong, K. (1995). The supraglottal articulation of prominence in English: Linguistic stress as localized hyperarticulation. Journal of the Acoustical Society of America, 97, 491-504.
Delattre, P., & Freeman, C. D. (1968). A dialect study of American r's by X-ray motion picture. Linguistics, 44, 29-68.
Delattre, P. C., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 27, 769-773.
Eimas, P. D., & Miller, J. L. (1980). Contextual effects in infant speech perception. Science, 209, 1140-1141.
Eimas, P. D., & Nygaard, L. C. (1992). Contextual coherence and attention in phoneme monitoring. Journal of Memory and Language, 31, 375-395.
Elman, J. L. (1989). Connectionist approaches to acoustic/phonetic processing. In W. D. Marslen-Wilson (Ed.), Lexical representation and process (pp. 227-260). Cambridge, MA: MIT Press.
Elman, J. L., & McClelland, J. L. (1986). Exploiting lawful variability in the speech wave. In J. S. Perkell & D. H. Klatt (Eds.), Invariance and variability in speech processes (pp. 360-380). Hillsdale, NJ: Erlbaum.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Fant, G. (1962). Descriptive analysis of the acoustic aspects of speech. Logos, 5, 3-17.
Fant, G. (1970). Automatic recognition and speech research. Speech Transmission Laboratory, Quarterly Progress and Status Report, 1/1970, Royal Institute of Technology, Stockholm.
Flanagan, J. L. (1972). Speech analysis, synthesis and perception. New York, NY: Academic Press.
Flege, J. E., & Massey, K. P. (1980). English prevoicing: Random or controlled? Paper presented at the Linguistic Society of America, 2 August, Albuquerque, NM.
Fletcher, H. (1929). Speech and hearing. Princeton, NJ: Van Nostrand Reinhold Company.
Fodor, J. A. (1983). The modularity of mind. Cambridge, MA: MIT Press.
Forster, K. I. (1976). Accessing the mental lexicon. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms. Amsterdam: North-Holland.
Fougeron, C., & Keating, P. A. (1996). Variations in velic and lingual articulation depending on prosodic position. UCLA Working Papers in Phonetics, 92, 88-96.
Fougeron, C., & Keating, P. A. (1997). Articulatory strengthening at edges of prosodic domains. Journal of the Acoustical Society of America, 101, 3728-3740.
Fowler, C. A. (1980). Coarticulation and theories of extrinsic timing. Journal of Phonetics, 8, 113-133.
Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3-28.
Fowler, C. A. (1995). Speech production. In J. L. Miller & P. D. Eimas (Eds.), Speech, language, and communication (pp. 30-62). San Diego: Academic Press.
Fowler, C. A., & Housum, J. (1987). Talkers' signalling of new and old words in speech and listeners' perception and use of the distinction. Journal of Memory and Language, 26, 489-504.
Fowler, C. A., & Rosenblum, L. D. (1990). Duplex perception: A comparison of monosyllables and slamming doors. Journal of Experimental Psychology: Human Perception and Performance, 16(4), 742-754.
Fowler, C. A., & Rosenblum, L. D. (1991). The perception of phonetic gestures. In I. Mattingly & M. Studdert-Kennedy (Eds.), Modularity and the motor theory of speech perception (pp. 33-59). Hillsdale, NJ: Erlbaum.
Fowler, C. A., & Smith, M. R. (1986). Speech perception as vector analysis: An approach to the problems of segmentation and invariance. In J. Perkell & D. H. Klatt (Eds.), Invariance and variability of speech processes. Hillsdale, NJ: Erlbaum.
Frazier, L. (1995). Issues of representation in psycholinguistics. In J. L. Miller & P. D. Eimas (Eds.), Speech, language, and communication (pp. 1-29). San Diego: Academic Press.
French, N. R., & Steinberg, J. C. (1947). Factors governing the intelligibility of speech sounds. Journal of the Acoustical Society of America, 19, 90-119.
Fromkin, V. (1965). Some phonetic specifications of linguistic units: An electromyographic investigation. Ph.D. dissertation, University of California at Los Angeles, Los Angeles, CA.
Fujimura, O., Macchi, M. J., & Streeter, L. A. (1978). Perception of stop consonants with conflicting transitional cues: A cross-linguistic study. Language and Speech, 21, 337-345.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology, 6, 110-125.
Garnes, S., & Bond, Z. (1980). A slip of the ear: A snip of the ear? A slip of the year? In V. Fromkin (Ed.), Errors in linguistic performance: Slips of the tongue, ear, pen, hand. New York, NY: Academic Press.
Gay, T. (1978). Effect of speaking rate on vowel formant movements. Journal of the Acoustical Society of America, 63, 223-230.
Gerstman, L. (1968). Classification of self-normalized vowels. IEEE Transactions on Audio and Electroacoustics, AU-16, 78-80.
Gibson, J. J. (1966). The senses considered as perceptual systems. Boston, MA: Houghton-Mifflin.
Goldinger, S. D. (1990). Neighborhood density effects for high frequency words: Evidence for activation-based models of word recognition. Research on Speech Perception: Progress Report, 15, 163-186.
Goldinger, S. D. (1997). Words and voices: Perception and production in an episodic lexicon. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 33-66). San Diego: Academic Press.
Goldinger, S. D., Pisoni, D. B., & Logan, J. S. (1991). On the locus of talker variability effects in recall of spoken word lists. Journal of Experimental Psychology, 17, 152-162.
Goldinger, S. D., Pisoni, D. B., & Luce, P. A. (1996). Speech perception and spoken word recognition: Research and theory. In N. J. Lass (Ed.), Principles in experimental phonetics (pp. 277-327). St. Louis: Mosby.
Goldsmith, J. A. (1990). Autosegmental and metrical phonology. Oxford, UK: Basil Blackwell.
Hagiwara, R. (1995). Acoustic realizations of American /r/ as produced by women and men. Ph.D. dissertation, University of California at Los Angeles, Los Angeles, CA.
Halle, M. (1985). Speculations about the representation of words in memory. In V. Fromkin (Ed.), Phonetic linguistics (pp. 101-114). New York, NY: Academic Press.
Harris, K. S. (1958). Cues for the discrimination of American English fricatives in spoken syllables. Language and Speech, 1, 1-7.
Hawkins, S., & Stevens, K. N. (1985). Acoustic and perceptual correlates of the nonnasal-nasal distinction for vowels. Journal of the Acoustical Society of America, 77, 1560-1575.
Heinz, J. M., & Stevens, K. N. (1961). On the properties of voiceless fricative consonants. Journal of the Acoustical Society of America, 33, 589-596.
Henton, C. G. (1988). Creak as a sociophonetic marker. In L. Hyman & C. Li (Eds.), Language, speech and mind: Studies in honour of Victoria A. Fromkin (pp. 3-29). London: Routledge.
Henton, C. G., & Bladon, R. A. W. (1985). Breathiness in normal female speech: Inefficiency versus desirability. Language and Communication, 5, 221-227.
Hintzman, D. L. (1986). Schema abstraction in a multiple-trace memory model. Psychological Review, 93, 411-423.
Hockett, C. (1955). Manual of phonology. Bloomington, IN: Indiana University Press.
Hollien, H., Majewski, W., & Docherty, E. T. (1982). Perceptual identification of voices under normal, stress, and disguise speaking conditions. Journal of Phonetics, 10, 139-148.
House, A. S. (1957). Analog studies of nasal consonants. Journal of Speech and Hearing Research, 22, 190-204.
Hudak, T. J. (1987). Thai. In B. Comrie (Ed.), The world's major languages (pp. 757-776). Oxford, UK: Oxford University Press.
Ingemann, F. (1968). Identification of the speaker's sex from voiceless fricatives. Journal of the Acoustical Society of America, 44, 1142-1144.
Jacoby, L. L., & Brooks, L. R. (1984). Nonanalytic cognition: Memory, perception, and concept learning. In G. Bower (Ed.), The psychology of learning and motivation. Orlando, FL: Academic Press.
Jakobson, R., Fant, G., & Halle, M. (1952). Preliminaries to speech analysis. Cambridge, MA: M.I.T. Acoustics Laboratory.
Jelinek, F. (1982). The development of an experimental discrete dictation recognizer. Proceedings of the IEEE, 73, 1616-1624.
Johnson, K. (1997). Speech perception without speaker normalization. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 145-165). San Diego: Academic Press.
Johnson, K., Ladefoged, P., & Lindau, M. (1993). Individual differences in vowel production. Journal of the Acoustical Society of America, 94, 701-714.
Joos, M. (1948). Acoustic phonetics. Language, Suppl. 24, 1-136.
Jun, S. A. (1993). The phonetics and phonology of Korean prosody. Ph.D. dissertation, The Ohio State University, Columbus, OH.
Kewley-Port, D. (1982). Measurement of formant transitions in naturally produced consonant-vowel syllables. Journal of the Acoustical Society of America, 73, 322-335.
Kewley-Port, D., Pisoni, D. B., & Studdert-Kennedy, M. (1983). Perception of static and dynamic acoustic cues to place of articulation in initial stop consonants. Journal of the Acoustical Society of America, 73, 1779-1793.
Kirk, P. J., Ladefoged, J., & Ladefoged, P. (1993). Quantifying acoustic properties of modal, breathy and creaky vowels in Jalapa Mazatec. In A. Mattina & T. Montler (Eds.), American Indian linguistics and ethnography in honor of Laurence C. Thompson (pp. 435-450). University of Montana.
Klatt, D. H. (1975). Voice onset time, frication, and aspiration in word-initial consonant clusters. Journal of Speech and Hearing Research, 18, 686-706.
Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208-1221.
Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7, 279-312.
Klatt, D. H. (1989). Review of selected models of speech perception. In W. D. Marslen-Wilson (Ed.), Lexical representation and process (pp. 169-226). Cambridge, MA: MIT Press.
Klatt, D. H., & Cooper, W. E. (1975). Perception of segment duration in sentence contexts. In A. Cohen & S. G. Nooteboom (Eds.), Structure and process in speech perception. New York, NY: Springer-Verlag.
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87, 820-857.
Knudsen, V. O. (1929). The hearing of speech in auditoriums. Journal of the Acoustical Society of America, 1, 56-82.
Krakow, R. A. (1989). The articulatory organization of syllables: A kinematic analysis of labial and velar gestures. Ph.D. dissertation, Yale University, New Haven, CT.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
Kuehn, D. P., & Moll, K. L. (1973). A cineradiographic study of VC and CV articulatory velocities. Journal of Phonetics, 4, 303-320.
Kuhl, P. K. (1991). Human adults and human infants show a perceptual magnet effect for the prototypes of speech categories, monkeys do not. Perception & Psychophysics, 50, 93-107.
Kurowski, K., & Blumstein, S. E. (1984). Perceptual integration of the murmur and formant transitions for place of articulation in nasal consonants. Journal of the Acoustical Society of America, 76, 383-390.
Ladefoged, P. (1968). Three areas of experimental phonetics. Oxford, UK: Oxford University Press.
Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98-104.
Ladefoged, P., & Maddieson, I. (1996). The sounds of the world's languages. Oxford, UK: Blackwell.
Ladefoged, P., Maddieson, I., & Jackson, M. T. T. (1988). Investigating phonation types in different languages. In O. Fujimura (Ed.), Vocal physiology: Voice production, mechanisms and functions (pp. 297-317). New York: Raven.
Landauer, T. K., & Streeter, L. A. (1973). Structural differences between common and rare words: Failure of equivalence assumptions for theories of word recognition. Journal of Verbal Learning and Verbal Behavior, 12, 119-131.
Lehiste, I. (1970). Suprasegmentals. Cambridge: MIT Press.
Lehiste, I. (1976). Role of duration in disambiguating syntactically ambiguous sentences. Journal of the Acoustical Society of America, 60, 1199-1202.
Liberman, A. L. (1957). Some results of research on speech perception. Journal of the Acoustical Society of America, 29, 117-123.
Liberman, A. L., Cooper, F. S., Harris, K. S., & MacNeilage, P. F. (1963). A motor theory of speech perception. In G. Fant (Ed.), Proceedings of the speech perception seminar, Stockholm, 1962. Stockholm: Royal Institute of Technology, Speech Transmission Laboratory.
Liberman, A. L., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, A. L., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
Liberman, A. M., Delattre, P. C., & Cooper, F. S. (1952). The role of selected stimulus-variables in the perception of unvoiced stops. American Journal of Psychology, 65, 497-516.
Liberman, A. M., Delattre, P. C., Cooper, F. S., & Gerstman, L. J. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs: General and Applied, 68, 1-13.
Liberman, A. M., Delattre, P. C., & Gerstman, L. J. (1956). Tempo of frequency change as a cue for distinguishing classes of speech sounds. Journal of Experimental Psychology, 52, 127-137.
Liberman, A. M., & Mattingly, I. G. (1989). A specialization for speech perception. Science, 243, 489-494.
Lieberman, P. (1963). Some effects of semantic and grammatical context on the production and perception of speech. Language and Speech, 6, 172-187.
Lindblom, B. (1963). Spectrographic study of vowel reduction. Journal of the Acoustical Society of America, 35, 1773-1781.
Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H and H theory. In W. Hardcastle & A. Marchal (Eds.), Speech production and speech modelling (pp. 403-439). Dordrecht: Kluwer.
Lindblom, B., & Studdert-Kennedy, M. (1967). On the role of formant transitions in vowel recognition. Journal of the Acoustical Society of America, 42, 830-843.
Lindblom, B., & Svensson, S. G. (1973). Interaction between segmental and nonsegmental factors in speech recognition. IEEE Transactions on Audio and Electroacoustics, AU-21, 536-545.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustic measurements. Word, 20, 384-422.
Lively, S. E., Pisoni, D. B., & Goldinger, S. D. (1994). Spoken word recognition: Research and theory. In M. A. Gernsbacher (Ed.), Handbook of psycholinguistics (pp. 265-301). New York: Academic Press.
Luce, P. A. (1986). Neighborhoods of words in the mental lexicon. Ph.D. dissertation, Indiana University, Bloomington, IN.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Mack, M., & Blumstein, S. E. (1983). Further evidence of acoustic invariance in speech production: The stop-glide contrast. Journal of the Acoustical Society of America, 73, 1739-1750.
MacNeilage, P. F. (1970). Motor control of serial ordering of speech. Psychological Review, 77, 182-196.
Maddieson, I. (1984). Patterns of sound. Cambridge: Cambridge University Press.
Maeda, S. (1976). A characterization of American English intonation. Ph.D. dissertation, MIT.
Malécot, A. (1956). Acoustic cues for nasal consonants. Language, 32, 274-278.
Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions during word-recognition in continuous speech. Cognitive Psychology, 10, 29-63.
Martin, C. S., Mullennix, J. W., Pisoni, D. B., & Sommers, M. S. (1989). Effects of talker variability on recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory and Cognition, 15, 676-684.
Massaro, D. W. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. Hillsdale, NJ: Erlbaum.
Massaro, D. W. (1989). Multiple book review of Speech perception by ear and eye: A paradigm for psychological inquiry. The Behavioral and Brain Sciences, 12, 741-755.
Massaro, D. W., & Cohen, M. M. (1983). Evaluation and integration of visual and auditory information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 9, 753-771.
McCawley, J. D. (1968). The phonological component of the Japanese grammar. The Hague: Mouton.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Mehler, J. (1981). The role of syllables in speech processing. Philosophical Transactions of the Royal Society of London, Series B, 295, 333-352.
Mehler, J., Dommergues, U., Frauenfelder, U., & Segui, J. (1981). The syllable's role in speech segmentation. Journal of Verbal Learning and Verbal Behavior, 20, 298-305.
Miller, G. A. (1962). Decision units in the perception of speech. IRE Transactions on Information Theory, 81-83.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41, 329-335.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338-352.
Miller, J. L. (1981). Effects of speaking rate on segmental distinctions. In P. D. Eimas & J. L. Miller (Eds.), Perspectives on the study of speech (pp. 39-74). Hillsdale, NJ: Erlbaum.
Miller, J. L. (1987). Rate-dependent processing in speech perception. In A. Ellis (Ed.), Progress in the psychology of language. Hillsdale, NJ: Erlbaum.
Miller, J. L. (1990). Speech perception. In D. N. Osherson & H. Lasnik (Eds.), An invitation to cognitive science (pp. 69-93). Cambridge, MA: MIT Press.
Miller, J. L., & Baer, T. (1983). Some effects of speaking rate on the production of /b/ and /w/. Journal of the Acoustical Society of America, 73, 1751-1755.
Miller, J. L., Green, K. P., & Reeves, A. (1986). Speaking rate and segments: A look at the relation between speech production and perception for the voicing contrast. Phonetica, 43, 106-115.
Miller, J. L., Grosjean, F., & Lomanto, C. (1984). Articulation rate and its variability in spontaneous speech: A reanalysis and some implications. Phonetica, 41, 215-225.
Miller, J. L., & Liberman, A. M. (1979). Some effects of later occurring information on the perception of stop consonant and semivowel. Perception & Psychophysics, 25, 457-465.
Neary, T. M. (1990). The segment as a unit of speech perception. Journal of Phonetics, 18, 347-373.
Neary, T. M. (1992). Context effects in a double-weak theory of speech perception. Language and Speech, 35, 153-172.
Neary, T. M. (1997). Speech perception as pattern recognition. Journal of the Acoustical Society of America, 101(6), 3241-3254.
Nguyen, D. (1987). Vietnamese. In B. Comrie (Ed.), The world's major languages (pp. 777-796). Oxford, UK: Oxford University Press.
Norris, D. G. (1991). Rewriting lexical networks on the fly. In Proceedings of EUROSPEECH 91, Genoa, 1 (pp. 117-120).
Norris, D. G. (1994). SHORTLIST: A connectionist model of continuous speech recognition. Cognition, 52, 189-234.
Norris, D. G., & Cutler, A. (1988). The relative accessibility of phonemes and syllables. Perception & Psychophysics, 43, 541-550.
Norris, D., & Cutler, A. (1995). Competition and segmentation in spoken-word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition, 21(5), 1209-1228.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39-57.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 700-708.
Nosofsky, R. M. (1991). Tests of an exemplar model for relating perceptual classification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 3-27.
Nosofsky, R. M., Kruschke, J. K., & McKinley, S. C. (1992). Combining exemplar-based category representations and connectionist learning rules. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 211-233.
Nusbaum, H. C., Schwab, E. C., & Sawusch, J. R. (1983). The role of chirp identification in duplex perception. Perception and Psychophysics, 33, 323-332.
Nygaard, L. C., & Pisoni, D. B. (1995). Speech perception: New directions in research and theory. In J. L. Miller & P. D. Eimas (Eds.), Speech, language, and communication (pp. 63-96). San Diego: Academic Press.
Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1995). Effects of stimulus variability on perception and representation of spoken words in memory. Perception & Psychophysics, 57, 989-1001.
O'Connor, J. D., Gerstman, L. J., Liberman, A. M., Delattre, P. C., & Cooper, F. S. (1957). Acoustic cues for the perception of initial /w, j, r, l/ in English. Word, 13, 22-43.
Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
Palmeri, T. J., Goldinger, S. D., & Pisoni, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory and Cognition, 19, 1-20.
Papcun, G., Kreiman, J., & Davis, A. (1989). Long-term memory for unfamiliar voices. Journal of the Acoustical Society of America, 85, 913-925.
Pastore, R. E., Schmuckler, M. A., Rosenblum, L., & Szczesiul, R. (1983). Duplex perception with musical stimuli. Perception and Psychophysics, 33, 469-474.
Peters, R. W. (1955). The relative intelligibility of single-voice messages under various conditions of noise. Joint Report No. 56, U.S. Naval School of Aviation Medicine (pp. 1-9). Pensacola, FL.
Pierrehumbert, J., & Beckman, M. (1988). Japanese tone structure. Cambridge, MA: MIT Press.
Pisoni, D. B. (1977). Identification and discrimination of the relative onset of two component tones: Implications for voicing perception in stops. Journal of the Acoustical Society of America, 61, 1352-1361.
Pisoni, D. B. (1978). Speech perception. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (pp. 167-233). Hillsdale, NJ: Erlbaum.
Pisoni, D. B. (1990). Effects of talker variability on speech perception: Implications for current research and theory. In Proceedings of the 1992 International Conference on Spoken Language Processing (pp. 587-590). Banff, Canada.
Pisoni, D. B. (1997). Some thoughts on normalization in speech perception. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 9-32). San Diego: Academic Press.
Pollack, I., & Pickett, J. M. (1964). The intelligibility of excerpts from conversations. Language and Speech, 6, 165-171.
Rand, T. C. (1974). Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55, 678-680.
Remez, R. E. (1986). Realism, language, and another barrier. Journal of Phonetics, 14, 89-97.
Remez, R. E. (1987). Units of organization and analysis in the perception of speech. In M. E. H. Schouten (Ed.), The psychophysics of speech perception (pp. 419-432). Dordrecht: Martinus Nijhoff.
Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (1994). On the perceptual organization of speech. Psychological Review, 101(1), 129-156.
Repp, B. H. (1979). Relative amplitude of aspiration noise as a voicing cue for syllable-initial stop consonants. Language and Speech, 22, 173-189.
Repp, B. H. (1982). Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychological Bulletin, 92, 81-110.
Repp, B. H. (1983a). Categorical perception: Issues, methods, findings. Speech and Language: Advances in Basic Research and Practice, 10.
Repp, B. H. (1983b). Bidirectional contrast effects in the perception of VC-CV sequences. Perception and Psychophysics, 33, 147-155.
Repp, B. H., & Mann, V. A. (1981). Perceptual assessment of fricative-stop coarticulation. Journal of the Acoustical Society of America, 69, 1154-1163.
Salasoo, A., & Pisoni, D. B. (1985). Interaction of knowledge sources in spoken word identification. Journal of Memory and Language, 24, 210-231.
Samuel, A. G. (1981). The role of bottom-up confirmation in the phonemic restoration illusion. Journal of Experimental Psychology: Human Perception and Performance, 7, 1124-1131.
Samuel, A. G. (1982). Phonetic prototypes. Perception & Psychophysics, 31, 307-314.
Schadle, C. (1985). The acoustics of fricative consonants. Cambridge, MA: MIT Press.
Schwartz, M. F. (1968). Identification of speaker sex from isolated, voiceless fricatives. Journal of the Acoustical Society of America, 43, 1178-1179.
Segui, J. (1984). The syllable: A basic perceptual unit in speech processing. In H. Bouma & D. G. Bouwhuis (Eds.), Attention and performance X: Control of language processes. Hillsdale, NJ: Erlbaum.
Shinn, P., & Blumstein, S. E. (1984). On the role of the amplitude envelope for the perception of [b] and [w]. Journal of the Acoustical Society of America, 75, 1243-1252.
Silverman, D. (1997). Pitch discrimination during breathy versus modal phonation (final results). Journal of the Acoustical Society of America, 102(5), 3204.
Skinner, T. E. (1977). Speaker invariant characteristics of vowels, liquids, and glides using relative formant frequencies. Journal of the Acoustical Society of America, 62(S1), S5.
Soli, S. D. (1982). Structure and duration of vowels together specify fricative voicing. Journal of the Acoustical Society of America, 72, 366-378.
Sommers, M. S., Nygaard, L. C., & Pisoni, D. B. (1992). The effects of speaking rate and amplitude variability on perceptual identification. Journal of the Acoustical Society of America, 91, 2340.
Steinberg, J. C. (1929). Effects of distortion on telephone quality. Journal of the Acoustical Society of America, 1, 121-137.
Steriade, D. (1993). Closure release and nasal contours. In R. Krakow & M. Huffman (Eds.), Nasals, nasalization, and the velum (pp. 401-470). San Diego: Academic Press.
Stevens, K. N. (1972). Sources of inter- and intra-speaker variability in the acoustic properties of speech sounds. In A. Rigault & R. Charbonneau (Eds.), Proceedings of the 7th International Congress of Phonetic Sciences (pp. 206-232). The Hague: Mouton.
Stevens, K. N., & Blumstein, S. E. (1981). The search for invariant acoustic correlates of phonetic features. In P. D. Eimas & J. L. Miller (Eds.), Perspectives on the study of speech (pp. 1-38). Hillsdale, NJ: Erlbaum.
Stevens, K. N., & House, A. S. (1955). Development of a quantitative description of vowel articulation. Journal of the Acoustical Society of America, 27, 484-493.
Stevens, K. N., & House, A. S. (1963). Perturbation of vowel articulations by consonantal context: An acoustic study. Journal of Speech and Hearing Research, 6, 111-128.
Strange, W., Jenkins, J. J., & Johnson, T. L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695-705.
Studdert-Kennedy, M. (1974). The perception of speech. In T. A. Sebeok (Ed.), Current trends in linguistics (pp. 2349-2385). The Hague: Mouton.
Studdert-Kennedy, M. (1976). Speech perception. In N. J. Lass (Ed.), Contemporary issues in experimental linguistics (pp. 213-293). New York: Academic Press.
Studdert-Kennedy, M. (1980). Speech perception. Language and Speech, 23, 45-65.
Studdert-Kennedy, M. (1982). On the dissociation of auditory and phonetic perception. In R. Carlson & B. Granstrom (Eds.), The representation of speech in the peripheral auditory system. Elsevier Biomedical Press.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212-215.
Summerfield, Q. (1975). Acoustic and phonetic components of the influence of voice changes and identification times for CVC syllables. Report of Speech Research in Progress, 4. Queen's University of Belfast.
Summerfield, Q. (1981). On articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7, 1074-1095.
Summerfield, Q., & Haggard, M. P. (1973). Vocal tract normalisation as demonstrated by reaction time. Report of Speech Research in Progress, 2. Queen's University of Belfast.
Sussman, H. M. (1988). The neurogenesis of phonology. In H. A. Whitaker (Ed.), Phonological processes and brain mechanisms (pp. 1-23). New York: Springer-Verlag.
Sussman, H. M. (1989). Neural coding of relational invariance in speech: Human language analog to the barn owl. Psychological Review, 96, 631-642.
Sussman, H. M., McCaffrey, H. A., & Matthews, S. A. (1991). An investigation of locus equations as a source of relational invariance for stop place categorization. Journal of the Acoustical Society of America, 90, 1256-1268.
Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086-1100.
Townsend, J. T. (1989). Winning 20 questions with mathematical models. The Behavioral and Brain Sciences, 12, 775-776.
Treisman, M. (1971). On the word frequency effect: Comments on the papers by J. Catlin and L. H. Nakatani. Psychological Review, 17, 37-59.
Varya, M., & Fowler, C. A. (1992). Declination of supralaryngeal gestures in spoken Italian. Phonetica, 49, 48-60.
Vaissière, J. (1986). Comment on Abbs's paper. In J. S. Perkell & D. H. Klatt (Eds.), Invariance and variability in speech processes (pp. 1205-1216). Hillsdale, NJ: Erlbaum.
Vaissière, J. (1988). Prediction of velum movement from phonological specifications. Phonetica, 45, 122-139.
Walley, A., & Carrell, T. (1983). Onset spectra and formant transitions in the adult's and child's perception of place of articulation in initial stop consonants. Journal of the Acoustical Society of America, 73, 1011-1022.
Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167, 392-393.
Whalen, D. H., & Liberman, A. M. (1987). Speech perception takes precedence over non-speech perception. Science, 237, 169-171.
Wieringen, A. van (1995). Perceiving dynamic speechlike sounds. Ph.D. dissertation, University of Amsterdam.
Wright, R. (1996). Consonant clusters and cue preservation in Tsou. Ph.D. dissertation, University of California at Los Angeles, Los Angeles, CA.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.