
MIT OpenCourseWare

http://ocw.mit.edu
MAS.632 Conversational Computer Systems
Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
1
Speech as Communication
Speech can be viewed in many ways. Although chapters of this book focus on specific aspects of speech and the computer technologies that utilize speech, the reader should begin with a broad perspective on the role of speech in our daily lives. It is essential to appreciate the range of capabilities that conversational systems must possess before attempting to build them. This chapter lays the groundwork for the entire book by presenting several perspectives on speech communication.
The first section of this chapter emphasizes the interactive and expressive role of voice communication. Except in formal circumstances such as lectures and dramatic performances, speech occurs in the context of a conversation, wherein participants take turns speaking, interrupt each other, nod in agreement, or try to change the topic. Computer systems that talk or listen may ultimately be judged by their ability to converse in like manner simply because conversation permeates human experience. The second section discusses the various components or layers of a conversation. Although the distinctions between these layers are somewhat contrived, they provide a means of analyzing the communication process; research disciplines have evolved for the study of each of these components. Finally, the last section introduces the representations of speech and conversation, corresponding in part to the layers identified in the second section. These representations provide abstractions that a computer program may employ to engage in a conversation with a human.
VOICE COMMUNICATION WITH COMPUTERS
SPEECH AS CONVERSATION
Conversation is a process involving multiple participants, shared knowledge, and a protocol for taking turns and providing mutual feedback. Voice is our primary channel of interaction in conversation, and speech evolved in humans in response to the species' need to communicate. It is hard to imagine many uses of speech that do not involve some interchange between multiple participants in a conversation; if we are discovered talking to ourselves, we usually feel embarrassed.
For people of normal physical and mental ability, speech is both rich in expressiveness and easy to use. We learn it without much apparent effort as children and employ it spontaneously on a daily basis.¹ People employ many layers of knowledge and sophisticated protocols while having a conversation; until we attempt to analyze dialogues, we are unaware of the complexity of this interplay between parties.
Although much is known about language, study of interactive speech communication has begun only recently. Considerable research has been done on natural language processing systems, but much of this is based on keyboard input. It is important to note the contrast between written and spoken language and between read or rehearsed speech and spontaneous utterances. Spoken language is less formal than written language, and errors in construction of spoken sentences are less objectionable. Spontaneous speech shows much evidence of the real-time processes associated with its production, including false starts, nonspeech noises such as mouth clicks and breath sounds, and pauses either silent or filled ("... um ...") [Zue et al. 1989b]. In addition, speech naturally conveys intonational and emotional information that fiction writers and playwrights must struggle to impart to written language.
Speech is rich in interactive techniques to guarantee that the listener understands what is being expressed, including facial expressions, physical and vocal gestures, "uh-huhs," and the like. At certain points in a conversation, it is appropriate for the listener to begin speaking; these points are often indicated by longer pauses and lengthened final syllables or marked decreases in pitch at the end of a sentence. Each round of speech by one person is called a turn; interruption occurs when a participant speaks before a break point offered by the talker. Instead of taking a turn, the listener may quickly indicate agreement with a word or two, a nonverbal sound ("uh-huh"), or a facial gesture. Such responses, called backchannels, speed the exchange and result in more effective conversations [Kraut et al. 1982].²
Because of these interactive characteristics, speech is used for immediate communication needs, while writing often implies a distance, either in time or space, between the author and reader. Speech is used in transitory interactions or situations in which the process of the interaction may be as important as its result. For example, the agenda for a meeting is likely to be written, and a written summary or minutes may be issued "for the record," but the actual decisions are made during a conversation. Chapanis and his colleagues arranged a series of experiments to compare the effectiveness of several communication media, i.e., voice, video, handwriting, and typewriting, either alone or in combination, for problem-solving tasks [Ochsman and Chapanis 1974]. Their findings indicated an overwhelming contribution of voice for such interactions. Any experimental condition that included voice was superior to any excluding voice; the inclusion of other media with voice resulted in only a small additional effectiveness. Although these experiments were simplistic in their use of student subjects and invented tasks, and more recent work by others [Minneman and Bly 1991] clarifies a role for video interaction, the dominance of voice seems unassailable.

¹For a person with normal speech and hearing to spend a day without speaking is quite a novel experience.
²We will return to these topics in Chapter 9.
But conversation is more than mere interaction; communication often serves a purpose of changing or influencing the parties speaking to each other. I tell you something I have learned with the intention that you share my knowledge and hence enhance your view of the world. Or I wish to obtain some information from you, so I ask you a question, hoping to elicit a reply. Or perhaps I seek to convince you to perform some activity for me; this may be satisfied either by your physical performance of the requested action or by your spoken promise to perform the act at a later time. "Speech Act" theories (to be discussed in more detail in Chapter 9) attempt to explain language as action, e.g., to request, command, query, and promise, as well as to inform.
The intention behind an utterance may not be explicit. For example, "Can you pass the salt?" is not a query about one's ability; it is a request. Many actual conversations resist such purposeful classifications. Some utterances ("go ahead," "uh-huh," "just a moment") exist only to guide the flow of the conversation or comment on the state of the discourse, rather than to convey information. Directly purposeful requests are often phrased in a manner allowing flexibility of interpretation and response. This looseness is important to the process of people defining and maintaining their work roles with respect to each other and establishing socially comfortable relationships in a hierarchical organization. The richness of speech allows a wide range of "acceptance" and "agreement," from wholehearted to skeptical to incredulous.
Speech also serves a strong social function among individuals and is often used just to pass the time, tell jokes, or talk about the weather. Indeed, extended periods of silence among a group may be associated with interpersonal awkwardness or discomfort. Sometimes the actual occurrence of the conversation serves a more significant purpose than any of the topics under discussion. Speech may be used to call attention to oneself in a social setting or as an exclamation of surprise or dismay in which an utterance has little meaning with respect to any preceding conversation [Goffman 1981].
The expressiveness of speech and robustness of conversation strongly support the use of speech in computer systems, both for stored voice as a data type and for speech as a medium of interaction. Unfortunately, current computers are capable of uttering only short sentences of marginal intelligibility and occasionally recognizing single words. Engaging a computer in a conversation can be like an interaction in a foreign country. One studies the phrase book, utters a request, and in return receives either a blank stare (wrong pronunciation, try again) or a torrent of fluent speech in which one cannot perceive even the word boundaries.
However, limitations in technology only reinforce the need to take advantage of conversational techniques to ensure that the user is understood. Users will judge the performance of computer systems employing speech on the basis of their expectations about conversation developed from years of experience speaking with fellow humans. Users may expect computers to be either deaf and dumb, or, once they realize the system can talk and listen, expect it to speak fluently like you and me. Since the capabilities of current speech technology lie between these extremes, building effective conversational computer systems can be very frustrating.
HIERARCHICAL STRUCTURE OF CONVERSATION
A more analytic approach to speech communication reveals a number of different ways of describing what actually occurs when we speak. The hierarchical structure of such analysis suggests goals to be attained at various stages in computer-based speech communication.
Conversation requires apparatus both for listening and speaking. Effective communication invokes mental processes employing the mouth and ears to convey a message thoroughly and reliably. There are many layers at which we can analyze the communication process, from the lower layers, where speech is considered primarily acoustically, to higher layers that express meaning and intention. Each layer involves increased knowledge and potential for intelligence and interactivity.
From the point of view of the speaker, we may look at speech from at least eight layers of processing, as shown in Figure 1.1.
Layers of Speech Processing

discourse  The regulation of conversation for pragmatic ends. This includes taking turns talking, the history of referents in a conversation so pronouns can refer to words spoken earlier, and the process of introducing new topics.

pragmatics  The intent or motivation for an utterance. This is the underlying reason the utterance was spoken.

semantics  The meaning of the words individually and their meaning as combined in a particular sentence.

syntax  The rules governing the combination of words in a sentence, their parts of speech, and their forms, such as case and number.
Figure 1.1. A layered view of speech communication. [The figure shows parallel speaker and listener columns of layers, from discourse at the top down through the pragmatic, semantic, syntactic, lexical, and phonetic layers to the articulatory (speaker) or perceptual (listener) layer and the acoustic layer at the bottom.]
lexical  The set of words in a language, the rules for forming new words from affixes (prefixes and suffixes), and the stress ("accent") of syllables within the words.

phonetics  The series of sounds that uniquely convey the series of words in the sentence.

articulation  The motions or configurations of the vocal tract that produce the sounds, e.g., the tongue touching the lips or the vocal cords vibrating.

acoustics  The realization of the string of phonemes in the sentence as vibrations of air molecules to produce pressure waves, i.e., sound.
Consider two hikers walking through the forest when one hiker's shoelace becomes untied. The other hiker sees this and says, "Hey, you're going to trip on your shoelace." The listener then ties the shoelace. We can consider this utterance at each layer of description.
Discourse analysis reveals that "Hey" serves to call attention to the urgency of the message and probably indicates the introduction of a new topic of conversation. It is probably spoken in a raised tone, and an observer would reasonably expect the listener to acknowledge this utterance, either with a vocal response or by tying the shoe. Experience with discourse indicates that this is an appropriate interruption or initiation of a conversation, at least under some circumstances. Discourse structure may help the listener understand that subsequent utterances refer to the shoelace instead of the difficulty of the terrain on which the conversants are traveling.
In terms of pragmatics, the speaker's intent is to warn the listener against tripping; presumably the speaker does not wish the listener to fall. But this utterance might also have been a ruse intended to get the listener to look down for the sake of playing a trick. We cannot differentiate these possibilities without knowing more about the context in which the sentence was spoken and the relationship between the conversants.
From a semantics standpoint, the sentence is about certain objects in the world: the listener, hiking, an article of clothing worn on the foot, and especially the string by which the boot is held on. The concern at the semantic layer is how the words refer to the world and what states of affairs they describe or predict. In this case, the meaning has to do with an animate entity ("you") performing some physical action ("tripping"), and the use of future tense indicates that the talker is making a prediction of something not currently taking place. Not all words refer to specific subjects; in the example, "Hey" serves to attract attention but has no innate meaning.
Syntax is concerned with how the words fit together into the structure of the sentence. This includes the ordering of parts of speech (nouns, verbs, adjectives) and relations between words and the words that modify them. Syntax indicates that the correct word order is subject followed by verb, and syntax forces agreement of number, person, and case of the various words in the sentence. "You is going to ..." is syntactically ill formed. Because the subject of the example is "you," the associated form of the verb "to be" is "are." The chosen verb form also indicates a future tense.
Lexical analysis tells us that "shoelace" comes from the root words "shoe" and "lace" and that the first syllable is stressed. Lexical analysis also identifies a set of definitions for each word taken in isolation. "Trip," for example, could be the act of falling or it could refer to a journey. Syntax reveals which definition is appropriate, as each is associated with the word "trip" used as a different part of speech. In the example, "trip" is used as a verb and so refers to falling.
The phonemic layer is concerned with the string of phonemes of which the words are composed. Phonemes are the speech sounds that form the words of any language.³ Phonemes include all the sounds associated with vowels and consonants. A grunt, growl, hiss, or gargling sound is not a phoneme in English, so it cannot be part of a word; such sounds are not referred to as speech. At the phoneme layer, while talking we are either continuously producing speech sounds or are silent. We are not silent at word boundaries; the phonemes all run together.
At the articulatory layer, the speaker makes a series of vocal gestures to produce the sounds that make up the phonemes. These sounds are created by a noise source at some location in the vocal tract, which is then modified by the configuration of the rest of the vocal tract. For example, to produce a "b" sound, the lips are first closed and air pressure from the lungs is built up behind them. A sudden release of air between the lips accompanied by vibration of the vocal cords produces the "b" sound. An "s" sound, by comparison, is produced by turbulence caused as a stream of air rushes through a constriction formed by the tongue and the roof of the mouth. The mouth can also be used to create nonspeech sounds, such as sighs and grunts.

³A more rigorous definition will be given in the next section.
Finally, the acoustics of the utterance is its nature as sound. Sound is transmitted as variations in air pressure over time; sound can be converted to an electrical signal by a microphone and represented as an electrical waveform. We can also analyze sound by converting the waveform to a spectrogram, which displays the various frequency components present in the sound. At the acoustic layer, speech is just another sound, like the wind in the trees or a jet plane flying overhead.
From the perspective of the listener, the articulatory layer is replaced by a perceptual layer, which comprises the processes whereby sound (variations in air pressure over time) is converted to neural signals in the ear and ultimately interpreted as speech sounds in the brain. It is important to keep in mind that the hearer can directly sense only the acoustic layer of speech. If we send an electric signal representing the speech waveform over a telephone line and convert this signal to sound at the other end, the listening party can understand the speech. Therefore, the acoustic layer alone must contain all the information necessary to understand the speaker's intent, but it can be represented at the various layers as part of the process of understanding.
This layered approach is actually more descriptive than analytic in terms of human cognitive processes. The distinctions between the layers are fuzzy, and there is little evidence that humans actually organize discourse production into such layers. Intonation is interpreted in parallel at all these layers and thus illustrates the lack of sharp boundaries or sequential processing among them. At the pragmatic layer, intonation differentiates the simple question from exaggerated disbelief; the same words spoken with different intonation can have totally different meaning. At the syntactic layer, intonation is a cue to phrase boundaries. Intonation can differentiate the noun and verb forms of some words (e.g., conduct, convict) at the lexical layer by conveying lexical stress. Intonation is not phonemic in English, but in some other languages a change in pitch does indicate a different word for the otherwise identical articulation. And intonation is articulated and realized acoustically in part as the fundamental frequency at which the vocal cords vibrate.
Dissecting the communication process into layers offers several benefits, both in terms of understanding as well as for practical implementations. Understanding this layering helps us appreciate the complexity and richness of speech. Research disciplines have evolved around each layer. A layered approach to representing conversation is essential for modular software development; a clean architecture isolates each module from the specialized knowledge of the others, with information passed over well-defined interfaces. Because each layer consists of a different perspective on speech communication, each is likely to employ its own representation of speech for analysis and generation.
As a cautionary note, it needs to be recognized from the start that there is little evidence that humans actually function by invoking each of these layers during conversation. The model is descriptive without attempting to explain or identify components of our cognitive processes. The model is incomplete in that there are some aspects of speech communication that do not fit it, but it can serve as a framework for much of our discussion of conversational computer systems.
REPRESENTATIONS OF SPEECH
We need a means of describing and manipulating speech at each of these layers. Representations for the lower layers, such as acoustic waveforms or phonemes, are simpler and more complete and also more closely correspond to directly observable phenomena. Higher-layer representations, such as semantic or discourse structure, are subject to a great deal more argument and interpretation and are usually abstractions convenient for a computer program or a linguistic comparison of several languages. Any single representation is capable of conveying particular aspects of speech; different representations are suitable for discussion of the different layers of the communication process.
The representation chosen for any layer should contain all the information required for analysis at that layer. Since higher layers possess a greater degree of abstraction than lower layers, higher-layer representations extract features from lower-layer representations and hence lose the ability to recreate the original information completely. For example, numerous cues at the acoustic layer may indicate that the talker is female, but if we represent the utterance as a string of phones or of words we have lost those cues. In terms of computer software, the representation is the data type by which speech is described. One must match the representation and the particular knowledge about speech that it conveys to the algorithms employing it at a particular layer of speech understanding.
Acoustic Representations
Sounds consist of variations in air pressure over time at frequencies that we can hear. Speech consists of a subset of the sounds generated by the human vocal tract. If we wish to analyze a sound or save it to hear again later, we need to capture the variations in air pressure. We can convert air pressure to electric voltage with a microphone and then convert the voltage to magnetic flux on an audiocassette tape using a recording head, for example.
We can plot the speech signal in any of these media (air pressure, voltage, or magnetic flux) over time as a waveform, as illustrated in Figure 1.2. This representation exhibits positive and negative values over time because the speech radiating from our mouths causes air pressure to be temporarily greater or less than that of the ambient air.
A waveform describing sound pressure in air is continuous, while the waveforms employed by computers are digital, or sampled, and have discrete values for each sample; these concepts are described in detail in Chapter 3. Tape recorders store analog waveforms; a compact audio disc holds a digital waveform. A digitized waveform can be made to represent the original sound very closely, and it can be captured easily with inexpensive equipment. A digitized sound stored in computer memory allows for fast random access. Once digitized, the sound may be further processed or compressed using digital signal processing techniques. The analog audiotape supports only sequential access (it must be rewound or fast-forwarded to jump to a different part of the tape) and is prone to mechanical breakdown.
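The contrast between a continuous pressure wave and its sampled digital counterpart can be sketched in a few lines of Python (the 8 kHz sampling rate and 100 Hz tone are illustrative choices, not taken from the text):

```python
import math

SAMPLE_RATE = 8000   # samples per second (telephone-quality rate)
FREQ = 100.0         # a 100 Hz periodic sound (illustrative)
DURATION = 0.1       # 100 milliseconds, as in Figure 1.2

# Sampling: evaluate the continuous pressure wave at discrete,
# evenly spaced instants to get a digital waveform.
n_samples = int(SAMPLE_RATE * DURATION)
samples = [math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE)
           for n in range(n_samples)]

# Like speech pressure swinging above and below ambient air
# pressure, the samples swing positive and negative around zero.
peak, trough = max(samples), min(samples)
```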
Figure 1.2. A waveform, showing 100 milliseconds of the word "me." The vertical axis depicts amplitude, and the horizontal axis represents time. The display depicts the transition from "m" to "e."
A waveform can effectively represent the original signal visually when plotted on a piece of paper or a computer screen. But to make observations about what a waveform sounds like, we must analyze it across a span of time, not just at a single point. For example, in Figure 1.2 we can determine the amplitude ("volume") of the signal by looking at the differences between its highest and lowest points. We can also see that it is periodic: The signal repeats a pattern over and over. Since the horizontal axis represents time, we can determine the frequency of the signal by counting the number of periods in one second. A periodic sound with a higher frequency has a higher pitch than a periodic sound with a lower frequency.
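Both measurements just described, amplitude from the signal's extremes and frequency by counting periods per second, can be sketched in Python (the 200 Hz test tone is an illustrative stand-in for a periodic speech sound):

```python
import math

SAMPLE_RATE = 8000
FREQ = 200.0   # illustrative periodic signal, one second long
samples = [math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE)
           for n in range(SAMPLE_RATE)]

# Amplitude: the spread between the highest and lowest points.
amplitude = max(samples) - min(samples)

# Frequency: count complete periods in one second. Each period
# contributes one negative-to-positive zero crossing.
rising = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
estimated_freq = rising   # crossings per second, approximately FREQ
```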
One disadvantage of working directly with waveforms is that they require considerable storage space, making them bulky; Figure 1.2 shows only 100 milliseconds of speech. A variety of schemes for compressing speech to minimize storage are discussed in Chapter 3. A more crucial limitation is that a waveform simply shows the signal as a function of time. A waveform is in no way speech specific and can represent any acoustical phenomenon equally well. As a general-purpose representation, it contains all the acoustic information but does not explicitly describe its content in terms of properties of speech signals.
A speech-specific representation more succinctly conveys those features salient to speech and phonemes, such as syllable boundaries, fundamental frequency, and the higher-energy frequencies in the sound. A spectrogram is a transformation of the waveform into the frequency domain. As seen in Figure 1.3, the spectrogram reveals the distribution of various frequency components of the signal as a function of time, indicating the energy at each frequency. The horizontal axis represents time, the vertical axis represents frequency, and the intensity or blackness at a point indicates the acoustic energy at that frequency and time.

Figure 1.3. A spectrogram of 2.5 seconds of speech. The vertical axis is frequency, the horizontal axis is time, and energy maps to darkness.

A spectrogram still consists of a large amount of data but usually requires much less storage than the original waveform and reveals acoustic features specific to speech. Because of this, spectral analysis⁴ is often employed to process speech for analysis by a human or a computer. People have been trained to read spectrograms and determine the words that were spoken. Although the spectrogram conveys salient features of a sound, the original acoustic signal cannot be reconstructed from it without some difficulty. As a result, the spectrogram is more useful for analysis of the speech signal than as a means of storing it for later playback.

⁴To be precise, spectral or Fourier analysis uses mathematical techniques to derive the values of energy at particular frequencies. We can plot these as described above; this visual representation is the spectrogram.
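The transformation underlying a spectrogram, slicing the waveform into short overlapping frames and taking the spectrum of each, can be sketched with numpy (the frame size, hop, and 440 Hz test tone are illustrative choices, not from the text):

```python
import numpy as np

SAMPLE_RATE = 8000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE        # one second of time
signal = np.sin(2 * np.pi * 440 * t)            # a 440 Hz test tone

FRAME = 256   # samples per analysis frame (32 ms at 8 kHz)
HOP = 128     # frame advance; successive frames overlap by half

# Short-time Fourier transform: window each slice of the waveform
# and take the magnitude spectrum of every slice.
window = np.hanning(FRAME)
frames = np.array([signal[i:i + FRAME] * window
                   for i in range(0, len(signal) - FRAME + 1, HOP)])
spectrogram = np.abs(np.fft.rfft(frames, axis=1))  # time x frequency

# Rows index time; columns index frequency bins of width
# SAMPLE_RATE / FRAME Hz. The strongest bin sits near 440 Hz.
peak_bin = spectrogram.mean(axis=0).argmax()
peak_freq = peak_bin * SAMPLE_RATE / FRAME
```

Plotting `spectrogram` with time horizontal, frequency vertical, and magnitude as darkness reproduces the display described for Figure 1.3.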
Other acoustic representations in the frequency domain are even more succinct, though they are more difficult for a human to process visually than a spectrogram. Linear Prediction Coefficients and Cepstral Analysis, for example, are two such techniques that rely heavily on digital signal processing.⁵ Both of these techniques reveal the resonances of the vocal tract and separate information about how the sound was produced at the noise source and how it was modified by various parts of the vocal tract. Because these two techniques extract salient information about how the sound was articulated, they are frequently used as representations for computer analysis of speech.
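Linear prediction models each sample as a weighted sum of the preceding few samples; the weights can be found by solving the autocorrelation normal equations. A minimal numpy sketch (the second-order synthetic signal and this direct solver are textbook illustrations, not the book's implementation):

```python
import numpy as np

def lpc(signal, order):
    """Estimate linear-prediction coefficients by solving the
    autocorrelation normal equations (a minimal textbook method)."""
    signal = np.asarray(signal, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([signal[:len(signal) - k] @ signal[k:]
                  for k in range(order + 1)])
    # Solve R a = r[1:], where R is Toeplitz in r, so that
    # signal[n] is approximated by sum(a[k] * signal[n - 1 - k]).
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])

# A crude stand-in for a vocal-tract resonance: a second-order
# autoregressive process driven by noise (the "source").
rng = np.random.default_rng(0)
x = np.zeros(5000)
noise = rng.standard_normal(5000)
for n in range(2, len(x)):
    x[n] = 1.3 * x[n - 1] - 0.8 * x[n - 2] + noise[n]

coeffs = lpc(x, order=2)   # recovers roughly [1.3, -0.8]
```

Recovering the resonator's coefficients from the output alone is exactly the sense in which linear prediction separates the vocal-tract filter from its noise source.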
⁵Linear prediction will be explained in Chapter 3. Cepstral analysis is beyond the scope of this book.

PHONEMES AND SYLLABLES

Two representations of speech which are more closely related to its lexical structure are phonemes and syllables. Phonemes are important small units, several of which make up most syllables.

Phonemes

A phoneme is a unit of speech, the set of which defines all the sounds from which words can be constructed in a particular language. There is at least one pair of words in a language for which replacing one phoneme with another will change what is spoken into a different word. Of course, not every combination of phonemes results in a word; many combinations are nonsense.
For example, in English, the words "bit" and "bid" have different meanings, indicating that the "t" and "d" are different phonemes. Two words that differ in only a single phoneme are called a minimal pair. "Bit" and "bif" also vary by one sound, but this example does not prove that "t" and "f" are distinct phonemes, as "bif" is not a word. But "tan" and "fan" are different words; this proves the phonemic difference between "t" and "f."

Vowels are also phonemes; the words "heed," "had," "hid," "hide," "howed," and "hood" each differ only by one sound, showing us that English has at least six vowel phonemes. It is simple to construct a minimal pair for any two vowels in English, while it may not be as simple to find a pair for two consonants.
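The minimal-pair test itself is mechanical once words are written as phoneme sequences; a small Python sketch (the phoneme notation here is illustrative):

```python
def is_minimal_pair(word_a, word_b):
    """True if two phoneme sequences differ in exactly one position.
    (Both sequences must also be real words of the language; that
    check requires a lexicon and is not attempted here.)"""
    if len(word_a) != len(word_b):
        return False
    return sum(p != q for p, q in zip(word_a, word_b)) == 1

# "bit" vs. "bid" differ only in the final phoneme, the contrast
# that establishes /t/ and /d/ as distinct English phonemes.
print(is_minimal_pair(["b", "i", "t"], ["b", "i", "d"]))   # True
print(is_minimal_pair(["t", "a", "n"], ["f", "a", "n"]))   # True
print(is_minimal_pair(["b", "i", "t"], ["f", "a", "n"]))   # False
```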
An allophone is one of a number of different ways of pronouncing the same phoneme. Replacing one allophone of a phoneme with another does not change the meaning of a sentence, although the speaker will sound unnatural, stilted, or like a non-native. For example, consider the "t" sound in "sit" and "sitter." The "t" in "sit" is somewhat aspirated; a puff of air is released with the consonant. You can feel this if you put your hand in front of your mouth as you say the word. But in "sitter" the same phoneme is not aspirated; we say the aspiration is not phonemic for "t" and conclude that we have identified two allophones. If you aspirate the "t" in "sitter," it sounds somewhat forced but does not change the meaning of the word.
In contrast, aspiration of stop consonants is phonemic in Nepali. For an example of English phonemes that are allophones in another language, consider the difficulties Japanese speakers have distinguishing our "l" and "r." The reason is simply that while these are two phonemes in English, they are allophonic variants on the same phoneme in Japanese. Each language has its own set of phonemes and associated allophones. Sounds that are allophonic in one language may be phonemic in another and may not even exist in a third. When you learn a language, you learn its phonemes and how to employ the permissible allophonic variations on them. But learning phonemes is much more difficult as an adult than as a child.
Because phonemes are language specific, we cannot rely on judgments based solely on our native languages to classify speech sounds. An individual speech sound is a phone, or segment. For any particular language, a given phoneme will have a set of allophones, each of which is a segment. Segments are properties of the human vocal mechanism, and phonemes are properties of languages. For most practical purposes, phone and phoneme may be considered to be synonyms.
Linguists use a notation for phones called the International Phonetic Alphabet, or IPA. IPA has a symbol for almost every possible phone; some of these symbols are shown in Figure 1.4. Since there are far more than 26 such phones, it is not possible to represent them all with the letters of the English alphabet. IPA borrows symbols from the Greek alphabet and elsewhere. For example, the "th" sound in "thin" is represented as "θ" in IPA, and the sound in "then" is "ð."⁶ American linguists who use computers have developed the Arpabet, which uses ordinary alphabet characters to represent phones; some phonemes are represented as a pair of letters. Arpabet⁷ was developed for the convenience of computer manipulation and representation of speech using ASCII-printable characters.

Phoneme   Example     Phoneme   Example     Phoneme   Example
i         beet        p         put         č         chin
ɪ         bit         t         tap         ǰ         judge
ɛ         bet         k         cat         m         map
e         bait        b         bit         n         nap
æ         bat         d         dill        ŋ         sing
a         cot         g         get         r         ring
ɔ         caught      f         fain        l         lip
ʌ         but         θ         thin        w         will
o         boat        s         sit         y         yell
ʊ         foot        ʃ         shoe        h         head
u         boot        v         veal
ɝ         bird        ð         then
aj (aɪ)   bite        z         zeal
ɔj (ɔɪ)   boy         ʒ         azure
aw (aʊ)   bout
ə         about

Figure 1.4. The English phonemes in IPA, the International Phonetic Alphabet.
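A few of the correspondences in Figure 1.4 can be written out as a small Python table; the Arpabet symbols below follow common usage (e.g., in the CMU Pronouncing Dictionary), and this subset is our illustration rather than the book's chart:

```python
# Arpabet symbols for a few of the phonemes in Figure 1.4, keyed
# by the figure's example words (an illustrative subset only).
ARPABET = {
    "beet": "IY",
    "bit":  "IH",
    "bat":  "AE",
    "thin": "TH",
    "then": "DH",
    "shoe": "SH",
    "sing": "NG",
    "boy":  "OY",
}

# Every Arpabet symbol is one or two ordinary letters, so a phoneme
# string can be stored and printed as plain ASCII text.
assert all(sym.isascii() and len(sym) <= 2 for sym in ARPABET.values())
```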
To avoid the necessity of the reader learning either IPA or Arpabet, this book indicates phones by example, such as "the 't' in bottle." Although slightly awkward, such a notation suffices for the limited examples described. The reader will find it necessary to learn a notation (IPA is more common in textbooks) to make any serious study of phonetics or linguistics.

A phonemic transcription, although compact, has lost much of the original signal content, such as pitch, speed, and amplitude of speech. Phonemes are abstractions from the original signal that highlight the speech-specific aspects of that signal; this makes a phonemic transcription a concise representation for lexical analysis as it is much more abstract than the original waveform.

⁶Are these sounds phonemic in English?
⁷The word comes from the acronym ARPA, the Advanced Research Projects Agency (sometimes called DARPA), a research branch of the U.S. Defense Department that funded much early speech research in this country and continues to be the most significant source of government support for such research.
Another natural way to divide speech sounds is by the syllable. Almost any native speaker of English can break a word down into syllables, although some words can be more difficult, e.g., "chocolate" or "factory." A syllable consists of one or more consonants, a vowel (or diphthong⁸), followed by one or more consonants; the consonants are optional, but the vowel is not. Two or more adjacent consonants are called a consonant cluster; examples are the initial sounds of "screw" and "sling." Acoustically, a syllable consists of a relatively high energy core (the vowel) optionally preceded or followed by periods of lower energy (consonants). Consonants have lower energy because they impose constrictions on the airflow from the lungs. Many natural languages, such as those written with the Arabic script and the northern Indian languages, are written with a syllabic system in which one symbol represents both a consonant and its associated vowel.
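The syllable shape just described, optional consonants around an obligatory vowel nucleus, reduces to a one-line pattern; a Python sketch using illustrative C/V class symbols:

```python
import re

# 'C' marks a consonant, 'V' a vowel (or diphthong) nucleus: a
# syllable is zero or more consonants, one vowel, then zero or
# more consonants.
SYLLABLE = re.compile(r"C*VC*")

def is_syllable(pattern):
    """Check a consonant/vowel pattern such as 'CCCV' for "screw"."""
    return SYLLABLE.fullmatch(pattern) is not None

print(is_syllable("CCCV"))   # "screw": consonant cluster + vowel -> True
print(is_syllable("CVC"))    # "sit" -> True
print(is_syllable("CC"))     # no vowel nucleus -> False
```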
Other Representations
There are many other representations of speech appropriate to higher-layer aspects of conversation.⁹ Lexical analysis reveals an utterance as a series of words. A dictionary, or lexicon, lists all the words of a language and their meanings. The phrase, sometimes called a "breath group" when describing intonation, is relevant both to the study of prosody (pitch, rhythm, and meter) as well as syntax, which deals with structures such as the noun phrase and verb phrase. A parse tree, as shown in Figure 1.5, is another useful representation of the syntactic relationships among words in a sentence.
Representations for higher layers of analysis are varied and complex. Semantics associates meaning with words, and meaning implies a relationship to other words or concepts. A semantic network indicates the logical relationships between words and meaning. For example, a door is a physical object, but it has specific meaning only in terms of other objects, such as walls and buildings, as it covers entrance holes in these objects.

Discourse analysis has produced a variety of models of the focus of a conversation. For example, one of these uses a stack to store potential topics of current focus. New topics are pushed onto the stack, and a former topic again becomes the focus when all topics above it are popped off the stack. Once removed from the stack, a topic cannot become the focus without being reintroduced.
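The stack model of discourse focus described above maps directly onto a push/pop data structure; a minimal Python sketch (the class and method names are our own):

```python
class FocusStack:
    """A stack model of conversational focus: new topics are pushed,
    and a former topic regains focus only when every topic above it
    has been popped off."""

    def __init__(self):
        self._topics = []

    def introduce(self, topic):
        """A newly introduced topic becomes the current focus."""
        self._topics.append(topic)

    def close_topic(self):
        """Pop the current topic; focus falls back to the one below."""
        return self._topics.pop()

    @property
    def focus(self):
        return self._topics[-1] if self._topics else None

conversation = FocusStack()
conversation.introduce("the hike")
conversation.introduce("the shoelace")   # the shoelace now in focus
conversation.close_topic()               # shoelace resolved...
print(conversation.focus)                # ...back to "the hike"
```

A closed topic must be pushed again to regain focus, matching the model's claim that a removed topic cannot become the focus without being reintroduced.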
⁸A diphthong consists of two vowel sounds spoken in sequence and is considered a single phoneme. The two vowel sounds in a diphthong cannot be separated into different syllables. The vowels in "hi" and "bay" are examples of diphthongs.
⁹Most of the speech representations mentioned in this section will be detailed in Chapter 9.
Figure 1.5. A parse tree for the sentence "John saw Mary."
SUMMARY

The perspective of speech as a conversational process constitutes the foundation and point of view of this book. Conversations employ spontaneous speech and a variety of interaction techniques to coordinate the exchange of utterances between participants. Speech is our primary medium of communication, although writing may be favored for longer-lasting or more formal messages. Computer systems would do well to exploit the richness and robustness of human conversation.

But "conversation" is simply too rich and varied a term to be amenable to analysis without being analyzed in components. This chapter identified a number of such components of a conversation: acoustic, articulatory, phonetic, lexical, syntactic, semantic, pragmatic, and discourse. Disciplines of research have been established for each of these areas, and the rest of this book will borrow heavily from them. Each discipline has its own set of representations of speech, which allow utterances to be described and analyzed.

Representations such as waveforms, spectrograms, and phonetic transcriptions provide suitable abstractions that can be embedded in the computer programs that attempt to implement various layers of speech communication. Each representation highlights particular features or characteristics of speech and may be far removed from the original speech sounds.
The rationale for this book is that the study of speech in conversations is interdisciplinary and that the designers of conversational computer systems need to understand each of these individual components in order to fully appreciate the whole. The rest of this book is organized in part as a bottom-up analysis of the layers of speech communication. Each layer interacts with the other layers, and the underlying goal for conversational communication is unification of the layers.
