
I Microsoft Workshop on Speech Technology - Building bridges between industry and academia, May 2 2007, MLDC

Acoustic Modeling
Introduction and Methodology
Fellowship in collaboration with Prof. Carlos Teixeira, FCUL
Carla Simões
t-carlas@microsoft.com
Overview

• Introduction
– Speech Components
– What are Acoustic Models?
– Why use them?

• Methodology
– Training Acoustic Models

• Modelling
– English Spoken by Portuguese speakers

• Conclusion and Future Work


Speech Components
• Acoustic model training
– Corpus (speech + transcriptions) → feature extraction → feature vectors → acoustic models (Hidden Markov Models)
– Lexicon (phonetic dictionary)
• Speech Recognition (SR) engine
– Grammar + lexicon (for SR apps; the grammar defines the permitted sequence of words)
• Text-to-speech (TTS) engine
• Language Pack (contains core SR and TTS engines)
• SAPI (developer's Speech API)
• Speech applications
– Desktop (Office12, Vista)
– Home (TV, Kitchen)
– Telephony (Speech Server 2007, Exchange 12)
– Mobility (Voice Command)
What are Acoustic Models?

• They reflect the way a given language is pronounced

• Speech can be broken into phonetic segments called phones

• Acoustic models are representations of these speech segments

• Acoustic model training involves mapping models to acoustic examples obtained from training data
Why use them?

• Basis of an automatic speech recognition (ASR) system

[Figure: Speech waveform → Front end → Sequence of observed speech vectors → Sequence of symbols S1 S2 S3 …]

• The acoustic model gives the likelihood of a given feature vector having been produced by a particular phoneme
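
To make the last point concrete, here is a minimal sketch of how a per-phone model can score one feature vector. It assumes a single diagonal-covariance Gaussian per phone; the phone names, means, and variances are made up for illustration and are not taken from the system described in these slides.

```python
import math

def diag_gaussian_loglik(x, mean, var):
    """Log-density of feature vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical 2-dimensional models for two phones (illustrative numbers only).
PHONE_MODELS = {
    "aa": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "iy": {"mean": [-1.0, 2.0], "var": [1.0, 1.0]},
}

def score_frame(x):
    """Return the log-likelihood of one feature vector under every phone model."""
    return {
        phone: diag_gaussian_loglik(x, m["mean"], m["var"])
        for phone, m in PHONE_MODELS.items()
    }

scores = score_frame([0.9, 0.1])
best_phone = max(scores, key=scores.get)  # the phone whose model fits the frame best
```

A real recognizer combines such frame scores over time and with the grammar and lexicon; this sketch only shows the per-frame likelihood.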
Methodology

• Our acoustic models are based on Hidden Markov Models (HMMs)

– Markov assumption: the probability of each state depends only on the previous state

• Each HMM has 3 states

– Each state represents a short segment of speech, described mathematically by Gaussian probability distributions

– Acoustically similar information is shared across HMMs; the shared states are called senones

• Cross-word triphone System
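
The 3-state HMM described above can be sketched with the forward algorithm, which computes the likelihood of an observation sequence under one model. This is an illustration under stated assumptions, not the training code behind these slides: the transition matrix, the one-dimensional Gaussians, and the left-to-right topology with a fixed start and end state are all hypothetical.

```python
import math

# Hypothetical 3-state left-to-right HMM: each state may loop or advance.
TRANS = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# One 1-D Gaussian (mean, variance) per state; values are made up.
STATES = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.0)]

def gauss(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs):
    """P(obs | model), starting in state 0 and ending in the last state."""
    n = len(STATES)
    alpha = [0.0] * n
    alpha[0] = gauss(obs[0], *STATES[0])
    for x in obs[1:]:
        # Markov assumption: the new alpha depends only on the previous alpha.
        alpha = [
            sum(alpha[i] * TRANS[i][j] for i in range(n)) * gauss(x, *STATES[j])
            for j in range(n)
        ]
    return alpha[-1]
```

With these made-up parameters, a sequence that moves through the states in order, such as `[0.0, 3.0, 6.0]`, scores far higher than the same values reversed.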


Methodology
• Training up a cross-word triphone system for a new language
– Acoustic model training involves mapping acoustic models (cross-word or whole-word triphones) to their equivalent labels (transcriptions)

• Inputs: corpus (speech + word-level transcriptions), lexicon, phoneset, question set
– A phoneset file should never contain more than 50 phones

• Training steps
– Word-level transcriptions are converted into monophone-level transcriptions
– A prototype monophone system is converted to an initial cross-word triphone system
– The cross-word system is then updated to produce the final cross-word triphone system
– Triphones are clustered into acoustically similar groups
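
The transcription conversions in this pipeline can be sketched as follows. The lexicon entries, the phone symbols, and the `left-phone+right` label notation (a convention used by some HMM toolkits) are assumptions made for illustration; the slides do not specify the actual formats.

```python
# Hypothetical lexicon (phonetic dictionary); phone symbols are illustrative.
LEXICON = {
    "open": ["ow", "p", "ax", "n"],
    "file": ["f", "ay", "l"],
}

def to_monophones(words):
    """Word-level transcription -> monophone-level transcription, with silence at both ends."""
    phones = ["sil"]
    for word in words:
        phones.extend(LEXICON[word])
    phones.append("sil")
    return phones

def to_crossword_triphones(phones):
    """Label each phone with its left and right neighbours, ignoring word boundaries."""
    labels = []
    for i, phone in enumerate(phones):
        if phone == "sil":
            labels.append(phone)  # silence stays context-independent in this sketch
            continue
        labels.append(f"{phones[i - 1]}-{phone}+{phones[i + 1]}")
    return labels

monophones = to_monophones(["open", "file"])
triphones = to_crossword_triphones(monophones)
```

Note that the last phone of "open" receives the first phone of "file" as its right context (`ax-n+f`), which is what makes the system cross-word rather than word-internal.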
Modelling
English Spoken by Portuguese Speakers
• A speech recognizer's accuracy is normally lower for non-native users
– Non-native accents are more problematic than dialects because they show more variability

• Research on non-native accent modeling reveals large gains in performance when the acoustics and pronunciation of an accent are taken into account

• A usage scenario: voice-controlled applications where Portuguese is the dominant language but English terms are supported with the same accuracy
Modelling
English Spoken by Portuguese Speakers
• Experiments addressing this problem are under way

• Corpus description
– 4689 utterances covering a vocabulary of 227 words
– Audio files are sampled at 8 kHz, 16-bit linear
– 11 male speakers

• Model settings
– 3468 utterances for training, 1221 utterances for testing
– 43 minutes of speech
– 1200 senones
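
As a quick sanity check on these figures, the arithmetic below confirms that the training and test sets add up to the full corpus and estimates the raw audio size. It assumes single-channel audio and treats the 43 minutes as uncompressed 8 kHz, 16-bit samples; both are assumptions, not statements from the slides.

```python
TOTAL_UTTERANCES = 4689
TRAIN_UTTERANCES = 3468
TEST_UTTERANCES = 1221

SAMPLE_RATE_HZ = 8000   # 8 kHz
BYTES_PER_SAMPLE = 2    # 16-bit linear

def raw_pcm_bytes(minutes, channels=1):
    """Size of uncompressed audio of the given duration (mono by assumption)."""
    return int(minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)

train_fraction = TRAIN_UTTERANCES / TOTAL_UTTERANCES  # share of utterances used for training
speech_bytes = raw_pcm_bytes(43)                      # 41,280,000 bytes for 43 minutes
```

The split works out to roughly 74% of utterances for training and 26% for testing.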
Modelling
English Spoken by Portuguese Speakers
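• Experiment 1: train a new model
– English spoken by Portuguese corpus → Training → New Model
– New Model + Test corpus → Testing

• Experiment 2: update the ENU model
– English spoken by Portuguese corpus + ENU Model → Update → New Model
– New Model + Test corpus → Testing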
Modelling
English Spoken by Portuguese Speakers

• ENU Phoneset + PTG Phoneset → English to Portuguese mapped phoneset
• English spoken by Portuguese corpus + PTG corpus → Training → New Model
• New Model + Test corpus → Testing
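
One way to picture the mapped phoneset is as a lookup table that rewrites English (ENU) pronunciations using Portuguese (PTG) phones, so both corpora can share one set of models. The sketch below is hypothetical: the phone symbols and the specific mappings are invented for illustration and make no claim about the mapping actually used.

```python
# Hypothetical ENU -> PTG phone mapping; symbols and pairings are made up.
ENU_TO_PTG = {
    "th": "t",
    "ih": "i",
    "iy": "i",
}

def map_pronunciation(enu_phones, mapping=ENU_TO_PTG):
    """Rewrite an ENU pronunciation with PTG phones; unmapped phones are kept as-is."""
    return [mapping.get(phone, phone) for phone in enu_phones]

def map_lexicon(lexicon, mapping=ENU_TO_PTG):
    """Apply the phone mapping to every entry of a word -> phones lexicon."""
    return {word: map_pronunciation(phones, mapping) for word, phones in lexicon.items()}

mapped = map_lexicon({"thin": ["th", "ih", "n"]})
```

Keeping unmapped phones unchanged is a design choice for this sketch; a real mapped phoneset would need an explicit decision for every ENU phone.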
Future Work
• Improving acoustic models requires gathering hundreds of hours of speech data

• Even more data is needed when dealing with non-native speakers, because accent variability is much higher

Possible solutions:
• Define new phonesets, which implies a phonetic study of how Portuguese speakers pronounce English
• Train the native models with the English spoken by Portuguese corpus

Thank you very much for your attention!

www.microsoft.com/portugal/mldc

© 2007 Microsoft Corporation. All rights reserved.


This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
