CarlaSimoes I MS WS Speech Tech

I Microsoft Workshop on Speech Technology - Building bridges between industry
and academia, May 2 2007, MLDC
Acoustic Modeling
Introduction and Methodology
Fellowship in collaboration with Prof. Carlos Teixeira, FCUL
Carla Simões
t-carlas@microsoft.com
Overview
• Introduction
– Speech Components
– What are Acoustic Models?
– Why use them?
• Methodology
– Training Acoustic Models
• Modelling
– English Spoken by Portuguese speakers
• Conclusion and Future Work

Speech Components
[Diagram: components of the speech platform]
• Acoustic model training: Corpus (Speech + Transcriptions) → Feature extraction → Feature vector → Acoustic Models (Hidden Markov Models)
• Lexicon (phonetic dictionary)
• Grammar + Lexicon (for SR apps; the grammar defines the permitted sequence of words)
• Language Pack (contains core SR and TTS engines)
– Speech Recognition Engine (SR)
– Text-to-speech Engine (TTS)
• SAPI (developer’s Speech API)
• Speech Applications
– Desktop (Office12, Vista)
– Home (TV, Kitchen)
– Mobility (Voice Command)
– Telephony (Speech Server 2007, Exchange 12)
What are Acoustic Models?
• They reflect the way we pronounce a certain language
• Speech can be broken into phonetic segments, phones
• Acoustic Models are representations of speech segments
• Acoustic model training involves mapping models to acoustic examples obtained from training data
Why use them?
• Basis of an automatic speech recognition (ASR) system
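[Diagram: a sequence of symbols (S1 S2 S3 …) is spoken as a speech waveform; the front end converts the waveform into a sequence of observed speech vectors, from which the symbol sequence S1 S2 S3 … is recovered]
• The acoustic model gives the likelihood of a given feature vector being produced by a particular phoneme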
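To make the idea of a per-phoneme likelihood concrete, here is a minimal Python sketch that scores one feature vector against two phone models. It assumes a single diagonal-covariance Gaussian per phone; the phone labels, means, and variances are made-up placeholders, not values from the system described in these slides.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical 2-dimensional phone models (illustrative values only).
PHONE_MODELS = {
    "aa": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "iy": {"mean": [-1.0, 2.0], "var": [1.0, 1.0]},
}


def score_phones(x):
    """Return the log-likelihood of x under every phone model."""
    return {
        phone: log_gaussian(x, model["mean"], model["var"])
        for phone, model in PHONE_MODELS.items()
    }


def best_phone(x):
    """Return the phone whose model assigns x the highest likelihood."""
    scores = score_phones(x)
    return max(scores, key=scores.get)
```

A real recognizer combines such scores over many frames and with the lexicon and grammar; this sketch only shows the single-frame scoring step.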
Methodology
• Our acoustic models are based on Hidden Markov Models (HMMs)
– Markov assumption: the probability of each state depends only on the previous state
• Each HMM has 3 states
– Each state represents a short segment of speech, described mathematically by Gaussian probability distributions
– Acoustically similar information is shared across HMMs; the shared states are called senones
• Cross-word triphone System
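A minimal Python sketch of how a 3-state HMM scores a sequence of observations, using the forward algorithm in the log domain. It assumes a left-to-right topology, one-dimensional observations, and one Gaussian per state; the means, variances, and self-loop probability are made-up placeholders, not parameters of the models discussed here.

```python
import math

# Hypothetical parameters for one 3-state phone model (illustrative only).
MEANS = [0.0, 3.0, 6.0]
VARS = [1.0, 1.0, 1.0]
SELF_LOOP = 0.6  # probability of staying in a state; 1 - SELF_LOOP advances


def log_gauss(x, mean, var):
    """Log density of scalar x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf inputs."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs):
    """Log-likelihood of obs, entering at state 0 and exiting from state 2."""
    n = len(MEANS)
    stay, move = math.log(SELF_LOOP), math.log(1 - SELF_LOOP)
    alpha = [-math.inf] * n
    alpha[0] = log_gauss(obs[0], MEANS[0], VARS[0])
    for x in obs[1:]:
        new_alpha = []
        for j in range(n):
            # A state is reached by a self-loop or from its left neighbour.
            paths = [alpha[j] + stay]
            if j > 0:
                paths.append(alpha[j - 1] + move)
            new_alpha.append(logsumexp(paths) + log_gauss(x, MEANS[j], VARS[j]))
        alpha = new_alpha
    return alpha[-1] + move  # leave the model from the last state
```

Because each state depends only on the previous one, the recursion needs only the previous column of scores, which is the Markov assumption in practice.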

Methodology
• Training a cross-word triphone system for a new language
– Acoustic model training involves mapping acoustic models (cross-word triphones or whole-word triphones) to their corresponding labels (transcriptions)
[Diagram: training pipeline]
• Inputs: Corpus (speech + word-level transcriptions) + Lexicon; Phoneset + QuestionSet
• Word-level transcriptions are converted into monophone-level transcriptions
• The prototype monophone system is converted to an initial cross-word triphone system
• The cross-word system is then updated to produce the final cross-word triphone system
Notes on the diagram:
– A phoneset file should never contain more than 50 phones
– Clustering triphones into acoustically similar groups
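The transcription steps of this pipeline can be sketched in Python as follows. The lexicon entries, the phone symbols, the `sil` boundary label, and the `left-centre+right` triphone notation are all assumptions made for illustration; the slides do not specify the actual formats used.

```python
# Hypothetical phonetic dictionary (word -> phone sequence).
LEXICON = {
    "open": ["ow", "p", "ax", "n"],
    "file": ["f", "ay", "l"],
}


def to_monophones(words):
    """Expand a word-level transcription into a monophone-level one."""
    return [phone for word in words for phone in LEXICON[word.lower()]]


def to_cross_word_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbours.

    Context is taken across word boundaries, so the last phone of one
    word sees the first phone of the next word as its right context.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]
```

For the transcription "open file", the final phone of "open" receives the first phone of "file" as its right context, which is what distinguishes a cross-word system from a word-internal one.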
Modelling
English Spoken by Portuguese Speakers
• A speech recognizer’s accuracy is normally lower for non-native users
– Non-native accents are more problematic than dialects because they show more variability
• Research on non-native accent modeling reveals large gains in performance when the acoustics and pronunciation of an accent are taken into account
• A usage scenario: voice-controlled applications where Portuguese is the dominant language but English terms are supported with the same accuracy
Modelling
• Experiments addressing this problem are under way
• Corpus description
– 4689 utterances covering a vocabulary of 227 words
– Files are sampled at 8 kHz, 16-bit linear
– 11 male speakers
• Model settings
– 3468 utterances for training, 1221 utterances for testing
– 43 minutes of speech
– 1200 senones
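As a quick consistency check on the figures above, the short Python sketch below confirms that the training and test sets add up to the full corpus and computes the split proportions (roughly three quarters for training). The variable names are illustrative.

```python
TOTAL_UTTERANCES = 4689
TRAIN_UTTERANCES = 3468
TEST_UTTERANCES = 1221


def split_fractions(train, test):
    """Return the (train, test) shares of the combined set."""
    total = train + test
    return train / total, test / total


train_frac, test_frac = split_fractions(TRAIN_UTTERANCES, TEST_UTTERANCES)
print(f"train {train_frac:.1%}, test {test_frac:.1%}")
```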
Modelling
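[Diagram: two experiments evaluated on the test corpus]
• Training: the English spoken by Portuguese corpus is used to train a new model, which is then tested on the test corpus
• Update: the ENU model is updated with the English spoken by Portuguese corpus to produce a new model, which is then tested on the test corpus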
Modelling
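[Diagram: mapped-phoneset experiment]
• The ENU phoneset and the PTG phoneset are combined into an English to Portuguese mapped phoneset
• The English spoken by Portuguese corpus and the PTG corpus are used to train a new model
• The new model is tested on the test corpus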
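One way to picture the mapped phoneset is as a lookup table that rewrites English phone labels as Portuguese ones, so that both corpora can share the same models. The Python sketch below is only an illustration: the phone symbols and the mapping entries are hypothetical placeholders, not the mapping used in these experiments.

```python
# Hypothetical ENU -> PTG phone mapping (placeholder symbols only).
ENU_TO_PTG = {
    "th": "t",
    "ih": "i",
    "s": "s",
}


def map_transcription(enu_phones, mapping):
    """Rewrite an ENU phone sequence with PTG labels.

    Phones without a mapping are kept unchanged and reported separately,
    so gaps in the mapped phoneset are easy to spot.
    """
    mapped, unmapped = [], []
    for phone in enu_phones:
        if phone in mapping:
            mapped.append(mapping[phone])
        else:
            mapped.append(phone)
            unmapped.append(phone)
    return mapped, unmapped
```

Reporting unmapped phones rather than silently dropping them is a design choice for this sketch; a real phonetic study would decide how each such phone should be handled.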
Future Work
• Improving acoustic models requires gathering hundreds of hours of speech data
• Even more data is needed for non-native speakers, because accent variability is higher
Possible solutions:
• Define new phonesets, which implies a phonetic study of how Portuguese speakers pronounce English
• Train the native models with the English spoken by Portuguese corpus
Thank you very much for your attention!
www.microsoft.com/portugal/mldc
© 2007 Microsoft Corporation. All rights reserved.

This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
