Computer-based Lip-reading using Motion Templates

Wai Chee Yau


PhD graduate, School of Electrical and Computer Engineering, RMIT University
waichee@ieee.org
PhD supervisor: A/Prof. Dinesh K. Kumar

Outline
Overview of speech recognition
Non-audio speech modalities
Motivation for visual speech recognition
Related work
Contributions of this research
Mouth motion segmentation
Feature extraction
Classification
Temporal segmentation of utterances
Experiments
Discussion
Conclusions

Overview of Speech Recognition


Speech recognition: mapping spoken words to computer inputs/commands

Audio signals (voice): the most commonly used input for speech recognition systems

Audio speech recognition applications:
speech-to-text applications
voice dialing of mobile phones
call routing
aircraft control inside the pilot cockpit

Overview of Speech Recognition


Advantages:
Natural and convenient communication method
Suitable for disabled users who cannot use their hands to control a computer
Useful for hands-and-eyes-busy situations:
controlling the car radio or navigation system while driving
controlling heavy machinery in factories

Problems of audio speech recognition:
Affected by environmental noise (inside a moving vehicle, noisy factories, pilot cockpits)
Sensitive to speaking styles (whispering and speaking loudly)

Overview of Speech Recognition

Possible solutions:
noise-robust techniques, such as microphone arrays (Brandstein and Ward, 2001; Sullivan and Stern, 1993) and noise adaptation algorithms (Gales and Young, 1992; Stern et al., 1996)
non-audio speech modalities

Non Audio Speech Modalities


Visual: videos and images
Examples: infra-red camera mounted on a headset; in-car control system

G. Potamianos, Audio-visual speech processing: Progress and Challenges, VisHCI, Canberra, Australia, Nov. 2006

Non Audio Speech Modalities


Muscle activity signals
Examples: lip-reading mobile phone (NTT DoCoMo); control of a Mars rover (NASA Ames Research Lab)

Voiceless Recognition Without the Voice, May 1, 2004 issue of CIO Magazine
C. Jorgensen, D. D. Lee and S. Agabon, Sub-auditory recognition based on EMG/EPG signals, 2003.

Non Audio Speech Modalities


Facial EMG signals for English and German vowel recognition
(S.P. Arjunan, D.K. Kumar, W.C. Yau, H. Weghorn, Unspoken Vowel Recognition Using Facial Electromyogram, EMBC, New York, 2006)

Brain signals
(http://www.eng.nus.edu.sg/EResnews/0202/rd/rd_4.html)

Non Audio Speech Modalities


Advantages of voice-less speech recognition:

The user can control the system without making a sound:
uttering the PIN code of a security system
defence or military applications
control of computers/machines by disabled users with speech impairments

When the audio signal is greatly affected by noise:
control of the car radio in a vehicle

Motivation for Using Visual Speech Recognition


Advantages of the visual approach:
Non-invasive: no sensors have to be placed on the user
Cameras are commonly available

Motivation for Using Visual Speech Recognition


Is he saying Ba? Ga? Da?

http://www.media.uio.no/personer/arntm/McGurk_english.html

Motivation for Visual Speech Recognition

McGurk effect (McGurk 1976): we combine what we see with what we hear when understanding speech.
Sound /ba/ + lip movement /ga/: 98% of us perceive /da/

People with hearing impairment can lip-read by looking at the mouth of the speaker

Block diagram of a lip-reading system
(Figure: video input, mouth motion segmentation, feature extraction, classification, recognized utterance)

Related Work
Visual speech recognition techniques reported in the literature can be broadly divided into :

Appearance-based (Potamianos et al. 2004): uses image pixels in the region surrounding the mouth
Shape-based (Petajan 1984; Adjoudani et al. 1996): uses shape information of the mouth/lips, e.g. lip height and width
Motion-based (Mase & Pentland 1991): describes the mouth movements


Related Work
Few studies have focused on motion features, yet human perceptual studies indicate that dynamic information is important for visual speech perception (Rosenblum & Saldaña, 1998)

Contributions of this research


This research:
proposes new motion features for computer-based lip-reading using motion templates (MT)
investigates the use of Zernike moments to derive rotation-invariant features
compares the performance of hidden Markov models (HMM) and support vector machines (SVM) for classification of the motion features

Mouth movement segmentation


Motion templates (MT) are 2D grayscale images (Bobick et al. 2001) where:
intensity values indicate when the movements occurred
pixel locations indicate where the movements happened

Step 1: Compute the difference of frames (DOF)
Step 2: Convert the DOFs into binary images
Step 3: Temporally integrate the binary images with a linear ramp of time

Mouth movement segmentation


Removes the static elements and preserves the short-duration facial movements
Invariant, within limits, to skin colour

The intensity values of the MT are normalized to reduce the effect of changes in speaking speed
Histogram equalization is applied to the MT to minimize global changes in lighting conditions
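The two post-processing steps can be sketched as follows; this assumes the MT is an 8-bit image, and the min-max rescaling is one plausible reading of the normalization described above:

```python
import numpy as np

def normalize_mt(mt):
    """Rescale MT intensities to the full [0, 255] range so that
    utterances spoken at different speeds give comparable templates."""
    mt = mt.astype(np.float32)
    span = mt.max() - mt.min()
    if span == 0:
        return np.zeros_like(mt, dtype=np.uint8)
    return (255.0 * (mt - mt.min()) / span).astype(np.uint8)

def equalize_histogram(mt):
    """Histogram equalization to reduce global lighting changes."""
    hist = np.bincount(mt.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = 255.0 * cdf / cdf[-1]          # normalized cumulative histogram
    return cdf[mt].astype(np.uint8)      # map each pixel through the CDF
```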

Feature extraction
2 types of features investigated:

DCT coefficients
commonly used as appearance-based features in visual speech recognition

Zernike moments
a type of image moment; novel features for visual speech recognition

Zernike moments
Advantages of Zernike moments:
Selected as one of the robust feature descriptors in MPEG-7 (Jeannin 2000)
Rotation invariant
Robust, with good image representation capability (Teh 1988)

Computed by projecting the image function onto the orthogonal Zernike polynomials
Before computing ZM from an MT, the MT needs to be mapped to a unit circle

Zernike moments
Mapping of MT to a unit circle

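A sketch of one common mapping convention (the unit circle inscribed in the image, with corner pixels falling outside the disc); the exact convention used in the thesis is not stated on the slide:

```python
import numpy as np

def unit_circle_coords(shape):
    """Map pixel indices of an n x m image to polar coordinates
    (rho, theta) on the unit circle."""
    n, m = shape
    y, x = np.mgrid[0:n, 0:m]
    # centre the grid and scale each axis to [-1, 1]
    xc = (2.0 * x - (m - 1)) / (m - 1)
    yc = (2.0 * y - (n - 1)) / (n - 1)
    rho = np.sqrt(xc ** 2 + yc ** 2)
    theta = np.arctan2(yc, xc)
    return rho, theta   # pixels with rho > 1 lie outside the unit circle
```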

Zernike moments
Zernike moments:

A_{nm} = \frac{n+1}{\pi} \int_{0}^{2\pi} \int_{0}^{1} f(\rho, \theta) \, V_{nm}^{*}(\rho, \theta) \, \rho \, d\rho \, d\theta

Zernike polynomial:

V_{nm}(\rho, \theta) = R_{nm}(\rho) \, e^{jm\theta}, \quad n \ge 0, \ |m| \le n, \ n - |m| \text{ even}

Normalizing constant: (n+1)/\pi

Radial polynomials:

R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^{s} (n-s)!}{s! \left(\frac{n+|m|}{2}-s\right)! \left(\frac{n-|m|}{2}-s\right)!} \rho^{n-2s}

Zernike moments
Zernike moments computed from an image function f(x, y), summing over the pixels inside the unit circle:

A_{nm} = \frac{n+1}{\pi} \sum_{x} \sum_{y} f(x, y) \, V_{nm}^{*}(\rho, \theta), \quad \rho \le 1

The magnitude of ZM is rotation invariant: rotating the image by an angle \alpha yields A'_{nm} = A_{nm} e^{-jm\alpha}, so |A'_{nm}| = |A_{nm}|
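Putting the definitions together, a sketch of the discrete computation, reusing the unit_circle_coords helper sketched earlier; the direct double sum is slow but mirrors the formula:

```python
import numpy as np
from math import factorial

def radial_poly(rho, n, m):
    """Zernike radial polynomial R_nm evaluated on an array of radii."""
    m = abs(m)
    r = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s) * factorial((n + m) // 2 - s)
                             * factorial((n - m) // 2 - s)))
        r += c * rho ** (n - 2 * s)
    return r

def zernike_moment(img, n, m):
    """|A_nm| of a grayscale image mapped onto the unit disc."""
    rho, theta = unit_circle_coords(img.shape)
    mask = rho <= 1.0                       # keep pixels inside the disc
    v_conj = radial_poly(rho[mask], n, m) * np.exp(-1j * m * theta[mask])
    a = (n + 1) / np.pi * np.sum(img[mask] * v_conj)
    return np.abs(a)                        # magnitude is rotation invariant
```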


DCT features
The 2-D DCT produces a compact energy representation of an image, concentrating most of the energy in the low-frequency coefficients at the top-left corner of the coefficient matrix

DCT features
For an M x N image f(x, y), the DCT coefficients are:

C(u, v) = \alpha(u) \alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \cos\left[\frac{(2x+1) u \pi}{2M}\right] \cos\left[\frac{(2y+1) v \pi}{2N}\right]

where \alpha(0) = \sqrt{1/M} and \alpha(u) = \sqrt{2/M} for u > 0 (and similarly for \alpha(v) with N)

DCT features have been shown to outperform DWT and PCA features for visual speech recognition (Potamianos 2000)
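A sketch of DCT feature extraction using SciPy; keeping the 8 x 8 top-left block is an assumption that yields the 64 coefficients used later (a zigzag scan is another common choice):

```python
import numpy as np
from scipy.fft import dctn

def dct_features(mt, k=8):
    """Orthonormal 2-D DCT of a motion template, keeping the k x k
    low-frequency block from the top-left corner."""
    coeffs = dctn(mt.astype(np.float32), norm='ortho')
    return coeffs[:k, :k].ravel()           # k*k coefficients (64 for k=8)
```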

Classification
Assigning new feature vectors to one of the pre-defined utterances

Two types of classifiers evaluated:
Generative model: hidden Markov models (HMM)
Discriminative classifier: support vector machines (SVM)

SVM classifier
Supervised classifiers trained using learning algorithms from statistical learning theory
Successfully applied to a range of image object recognition tasks
Can find the optimal hyperplane between classes in sparse, high-dimensional spaces with relatively little training data
SVM with an RBF kernel is used in the experiments to classify the motion features (Zernike moments and DCT coefficients)
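A sketch of this setup with scikit-learn, on hypothetical data; the parameter grid is an assumption, as the slides only state that the SVM parameters were chosen by 5-fold cross-validation:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical data: one 64-dimensional feature vector (ZM or DCT
# coefficients) per motion template, with integer viseme labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(280, 64))
y = rng.integers(0, 14, size=280)

# RBF-kernel SVM with C and gamma selected by 5-fold cross-validation
param_grid = {'C': [1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1]}
clf = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
clf.fit(X, y)
print(clf.best_params_)
```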

HMM classifier
Assumes that speech signals contain short time segments that are stationary
Models these short periods where the signals are steady; the changes between segments are represented as state transitions in the HMM
The temporal variations within each of these segments are modelled statistically

HMM Classifier
The motion features are assumed to be Gaussian distributed and are modelled as continuous observation densities
Each phone is modelled as a left-right HMM with 3 states and diagonal covariance matrices; this HMM structure has been demonstrated to be suitable for modelling English phonemes (Foo & Dong 2002)
The Baum-Welch algorithm is used during HMM training to re-estimate the HMM parameters
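A sketch of such a model using the hmmlearn library (an assumed stand-in for the Matlab HMM toolbox named later in the experimental setup); zero entries in the transition matrix stay zero under Baum-Welch re-estimation, which preserves the left-right topology:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def make_left_right_hmm(n_states=3):
    """3-state left-right HMM with diagonal Gaussian observations.
    init_params='mc' lets fit() initialise only means and covariances,
    keeping the start/transition matrices set below."""
    model = GaussianHMM(n_components=n_states, covariance_type='diag',
                        init_params='mc', n_iter=20)
    model.startprob_ = np.array([1.0, 0.0, 0.0])    # always start in state 0
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])   # left-right topology
    return model

# Hypothetical training data: 6 sequences of 20 frames of 64-D features,
# stacked row-wise; fit() runs Baum-Welch re-estimation.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 64))
model = make_left_right_hmm()
model.fit(X, lengths=[20] * 6)
```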

HMM Classifier
Recognition phase:
compute the likelihood of each HMM producing the test sample
classify the test sample as the phoneme whose HMM produces the highest likelihood
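The recognition rule reduces to an argmax over per-model log-likelihoods, sketched here for a hypothetical dictionary of trained models:

```python
def classify(test_seq, models):
    """models: dict mapping phoneme label -> trained GaussianHMM.
    score() returns the log-likelihood of the observation sequence."""
    scores = {label: m.score(test_seq) for label, m in models.items()}
    return max(scores, key=scores.get)      # highest log-likelihood wins
```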

Temporal Segmentation of Utterances


Temporal segmentation of utterances is usually achieved using audio signals; a purely visual method is proposed here.

Temporal Segmentation of Utterances


This method combines motion and mouth appearance information
Motion information: 3-frame MTs are computed over the image sequence; the average energy of each 3-frame MT represents the magnitude of movement

Visual Utterance Segmentation


Mouth appearance information: a kNN classifier (k = 3) is trained to recognize 2 classes:
mouth appearance while uttering a phoneme (speaking)
mouth appearance during silence
Trained using mouth images captured while the talker is speaking and while he/she is maintaining silence

(Examples of silence images and speaking images)
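A sketch of the speaking/silence classifier with scikit-learn, on hypothetical flattened mouth images; the image size and feature representation are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: flattened 32x32 grayscale mouth images labelled
# 1 (speaking) or 0 (silence).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32 * 32))
y = rng.integers(0, 2, size=200)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 as stated above
knn.fit(X, y)
```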

Utterance Segmentation Algorithm


(Flowchart) For each new 3-frame MT, test whether mouth movement is present:
If movement is present, the previous frames were silence, and the following frames are classified as speaking: mark the start of an utterance
If movement is absent, the previous frames were speaking, and the following frames are classified as silence: mark the end of an utterance
Otherwise: compute the next 3-frame MT and repeat
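A sketch of this loop, reusing the motion_template and kNN sketches from earlier; the energy threshold and the single-frame confirmation used here are simplifying assumptions (the flowchart checks runs of previous/following frames):

```python
def segment_utterances(frames, knn, energy_threshold=5.0):
    """Return (start, end) frame indices of detected utterances."""
    boundaries, start = [], None
    for t in range(2, len(frames)):
        mt = motion_template(frames[t - 2:t + 1])          # 3-frame MT
        moving = mt.mean() > energy_threshold              # movement present?
        speaking = knn.predict(frames[t].reshape(1, -1))[0] == 1
        if start is None and moving and speaking:
            start = t - 2                                  # start of utterance
        elif start is not None and not moving and not speaking:
            boundaries.append((start, t))                  # end of utterance
            start = None
    return boundaries
```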

Experiments
Experiment 1: Compare the performance of Zernike moments and DCT features
Experiment 2: Compare the performance of HMM and SVM
Experiment 3: Evaluate the performance of the proposed temporal segmentation approach

Vocabulary
Recognition units: visemes (the basic unit of facial movement during the articulation of a phoneme)
The visemes defined in the MPEG-4 standard are used

Experimental Setup
Video recording and processing:
Recorded using a web camera in an office environment
Frontal view of the mouth of 10 speakers (5 males and 5 females), with a constant view angle
A total of 2800 utterances were recorded as 320x240 AVI files
Frame rate: 30 frames/sec

Experimental Setup
1 MT was generated from the grayscale images of each phoneme
Histogram equalization was applied to the images to reduce the effects of illumination variations
The images were analysed and processed using MATLAB 7

The LIBSVM toolbox (Chang and Lin 2001) was used to create the SVM classifier, and the HMM toolbox for Matlab (Murphy 1998) was used to design the HMM classifier

MT of 14 visemes


Experiments
Zernike moments and DCT coefficients are computed from the MTs
64 Zernike moments are used to represent each MT; the same number of DCT coefficients is used to form the feature vectors

Experiment 1 : Comparison of ZM and DCT features


The features are classified using SVM with an RBF kernel
The SVM parameters are determined through 5-fold cross-validation on the training data
Leave-one-out method for testing

Average recognition rates:
DCT: 99%
ZM: 97.4%

Experiment 1 : Comparison of ZM and DCT features


Sensitivity to illumination variations: trained with original-lighting images and tested with images whose brightness was reduced/increased by 30%

Average recognition rates:
DCT: 100%
Zernike moments: 100%

Experiment 1 : Comparison of ZM and DCT features


Recognition rates of ZM and DCT features under rotational changes

Experiment 1 : Comparison of ZM and DCT features


Sensitivity analysis of ZM and DCT to image noise


Experiment 2 : Comparison of SVM and HMM classifier


A single-stream, left-right HMM with 3 states was used to model each viseme
SVM and HMM were trained and tested using the ZM and DCT features of Participant 1 (280 utterances were used in this experiment)
Leave-one-out method

Average recognition rates:
HMM: 95.0%
SVM: 99.5%

Experiment 3 : Results for Temporal Segmentation of Utterances


98.6% accuracy: 276 out of a total of 280 phonemes were correctly segmented (4 errors)

Discussion
The results demonstrate the efficacy of the proposed motion features in visual speech recognition
DCT and Zernike moment features produce high accuracy in classifying 14 visemes using SVM
The proposed technique is demonstrated to be invariant to global changes in illumination
Zernike moments are demonstrated to be invariant to rotational changes of up to 20 degrees, whereas DCT is sensitive to such rotation

Discussion
DCT features have better tolerance to image noise than ZM features
Possible reasons for SVM outperforming HMM in classifying the motion features:
the STT representation eliminates the need for temporal modelling
the training dataset is not large
One possible reason for misclassification is occlusion of the articulators' movements
The accuracies are higher than the results reported by Foo et al. (2002) (88.5%) using static features tested on the same vocabulary

Conclusions
This research evaluated a novel approach for visual speech recognition using motion templates; the proposed technique is demonstrated to be useful for English phoneme recognition
DCT features are found to be sensitive to rotational changes, whereas Zernike moments are rotation invariant
The motion segmentation technique eliminates the need for temporal modelling of the features for phoneme classification; hence, SVM can be used to recognize the motion features, and it outperforms HMM
The efficacy of the proposed temporal segmentation approach using motion and appearance information is demonstrated

Future work
Evaluate the proposed visual speech recognition method on different languages
Combine with audio signals to form audio-visual speech recognition systems
Investigate the implementation of the proposed technique for speaker recognition and emotion recognition
