Outline
- Overview of speech recognition
- Non-audio speech modalities
- Motivation for visual speech recognition
- Related work
- Contributions of this research
- Experiments
- Discussion
- Conclusions
Audio signals (voice): the most commonly used input for speech recognition systems
Audio speech recognition applications:
- speech-to-text applications
- voice dialing on mobile phones
- call routing
- aircraft control inside the pilot cockpit
Audio-only recognition performance degrades in noisy environments. Possible solutions:
- use noise-robust techniques such as microphone arrays (Brandstein and Ward, 2001; Sullivan and Stern, 1993) and noise adaptation algorithms (Gales and Young, 1992; Stern et al., 1996)
- use non-audio speech modalities
G. Potamianos, "Audio-visual speech processing: Progress and challenges," VisHCI, Canberra, Australia, Nov. 2006.
"Voiceless Recognition Without the Voice," CIO Magazine, May 1, 2004.
C. Jorgensen, D. D. Lee, and S. Agabon, "Sub-auditory recognition based on EMG/EPG signals," 2003.
S. P. Arjunan, D. K. Kumar, W. C. Yau, and H. Weghorn, "Unspoken vowel recognition using facial electromyogram," EMBC, New York, 2006.
http://www.eng.nus.edu.sg/EResnews/0202/rd/rd_4.html
McGurk effect (McGurk 1976): we combine what we see with what we hear when understanding speech. Sound /ba/ paired with lip movement /ga/ is perceived as /da/ by 98% of listeners.
People with hearing impairment can lip-read by looking at the mouth of the speaker
Related Work
Visual speech recognition techniques reported in the literature can be broadly divided into:
- Appearance-based (Potamianos et al. 2004): uses the image pixels in the region surrounding the mouth
- Shape-based (Petajan 1984; Adjoudani et al. 1996): uses the shape information of the mouth/lips
[Figure: example mouth images comparing an appearance-based pixel region with shape-based mouth height and width measurements]
Related Work
Few studies have focused on motion features, although human perceptual studies indicate that dynamic information is important for visual speech perception (Rosenblum & Saldaña, 1998).
Step 3: temporally integrate the binary images (produced in Steps 1 and 2 by differencing and thresholding consecutive frames) with a linear ramp of time
The intensity values of the MT are normalized to reduce the effect of variations in speaking speed.
Histogram equalization is applied to the MT to minimize the effect of global changes in lighting conditions. A sketch of the full MT computation follows.
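The slides do not spell out Steps 1 and 2 (frame differencing and thresholding). Below is a minimal sketch of the whole MT computation, assuming simple inter-frame differencing with an illustrative threshold; the original processing used MATLAB 7, and NumPy is used here purely for illustration.

```python
import numpy as np

def motion_template(frames, diff_threshold=25):
    """Build a motion template (motion history image) from grayscale frames.

    Binary inter-frame differences are integrated with a linear ramp of
    time, so later motion appears brighter. The result is normalized to
    [0, 1] to reduce sensitivity to speaking speed. diff_threshold is an
    illustrative value, not taken from the original work.
    """
    mt = np.zeros(frames[0].shape, dtype=np.float64)
    for t in range(1, len(frames)):
        # Steps 1-2: difference consecutive frames, threshold to binary
        diff = np.abs(frames[t].astype(np.int16) - frames[t - 1].astype(np.int16))
        # Step 3: linear ramp of time -- moving pixels take the frame index
        mt[diff > diff_threshold] = t
    return mt / max(len(frames) - 1, 1)  # intensity normalization
```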
Feature extraction
Two types of features are investigated:
- DCT coefficients: commonly used as appearance-based features in visual speech recognition
- Zernike moments: a type of image moment, used here as novel features for visual speech recognition
Zernike moments
Advantages of Zernike moments:
- selected as one of the robust feature descriptors in MPEG-7 (Jeannin 2000)
- rotation invariant
- robust, with good image representation capability (Teh 1988)
Computed by projecting the image function onto the orthogonal Zernike polynomials
Before computing ZM from MT, the MT needs to be mapped to a unit circle
Zernike moments
Mapping of MT to a unit circle
Zernike moments
Zernike moments:
Z_{nm} = \frac{n+1}{\pi} \sum_{x} \sum_{y} f(x, y)\, V_{nm}^{*}(x, y), \quad x^2 + y^2 \le 1

Zernike polynomial:
V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}, \quad n \ge 0,\ |m| \le n,\ n - |m| \text{ even}

Normalizing constant: \frac{n+1}{\pi}

Radial polynomials:
R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^{s}\,(n-s)!}{s!\,\left(\frac{n+|m|}{2}-s\right)!\,\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s}
Zernike moments
Zernike moments computed from an image function
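A minimal NumPy sketch of this projection, mapping a square grayscale MT onto the unit circle and discarding pixels outside it. This is illustrative only, not the authors' MATLAB code: discrete normalization conventions vary, and the magnitude |Z_nm| is what serves as the rotation-invariant feature.

```python
import numpy as np
from math import factorial

def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); requires n - |m| even."""
    m = abs(m)
    r = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s)
                * factorial((n + m) // 2 - s)
                * factorial((n - m) // 2 - s)))
        r += c * rho ** (n - 2 * s)
    return r

def zernike_moment(img, n, m):
    """Zernike moment Z_nm of a square grayscale image (e.g. an MT)."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]
    # map pixel coordinates onto [-1, 1] x [-1, 1], then keep the unit circle
    x = (2 * x - (w - 1)) / (w - 1)
    y = (2 * y - (h - 1)) / (h - 1)
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    # projection onto the conjugate Zernike polynomial V*_nm
    v_conj = radial_poly(n, m, rho) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(img[inside] * v_conj[inside])
```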
DCT features
The 2-D DCT produces a compact energy representation of an image, concentrating most of the energy in the top-left (low-frequency) corner of the coefficient matrix.
DCT features
For an M x N image f(x, y), the DCT coefficients are:
C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \cos\left[\frac{(2x+1)u\pi}{2M}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]
where \alpha(u) = \sqrt{1/M} for u = 0 and \sqrt{2/M} otherwise (\alpha(v) is defined analogously with N).
DCT features have been shown to outperform DWT and PCA features (Potamianos 2000) for visual speech recognition.
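A minimal sketch of this feature extraction, assuming the low-frequency coefficients are taken as a top-left block. The slides do not specify the selection pattern (zig-zag scanning is also common), and k = 8 here is only an assumption to match the 64 Zernike moments used later.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(mt, k=8):
    """Return the k*k low-frequency 2-D DCT coefficients of a motion template.

    Energy compaction places the dominant coefficients in the top-left
    corner of the coefficient matrix, so a small block already gives a
    compact feature vector.
    """
    coeffs = dctn(np.asarray(mt, dtype=np.float64), type=2, norm='ortho')
    return coeffs[:k, :k].ravel()
```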
Classification
Classification assigns each new feature vector to one of the pre-defined utterances. Two types of classifiers are evaluated:
- generative model: hidden Markov models (HMM)
- discriminative classifier: support vector machines (SVM)
SVM classifier
- Supervised classifiers trained with a learning algorithm derived from statistical learning theory
- Successfully applied to a range of image object recognition tasks
- Can find the optimal separating hyperplane between classes in sparse, high-dimensional spaces with relatively little training data
- SVMs with an RBF kernel are used in the experiments to classify the motion features (Zernike moments and DCT coefficients), as sketched below
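A minimal sketch of this classifier on stand-in data, using scikit-learn in place of the LIBSVM MATLAB toolbox actually used in the experiments; the C and gamma values are placeholders that would normally be tuned by cross-validated grid search.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 280 utterances x 64-dimensional motion features
# (e.g. 64 Zernike moments per MT), labelled with 14 viseme classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(280, 64))
y = rng.integers(0, 14, size=280)

# RBF-kernel SVM; feature scaling helps the RBF kernel behave well.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10.0, gamma='scale'))
clf.fit(X[:200], y[:200])
print('held-out accuracy:', clf.score(X[200:], y[200:]))
```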
HMM classifier
The HMM assumes that the speech signal consists of short segments that are quasi-stationary:
- each HMM state models one of these short, steady segments
- the changes between segments are represented as state transitions in the HMM
- the temporal variation within each segment is modelled statistically
HMM Classifier
- The motion features are assumed to be Gaussian distributed and are modelled with continuous observation densities
- Each phone is modelled as a left-right HMM with 3 states and a diagonal covariance matrix; this structure has been shown to be suitable for modelling English phonemes (Foo & Dong 2002)
- The Baum-Welch algorithm is used during training to re-estimate the HMM parameters
HMM Classifier
Recognition phase:
- compute the likelihood of each HMM producing the test sample
- classify the test sample as the phoneme whose HMM produces the highest likelihood
A sketch of this train-and-score scheme follows.
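A minimal sketch of one-HMM-per-viseme training and maximum-likelihood scoring on stand-in sequences, using hmmlearn in place of the HMM toolbox for MATLAB actually used in the experiments. Note that hmmlearn's default topology is ergodic; the paper's left-right structure would be imposed by constraining the transition matrix.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Stand-in data: for each of 14 visemes, 10 training sequences of
# 20 frames x 8-dimensional feature vectors.
rng = np.random.default_rng(0)
train_seqs = {v: [rng.normal(size=(20, 8)) for _ in range(10)] for v in range(14)}

models = {}
for viseme, seqs in train_seqs.items():
    X = np.vstack(seqs)                 # concatenated observations
    lengths = [len(s) for s in seqs]    # sequence boundaries
    m = GaussianHMM(n_components=3, covariance_type='diag', n_iter=20)
    m.fit(X, lengths)                   # Baum-Welch re-estimation
    models[viseme] = m

# Recognition: pick the viseme whose HMM gives the highest log-likelihood.
test = rng.normal(size=(20, 8))
pred = max(models, key=lambda v: models[v].score(test))
print('predicted viseme:', pred)
```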
[Flowchart: the start of an utterance is detected, the following frames are checked for speaking movement (yes/no), and the end of the utterance is marked when movement stops]
Experiments
- Experiment 1: compare the performance of Zernike moments and DCT features
- Experiment 2: compare the performance of HMM and SVM
Vocabulary
Recognition units: visemes (the basic unit of facial movement during the articulation of a phoneme). The visemes defined in the MPEG-4 standard are used.
Experimental Setup
Video recording and processing:
- recorded using a web camera in an office environment
- frontal view of the mouth of 10 speakers (5 male and 5 female), with a constant view angle
- a total of 2800 utterances recorded as 320x240 AVI files
- frame rate: 30 frames/sec
Experimental Setup
One MT was generated from the grayscale images of each phoneme.
Histogram equalization was applied to the images to reduce the effects of illumination variations. The images were analysed and processed using MATLAB 7.
The LIBSVM toolbox (Chang and Lin 2001) was used to create the SVM classifier, and the HMM toolbox for MATLAB (Murphy 1998) was used to build the HMM classifier.
MT of 14 visemes
Experiments
Zernike moments and DCT coefficients are computed from the MTs. 64 Zernike moments are used to represent each MT, and the same number of DCT coefficients is used to form the DCT feature vectors.
Discussion
- The results demonstrate the efficacy of the proposed motion features for visual speech recognition
- Both DCT features and Zernike moments produce high accuracy in classifying the 14 visemes using SVM
- The proposed technique is demonstrated to be invariant to global changes in illumination
- Zernike moments are demonstrated to be invariant to rotational changes of up to 20 degrees, whereas DCT features are sensitive to such rotation
Discussion
DCT features have better tolerance to image noise than ZM features.
Possible reasons for SVM outperforming HMM in classifying the motion features:
- the STT (spatio-temporal template) representation eliminates the need for temporal modelling
- the training dataset is not large
One possible reason for misclassification is occlusion of the articulators' movements. The accuracies are higher than the 88.5% reported by Foo et al. (2002) using static features on the same vocabulary.
Conclusions
- This research evaluated a novel approach to visual speech recognition using motion templates
- The proposed technique is demonstrated to be useful for English phoneme recognition
- DCT features are found to be sensitive to rotational changes, whereas Zernike moments are rotation invariant
- The motion segmentation technique eliminates the need for temporal modelling of the features for phoneme classification; hence, SVM can be used to recognize the motion features, and it outperforms HMM
- The efficacy of the proposed temporal segmentation approach using motion and appearance information is demonstrated
Future work
- evaluate the proposed visual speech recognition method on different languages
- combine it with audio signals to form an audio-visual speech recognition system
- investigate the application of the proposed technique to speaker recognition and emotion recognition