
Listen and Look: Audio-Visual Matching Assisted Speech Source Separation

Mane Pooja (M190442EC)

July 21, 2020

National Institute of Technology, Calicut


M. Tech, Signal Processing



Overview

1 Introduction

2 Prior Methods
Deep Clustering

3 Proposed Method
Architecture and Methodology
Mask Correction With Audio–Visual Matching

4 Experiments and Results


Separation results on 2-speaker mixtures

5 Conclusion

6 References



Introduction



Speech Source Separation: Flowchart

Figure: Flowchart of Speech Separation Methods



Audio-Visual Match

AV Match is an audio-visual deep learning approach.

Aim: to separate individual voices from an audio mixture of multiple simultaneous talkers.

Audio and visual signals of speech (e.g., lip movements) can be used to learn better feature representations for speech separation.

Alleviates the source permutation problem.



Applications of speech separation:

Automatic Speech Recognition (ASR): The translation of spoken language into text by computers.

Dialogue systems: A conversational agent is a computer system intended to converse with a human.

Multimedia retrieval: Extracting semantic information (such as words, phrases, signs, and symbols) from multimedia data sources.

Hearing aids: A hearing aid is a device designed to improve hearing for a person with hearing loss.



Terms to be known

Speech signal/acoustic signal: A 1-D function of time emerging from a speaker's mouth that carries specific meaning.

Spectrogram: A visual representation of the spectrum of frequencies of a signal as it varies with time (see the sketch after this list).

Source permutation problem: Assigning separated signal snippets to the wrong sources over time.
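To make the spectrogram definition concrete, here is a minimal sketch using SciPy's STFT; the file name and STFT parameters are illustrative assumptions, not taken from the slides.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    fs, x = wavfile.read("mixture.wav")          # hypothetical input file
    x = x.astype(np.float32) / np.abs(x).max()   # normalize to [-1, 1]

    # Short-time Fourier transform: rows are frequency bins, columns are time frames.
    freqs, times, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
    spectrogram = np.abs(Z)                      # magnitude spectrogram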



Prior Methods



Deep Methods:

Deep methods usually approach the problem through spectrogram mask estimation.

1 Classification-based methods: Classify time-frequency (T-F) bins to distinct speakers.

Disadvantage: fail in the speaker-independent case due to the source permutation problem.

2 Clustering using supervised training: Predict and assign the T-F bins all at once, without the need for frame-by-frame assignment, e.g., DC (Deep Clustering) and uPIT.

Disadvantage: assigning T-F bins to sources all at once is the main cause of the permutation problem.
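For context, a minimal, model-agnostic sketch of what spectrogram mask estimation means in practice; the mask here is a placeholder that a separation network would predict:

    import numpy as np
    from scipy.signal import stft, istft

    def apply_tf_mask(mixture, mask, fs=16000, nperseg=512):
        """Apply a time-frequency mask to a mixture and resynthesize one source."""
        _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
        assert mask.shape == Z.shape          # one mask value per T-F bin, in [0, 1]
        _, source = istft(Z * mask, fs=fs, nperseg=nperseg)
        return source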



Audio-based Deep Approach:

Deep Clustering:
Deals with the source permutation problem.

The Deep Clustering method projects each T-F bin of the mixture's spectrogram into a high-dimensional space, where embeddings of the same speaker can be clustered to form that speaker's separation mask (see the sketch below).

Limitation: the permutation problem still exists when the two speakers are of the same gender.
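A minimal sketch of the clustering step, assuming a trained network has already produced one D-dimensional embedding per T-F bin; the shapes and the use of K-means are assumptions consistent with the Deep Clustering literature:

    import numpy as np
    from sklearn.cluster import KMeans

    def embeddings_to_masks(embeddings, spec_shape, n_speakers=2):
        """Cluster per-bin embeddings of shape [T*F, D] into binary T-F masks."""
        labels = KMeans(n_clusters=n_speakers).fit_predict(embeddings)
        return [(labels == k).reshape(spec_shape).astype(np.float32)
                for k in range(n_speakers)]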



Proposed Method



Listen and Look: Audio–Visual Matching Assisted Speech Source
Separation

We design a neural network to learn speaker-independent audio-visual matching.

We use DC as the "audio-only separation model" in our method.

We calculate similarities between the separated audio streams and the visual streams with our proposed "AV Match" model.

We use these similarities to correct masks predicted by the audio-only model.



AV Match Framework

Figure: Proposed AV Match speech separation framework



Methodology

We calculate audio-only embeddings using the audio network (the DC model).

We calculate visual-only embeddings using the visual network.

We find audio-visual embeddings using the AV Match model.

We compute similarities between audio and visual embeddings by inner product (a sketch follows this list).
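A minimal sketch of the inner-product similarity; the L2 normalization and embedding shapes are illustrative assumptions:

    import numpy as np

    def av_similarity(audio_emb, visual_emb):
        """Inner product of L2-normalized embeddings (cosine similarity)."""
        a = audio_emb / np.linalg.norm(audio_emb)
        v = visual_emb / np.linalg.norm(visual_emb)
        return float(np.dot(a, v))   # higher = better audio-visual match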



AV Match Architecture

Figure: Proposed AV Match network architecture

CNN Architecture: Types of Layers

Convolutional layer: A "filter" slides over the image, scanning a few pixels at a time and producing a feature map that records where each learned feature occurs.

Max-pooling layer (downsampling): Reduces the amount of information in each feature map obtained from the convolutional layer while retaining the most important information.

Batch normalization: Normalizes each layer's inputs, allowing each layer of the network to learn more independently of the other layers.

Activation function: The ReLU activation is used to introduce non-linearity.



CNN Architecture: Types of Layers

Fully connected input layer: Takes the output of the previous layers and flattens it into a single vector that can serve as the input for the next stage.

Fully connected output layer: Gives the final probabilities for each label.

Fully connected layer: Takes the inputs from the feature analysis and applies weights to
predict the correct label.
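A minimal PyTorch sketch of the layer types listed on these two slides; the channel counts, kernel size, and the assumed 1x32x32 input are illustrative, not the actual AV Match architecture:

    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer: feature maps
        nn.BatchNorm2d(16),                          # batch normalization
        nn.ReLU(),                                   # ReLU: introduces non-linearity
        nn.MaxPool2d(2),                             # max-pooling (downsampling)
        nn.Flatten(),                                # fully connected input: flatten to a vector
        nn.Linear(16 * 16 * 16, 2),                  # fully connected output layer (2 labels)
        nn.Softmax(dim=1),                           # final probabilities for each label
    )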



Bidirectional Long Short-Term Memory

BLSTM: The output layer can get information from past (backward) and future (forward) states simultaneously.

BLSTMs do not require the input to be of fixed length; future input information is reachable from the current state.

BLSTMs are especially useful for extracting features using the context of the input (see the sketch below).
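A minimal PyTorch sketch; the feature and hidden sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Bidirectional LSTM over a spectrogram-like sequence.
    blstm = nn.LSTM(input_size=129, hidden_size=300, num_layers=2,
                    batch_first=True, bidirectional=True)

    x = torch.randn(8, 100, 129)   # batch of 8, 100 frames, 129 frequency bins
    out, _ = blstm(x)              # out: (8, 100, 600), forward and backward states concatenated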



Mask Correction With Audio–Visual Matching

The AV Match model computes similarities between each audio stream recovered by the predicted masks and each visual stream.

The model is trained with a triplet loss on the relative similarity of matched versus mismatched audio-visual pairs, with the margin set to m = 1 empirically.

From these similarities, a binary decision determines whether the masks predicted by the audio-only model should be permuted (a sketch follows).
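A hedged sketch of these two steps; the exact equations from the slides are not reproduced, and sim[i][j] is an assumed notation for the similarity between separated audio stream i and visual stream j:

    def triplet_loss(sim_pos, sim_neg, m=1.0):
        """Matched AV pairs should score at least margin m above mismatched pairs."""
        return max(0.0, m - sim_pos + sim_neg)

    def should_permute(sim):
        """Permute the two masks if the swapped assignment fits the videos better."""
        return sim[0][1] + sim[1][0] > sim[0][0] + sim[1][1]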



Experiments and Results



Experiments:

We carry out experiments on 2-speaker mixtures of the WSJ0 and GRID datasets.

We also quantitatively and qualitatively show the benefits of the proposed AV Match model in obtaining high performance even for same-gender mixtures.



Separation results on 2-speaker mixtures

Figure: Results on 2-speaker mixtures



Conclusions based on the above table:

The proposed AVDC model outperforms the audio-based DC on both the GRID and WSJ0 datasets.

It is clear that the proposed method improves the separation quality by a large margin on the GRID dataset.

In same-gender mixtures, we can better trace the speakers given visual information of the lip regions, thus relieving the source permutation problem.

We also observe that median filtering improves the performance.



Separation results on 2-speaker mixtures

Figure: Improvement in SDR for different settings of AV Match over the DC baseline



Separation results on 2-speaker mixtures

Figure: Results on the GRID-Extreme dataset



Conclusion




The proposed AV Match model successfully corrects the permutation problem in the masks.

The proposed approach is effective when the performance of audio-only separation is poor.

The training procedure of the AV Match model is independent of the audio-only separation model.



References




J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. 41st Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 31–35.

Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Proc. Interspeech, 2016, pp. 545–549.

A. Torfi, S. M. Iranmanesh, N. Nasrabadi, and J. Dawson, "3D convolutional neural networks for cross audio-visual matching recognition," IEEE Access, vol. 5, pp. 22081–22091, 2017.

J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3444–3453.



Thank You

