Separation
1 Introduction
2 Prior Methods
Deep Clustering
3 Proposed Method
Architecture and Methodology
Mask Correction With Audio–Visual Matching
4 Experiments
5 Conclusion
6 References
AIM: To separate individual voices from an audio mixture of multiple simultaneous talkers.
Audio and visual signals of speech (e.g., lip movements) can be used to learn better feature
representations for speech separation.
Automatic Speech Recognition (ASR): The conversion of spoken language into text by
computers.
Multimedia retrieval: To extract semantic information (like words, phrases, signs, and
symbols) from multimedia data sources.
Hearing aids: A hearing aid is a device designed to improve hearing for a person with hearing
loss.
Deep methods usually approach the problem through spectrogram mask estimation.
2 Clustering using supervised training: Predict the assignments of all T-F bins at once, without
the need for frame-by-frame assignment, e.g., DC and uPIT.
Deep Clustering:
Deals with the source permutation problem.
The Deep Clustering method projects each T-F bin of the mixture’s spectrogram into a
high-dimensional space where embeddings of the same speaker can be clustered to form that
speaker’s separation mask.
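As an illustration of this clustering step, here is a minimal NumPy sketch (the function name and the tiny k-means are my own, not the paper's implementation) that groups per-bin embeddings and forms one binary separation mask per speaker:

```python
import numpy as np

def deep_clustering_masks(embeddings, n_speakers=2, n_iters=10, seed=0):
    """Cluster per-T-F-bin embeddings with a tiny k-means and form
    one binary separation mask per speaker (hypothetical helper;
    real embeddings would come from a trained network)."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), n_speakers,
                                    replace=False)].astype(float)
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(n_iters):
        # assign each T-F bin to its nearest embedding center
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :],
                               axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each speaker's center
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = embeddings[labels == k].mean(axis=0)
    # one binary mask per speaker over the T*F bins
    return np.stack([(labels == k).astype(float) for k in range(n_speakers)])

# toy usage: two well-separated "embedding" clusters of 50 bins each
emb = np.concatenate([np.zeros((50, 4)), np.ones((50, 4))])
masks = deep_clustering_masks(emb)
```

Each bin is assigned to exactly one speaker, so the per-bin masks partition the spectrogram.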
Limitation: The permutation problem still exists when the two speakers are of the same gender.
We calculate similarities between the separated audio and visual streams with our proposed
“AV Match” model.
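As one plausible way to score such audio-visual similarity (the paper's exact scoring function is not shown here; the function name is hypothetical), a cosine-similarity sketch:

```python
import numpy as np

def av_similarity(audio_emb, visual_emb):
    """Cosine similarity between an audio-stream embedding and a
    visual-stream embedding (hypothetical scoring function)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_emb / np.linalg.norm(visual_emb)
    return float(a @ v)

# toy usage: aligned vs. orthogonal embeddings
sim_match = av_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
sim_mismatch = av_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

A matched audio/visual pair scores near 1, a mismatched pair near 0.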
Convolutional layer: A “filter” passes over the image, scanning a few pixels at a time and
creating a feature map used to predict the class to which each feature belongs.
Batch Normalisation: Normalises layer inputs so that each layer of a network can learn more
independently of the other layers.
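The normalisation itself can be sketched as follows, assuming per-feature statistics over the batch axis (a simplified training-time version of the real layer; names are hypothetical):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean / unit variance
    per feature, then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# toy usage: a batch of 3 samples with 2 features each
y = batch_norm(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```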
Fully connected input layer: Takes the output of the previous layers, turns them into a
single vector that can be an input for the next stage.
Fully connected output layer: Gives the final probabilities for each label.
Fully connected layer: Takes the inputs from the feature analysis and applies weights to
predict the correct label.
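The fully connected output layer described above can be sketched as a weighted sum followed by softmax (names hypothetical):

```python
import numpy as np

def dense_softmax(x, W, b):
    """Fully connected output layer: weighted inputs plus bias,
    followed by softmax to give a probability for each label."""
    z = W @ x + b
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# toy usage: identity weights, zero bias, 3 labels
probs = dense_softmax(np.array([1.0, 2.0, 3.0]), np.eye(3), np.zeros(3))
```

The outputs are non-negative and sum to 1, so they can be read as label probabilities.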
BLSTM: The output layer can get information from past (backward) and future (forward)
states simultaneously.
BLSTMs do not require the input data to be of fixed length; future input information is
reachable from the current state.
BLSTMs are especially useful for extracting features using the context of the input.
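The bidirectional idea can be sketched with a vanilla RNN in NumPy (a deliberate simplification: real BLSTMs use LSTM cells with gates; all names here are hypothetical):

```python
import numpy as np

def simple_rnn(x, W, U, b, reverse=False):
    """One-direction vanilla RNN over a sequence x of shape (T, D);
    W is hidden-to-hidden, U is input-to-hidden, b is the bias."""
    T = len(x)
    h = np.zeros(W.shape[0])
    out = np.zeros((T, W.shape[0]))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(W @ h + U @ x[t] + b)
        out[t] = h
    return out

def bidirectional_rnn(x, params_f, params_b):
    """Concatenate forward and backward hidden states, so every
    output step sees both past and future context (as in a BLSTM)."""
    fwd = simple_rnn(x, *params_f)
    bwd = simple_rnn(x, *params_b, reverse=True)
    return np.concatenate([fwd, bwd], axis=-1)

# toy usage: sequence length 5, input dim 3, hidden dim 4 per direction
x = np.ones((5, 3))
W = np.full((4, 4), 0.1)
U = np.full((4, 3), 0.2)
b = np.zeros(4)
out = bidirectional_rnn(x, (W, U, b), (W, U, b))
```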
The relative similarity of the audio and visual streams is obtained by applying the triplet loss
during training; the margin is set to m = 1 empirically.
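A minimal sketch of the hinge-style triplet loss with margin m = 1, assuming Euclidean distances between embeddings (the distance actually used in AV Match training may differ):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the matching (positive) embedding
    toward the anchor and push the mismatched (negative) embedding
    at least `margin` farther away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# toy usage: an easy triplet (loss hits zero) and a violated one
loss_easy = triplet_loss(np.array([0.0, 0.0]), np.array([0.1, 0.0]),
                         np.array([5.0, 0.0]))
loss_hard = triplet_loss(np.array([0.0, 0.0]), np.array([5.0, 0.0]),
                         np.array([0.1, 0.0]))
```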
A binary sequence that decides whether to permute the masks predicted by the audio-only
model can be obtained from the AV Match similarity scores.
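As an illustration, here is a minimal NumPy sketch (hypothetical helper, not the paper's code) of using such a binary sequence to swap the two predicted masks frame by frame:

```python
import numpy as np

def correct_permutation(masks, swap):
    """masks: (2, T, F) masks from the audio-only model;
    swap: length-T binary sequence, 1 = swap the two speakers'
    masks in that frame. Returns the permutation-corrected masks."""
    fixed = masks.copy()
    frames = np.where(swap == 1)[0]
    # reverse the speaker axis only at the flagged frames
    fixed[:, frames] = fixed[::-1, frames]
    return fixed

# toy usage: speaker 0 owns everything; frame 1 is flagged for a swap
masks = np.zeros((2, 3, 2))
masks[0] = 1.0
fixed = correct_permutation(masks, np.array([0, 1, 0]))
```

Only the flagged frame changes owner; the other frames are left untouched.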
We carry out experiments on 2-speaker mixtures of the WSJ0 and GRID datasets.
We also quantitatively and qualitatively show the benefits of the proposed AV Match model in
obtaining high performance even for same-gender mixtures.
The proposed AVDC model outperforms the audio-based DC on both the GRID and WSJ0
datasets.
It is clear that the proposed method improves the separation quality by a large margin on the
GRID dataset.
In same-gender mixtures, we can better trace the speakers given visual information of the lip
regions, thus relieving the source permutation problem.
The proposed AV Match model successfully corrects the permutation problem in the masks.
The training procedure of AV Match model is independent of the audio-only separation model.