Separation
1 Introduction
2 Prior Methods
Deep Clustering
3 Proposed Method
Architecture and Methodology
Mask Correction With Audio–Visual Matching
4 Experiments
5 Conclusion
6 References
AIM: To separate individual voices from an audio mixture of multiple simultaneous talkers.
Audio and visual signals of speech (e.g., lip movements) can be used to learn better feature
representations for speech separation.
Automatic Speech Recognition (ASR): The conversion of spoken language into text by
computers.
Multimedia retrieval: To extract semantic information (like words, phrases, signs, and
symbols) from multimedia data sources.
Hearing aids: A hearing aid is a device designed to improve hearing for a person with hearing
loss.
Deep methods usually approach the problem through spectrogram mask estimation.
2 Clustering using supervised training: Predict the assignments of all T-F bins at once, without
the need for frame-by-frame assignment, e.g., DC and uPIT.
Deep Clustering:
Deals with the source permutation problem.
The Deep Clustering method projects each T-F bin of the mixture’s spectrogram into a
high-dimensional space where embeddings of the same speaker can be clustered to form that
speaker’s separation mask.
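As an illustration of this clustering step, here is a minimal NumPy sketch (the function name and the tiny k-means are my own, not the paper's implementation) that groups per-bin embeddings and forms one binary separation mask per speaker:

```python
import numpy as np

def deep_clustering_masks(embeddings, n_speakers=2, n_iters=10, seed=0):
    """Cluster per-T-F-bin embeddings with a tiny k-means and form
    one binary separation mask per speaker (hypothetical helper;
    real embeddings would come from a trained network)."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), n_speakers,
                                    replace=False)].astype(float)
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(n_iters):
        # assign each T-F bin to its nearest embedding center
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :],
                               axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each speaker's center
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = embeddings[labels == k].mean(axis=0)
    # one binary mask per speaker over the T*F bins
    return np.stack([(labels == k).astype(float) for k in range(n_speakers)])

# toy usage: two well-separated "embedding" clusters of 50 bins each
emb = np.concatenate([np.zeros((50, 4)), np.ones((50, 4))])
masks = deep_clustering_masks(emb)
```

Each bin is assigned to exactly one speaker, so the per-bin masks partition the spectrogram.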
Limitation: The permutation problem still exists when the two speakers are of the same gender.
We calculate similarities between the separated audio and visual streams with our proposed
“AV Match” model.
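As one plausible way to score such audio-visual similarity (the paper's exact scoring function is not shown here; the function name is hypothetical), a cosine-similarity sketch:

```python
import numpy as np

def av_similarity(audio_emb, visual_emb):
    """Cosine similarity between an audio-stream embedding and a
    visual-stream embedding (hypothetical scoring function)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = visual_emb / np.linalg.norm(visual_emb)
    return float(a @ v)

# toy usage: aligned vs. orthogonal embeddings
sim_match = av_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
sim_mismatch = av_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

A matched audio/visual pair scores near 1, a mismatched pair near 0.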
Convolutional layer: A “filter” passes over the image, scanning a few pixels at a time and
creating a feature map used to predict the class to which each feature belongs.
Batch Normalisation: Normalises layer inputs so that each layer of a network can learn more
independently of the other layers.
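The normalisation itself can be sketched as follows, assuming per-feature statistics over the batch axis (a simplified training-time version of the real layer; names are hypothetical):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean / unit variance
    per feature, then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# toy usage: a batch of 3 samples with 2 features each
y = batch_norm(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```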
Fully connected input layer: Takes the output of the previous layers, turns them into a
single vector that can be an input for the next stage.
Fully connected output layer: Gives the final probabilities for each label.
Fully connected layer: Takes the inputs from the feature analysis and applies weights to
predict the correct label.
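The fully connected output layer described above can be sketched as a weighted sum followed by softmax (names hypothetical):

```python
import numpy as np

def dense_softmax(x, W, b):
    """Fully connected output layer: weighted inputs plus bias,
    followed by softmax to give a probability for each label."""
    z = W @ x + b
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# toy usage: identity weights, zero bias, 3 labels
probs = dense_softmax(np.array([1.0, 2.0, 3.0]), np.eye(3), np.zeros(3))
```

The outputs are non-negative and sum to 1, so they can be read as label probabilities.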
BLSTM: The output layer can get information from past (backward) and future (forward)
states simultaneously.
BLSTMs do not require the input data to be of fixed length; future input information is
reachable from the current state.
BLSTMs are especially useful for extracting features using the context of the input.
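The bidirectional idea can be sketched with a vanilla RNN in NumPy (a deliberate simplification: real BLSTMs use LSTM cells with gates; all names here are hypothetical):

```python
import numpy as np

def simple_rnn(x, W, U, b, reverse=False):
    """One-direction vanilla RNN over a sequence x of shape (T, D);
    W is hidden-to-hidden, U is input-to-hidden, b is the bias."""
    T = len(x)
    h = np.zeros(W.shape[0])
    out = np.zeros((T, W.shape[0]))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(W @ h + U @ x[t] + b)
        out[t] = h
    return out

def bidirectional_rnn(x, params_f, params_b):
    """Concatenate forward and backward hidden states, so every
    output step sees both past and future context (as in a BLSTM)."""
    fwd = simple_rnn(x, *params_f)
    bwd = simple_rnn(x, *params_b, reverse=True)
    return np.concatenate([fwd, bwd], axis=-1)

# toy usage: sequence length 5, input dim 3, hidden dim 4 per direction
x = np.ones((5, 3))
W = np.full((4, 4), 0.1)
U = np.full((4, 3), 0.2)
b = np.zeros(4)
out = bidirectional_rnn(x, (W, U, b), (W, U, b))
```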
The relative similarity of the audio and visual streams is obtained by applying the triplet loss
during training; the margin is set to m = 1 empirically.
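A minimal sketch of the hinge-style triplet loss with margin m = 1, assuming Euclidean distances between embeddings (the distance actually used in AV Match training may differ):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the matching (positive) embedding
    toward the anchor and push the mismatched (negative) embedding
    at least `margin` farther away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# toy usage: an easy triplet (loss hits zero) and a violated one
loss_easy = triplet_loss(np.array([0.0, 0.0]), np.array([0.1, 0.0]),
                         np.array([5.0, 0.0]))
loss_hard = triplet_loss(np.array([0.0, 0.0]), np.array([5.0, 0.0]),
                         np.array([0.1, 0.0]))
```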
A binary sequence that decides whether to permute the masks predicted by the audio-only
model can be obtained from the AV Match similarity scores.
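As an illustration, here is a minimal NumPy sketch (hypothetical helper, not the paper's code) of using such a binary sequence to swap the two predicted masks frame by frame:

```python
import numpy as np

def correct_permutation(masks, swap):
    """masks: (2, T, F) masks from the audio-only model;
    swap: length-T binary sequence, 1 = swap the two speakers'
    masks in that frame. Returns the permutation-corrected masks."""
    fixed = masks.copy()
    frames = np.where(swap == 1)[0]
    # reverse the speaker axis only at the flagged frames
    fixed[:, frames] = fixed[::-1, frames]
    return fixed

# toy usage: speaker 0 owns everything; frame 1 is flagged for a swap
masks = np.zeros((2, 3, 2))
masks[0] = 1.0
fixed = correct_permutation(masks, np.array([0, 1, 0]))
```

Only the flagged frame changes owner; the other frames are left untouched.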
We carry out experiments on 2-speaker mixtures of the WSJ0 and GRID datasets.
We also quantitatively and qualitatively show the benefits of the proposed AV Match model in
obtaining high performance even for same-gender mixtures.
The proposed AVDC model outperforms the audio-based DC on both the GRID and WSJ0
datasets.
It is clear that the proposed method improves the separation quality by a large margin on the
GRID dataset.
In same-gender mixtures, we can better trace the speakers given visual information of the lip
regions, thus relieving the source permutation problem.
The proposed AV Match model successfully corrects the permutation problem in the masks.
The training procedure of AV Match model is independent of the audio-only separation model.