Documente Academic
Documente Profesional
Documente Cultură
http://www.merl.com
Abstract
Deep clustering is a recently introduced deep learning architecture that uses discriminatively
trained embeddings as the basis for clustering. It was recently applied to spectrogram segmen-
tation, resulting in impressive results on speaker-independent multi-speaker separation. In
this paper we extend the baseline system with an end-to-end signal approximation objective
that greatly improves performance on a challenging speech separation. We first significantly
improve upon the baseline system performance by incorporating better regularization, larger
temporal context, and a deeper architecture, culminating in an overall improvement in signal
to distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker sep-
aration, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend
the model to incorporate an enhancement layer to refine the signal estimates, and perform
end-to-end training through both the clustering and enhancement stages to maximize signal
fidelity. We evaluate the results using automatic speech recognition. The new signal approx-
imation objective, combined with end-to-end training, produces unprecedented performance,
reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major
advancement towards solving the cocktail party problem.
Interspeech
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in
whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all
such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric
Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all
applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require
a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.
Copyright
c Mitsubishi Electric Research Laboratories, Inc., 2016
201 Broadway, Cambridge, Massachusetts 02139
Single-Channel Multi-Speaker Separation using Deep Clustering
Yusuf Isik1,2,3 , Jonathan Le Roux1 , Zhuo Chen1,4 ,
Shinji Watanabe1 , John R. Hershey1
1
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA
2
Sabanci University, Istanbul, Turkey, 3 TUBITAK BILGEM, Kocaeli, Turkey
4
Columbia University, New York, NY, USA