
PROGRAM AND PROCEEDINGS

IberSPEECH2018
BARCELONA, NOVEMBER 21-23

Wednesday, 21 November
08:00          Registration
09:00-09:20    Opening
09:20-10:40    Oral 1: Speaker Recognition
10:40-11:00    Coffee break
11:00-12:00    Keynote: Tanja Schultz
12:00-13:30    Posters: Topics on Speech Technologies
13:30-15:00    Lunch
15:00-16:40    Oral 2: ASR & Speech Applications
16:40-17:00    Coffee break
17:00-18:40    Oral 3: Speech & Language Tech. Applied to Health
18:45-19:45    RTTH Assembly
20:00          Welcome Reception

Thursday, 22 November
09:00-10:40    Oral 4: Synthesis, Production & Analysis
10:40-11:00    Coffee break
11:00-12:00    Keynote: Rob Clark
12:00-13:30    Special Session: Projects, Demo and PhD thesis
13:30-15:00    Lunch
15:00-16:40    Albayzin Evaluations
16:40-17:00    Coffee break
17:00-19:00    Albayzin Evaluations
20:30          Gala Dinner

Friday, 23 November
09:00-10:40    Oral 5: Text & NLP Applications
10:40-11:00    Coffee break
11:00-12:00    Keynote: Lluís Màrquez
12:00-13:00    Round Table
13:00-13:30    Closing
IberSPEECH2018
NOVEMBER 21-23
BARCELONA
Edited by Antonio Bonafonte, Jordi Luque and Francesc Alías Pujol

At the time of release, the proceedings can be downloaded from:

IberSPEECH2018 website: iberspeech2018.talp.cat


ISCA Archive: https://www.isca-speech.org/archive/Iberspeech_2018/

Barcelona, November 2018


Contents

Welcome Message iii

Committees vi
Organizing Committee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Program Committee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Organizing Institutions xi

Awards xii

Venue xiii

Social Program xvi

Invited Speakers xix


Tanja Schultz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Rob Clark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Lluís Màrquez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii

Technical Program xxv

Papers xxxiii

Author’s Index 307

Welcome Message

Welcome to IberSPEECH2018 in Barcelona, November 21-23rd 2018, co-organized by Tele-


fónica Research, the Center for Language and Speech Technologies and Applications (TALP)
at the Universitat Politècnica de Catalunya (UPC), and the Research Group on Media Tech-
nologies (GTM) at La Salle, Universitat Ramon Llull. The current edition has received invaluable support from the Spanish Thematic Network on Speech Technology (RTTH),
the Cátedra RTVE and the Voice Input Voice Output Laboratory (Vivolab) at Universidad de
Zaragoza, and the ISCA Special Interest Group on Iberian Languages (SIG-IL). In addition, and
for the very first time, IberSPEECH becomes an official ISCA Supported Event.
The IberSPEECH2018 event — the fourth of its kind using this name — brings together the
10th Jornadas en Tecnologías del Habla and the 6th Iberian SLTech Workshop events, aiming
to promote interaction and discussion among junior and senior researchers in the field of
speech and language processing for Iberian languages.
Barcelona is a modern capital of 1.7 million people and the sixth-most populous urban area in
Europe. It is the home of many points of interest declared as World Heritage Sites by UNESCO
such as Sagrada Familia, Park Güell, and Palau de la Musica Catalana, and also the birthplace
for great minds like Antoni Gaudí, Joan Miró, Montserrat Caballé, Eduardo Mendoza, and
many more. Barcelona offers a unique combination of landscapes and weather, coupled
with exquisite gastronomical experiences that are the result of a blend of heritage, produce,
terroir, tradition, creativity, and innovation.
The venue, the Telefónica Diagonal ZeroZero Tower by Spanish architects EMBA, is located at
the very beginning of the famous Diagonal Avenue, which crosses Barcelona diagonally from
the sea to the Llobregat river. The position of the Diagonal ZeroZero Tower is exceptional; it
is very visible from the city and the coast, and it lies on the border between the consolidated
city and the large expanses of public space in the Forum area. In fact, the address of the
tower is Diagonal Avenue, number 0 and is just next to the Forum Building. The Forum area
is next to the sea and the Besòs River end. It is a new business centre, located in a reformed
area. Torre Telefónica hosts several Telefónica Group companies as well as the Telefónica Research group, a leading industrial research lab following an open research model in collaboration with universities and other research institutions. It promotes the dissemination of scientific results through publications in top-tier peer-reviewed international journals and conferences, as well as through technology transfer.
Following the tradition of previous editions, IberSPEECH2018 will be a three-day event,
bringing together the best researchers and practitioners in speech and language technolo-
gies in Iberian languages to promote interaction and discussion. The organizing committee
has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratory activities and recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers.
The core of the scientific program of IberSPEECH2018 includes a total of 37 full regular paper contributions that will be presented in 5 oral sessions and 1 poster session. To ensure the quality of all the contributions, each submitted paper was reviewed by three mem-
bers of the scientific review committee. All the papers in the conference will be accessible
through the International Speech Communication Association (ISCA) Online Archive. Paper
selection was based on the scores and comments provided by the scientific review commit-
tee, which includes over 86 researchers from different institutions (mainly from Spain and
Portugal, but also from France, Germany, Brazil, Slovakia, Ireland, Greece, Hungary, Slovenia,
Austria and United Kingdom).
Furthermore, extended versions of selected papers will be published as a special issue of the journal Applied Sciences, “IberSPEECH 2018: Speech and Language Technologies for Iberian Languages”, published by MDPI with full open access. In addition to regular paper sessions, the IberSPEECH2018 scientific program features the following activities: the ALBAYZIN evaluation challenge session, a special session including the presentation of demos, research projects and recent PhD theses, a round table, and three keynote lectures.
Following the success of the previous ALBAYZIN technology evaluations held since 2006, this year's ALBAYZIN evaluations focus on the multimedia analysis of TV broadcast content. Under the framework of the newly created Cátedra RTVE at Universidad de Zaragoza, we intro-
duce and report on the results of the IberSPEECH-RTVE 2018 Challenge. The Corporación de
Radiotelevisión Española (RTVE) has provided participants with an annotated TV broadcast
database and the necessary tools for the evaluations, promoting the fair and transparent
comparison of technology in different fields related to speech and language technology. It
comprises four different challenge evaluations: Speech to Text Challenge (S2TC), Speaker Di-
arization Challenge (SDC) and Multimodal Diarization Challenge (MDC), organized by RTVE
and Universidad de Zaragoza; and the Search on Speech Challenge (SoSC) jointly organized
by Universidad San Pablo-CEU and AuDIaS from Universidad Autónoma de Madrid with the
support of the ALBAYZIN Committee. Overall, 7 teams participated in the S2TC challenge, 8 teams in the SDC, 3 teams in the MDC, and 3 teams in the SoSC challenge, resulting in 21 system description paper contributions. Additionally, 11 special session papers are also included in the conference program. These were intended to describe either progress in current or recent research and development projects, demonstration systems, or PhD thesis extended abstracts competing for the PhD Award. Furthermore, IberSPEECH2018 features 3 remarkable keynote speakers: Prof. Tanja Schultz (University of Bremen, Germany, and Carnegie Mellon University, Pittsburgh, PA, USA), Dr. Rob Clark (Google, London, UK) and Dr. Lluís Màrquez (Amazon, Barcelona, Spain), whom we would like to thank for their extremely valuable participation.
Moreover, a round table with recognized experts will bring discussion about the role of research and innovation in both academia and industry. Such a symbiosis creates market power by exploring and developing new categories which will eventually become the next blue oceans of our society. However, converting such activities into real businesses, forging strong bonds and devising new products that have a major impact in the real world is a non-trivial task. We expect the round table to be an opportunity for both worlds to find and explore synergies and collaboration.

The social program of IberSPEECH2018 sets sail with the welcome reception at the Escola
d’Enginyeria Barcelona Est (EEBE), at the recently created UPC Diagonal Besòs Campus, next
to the Telefonica Tower. EEBE aims to become a top-quality academic centre in the field of
engineering for the 21st-century industry that is capable of acting as an agent of transfor-
mation at a local and international level. The EEBE was born from the Barcelona College of
Industrial Engineering (EUETIB) and from part of the teaching and research activity in chem-
ical and materials engineering hitherto carried out at the Barcelona School of Industrial En-
gineering (ETSEIB). The gala dinner will be held at Restaurant Marítim, next to the legendary
Barcelona Reial Club Marítim, designed by Lázaro Rosa-Violán from Contemporain Studio,
to provide the aesthetics and flavours of different Mediterranean paradises.
Finally, we would like to thank all those whose effort made this conference possible, including the members of the organizing committee, the local organizing committee, the ALBAYZIN committee, the scientific review committee, the authors, the conference attendees, the supporting institutions, and the many people who gave their best to achieve a successful conference.
Barcelona, November 2018
Jordi Luque, General Chair

Organizing Committee

General Chair:
Jordi Luque, Telefónica Research, Spain
General Co-Chairs:
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain
Francesc Alías Pujol, La Salle – Universitat Ramon LLull, Spain
António Teixeira, Universidade de Aveiro, Portugal

Technical Program Chair:


Javier Hernando Pericás, Universitat Politècnica de Catalunya, Spain
Technical Program Co-Chairs:
Alberto Abad, INESC ID Lisboa / IST Lisboa, Portugal
Xavier Anguera, ELSA Corp., United States
Carlos David Martínez Hinarejos, Universitat Politècnica de València, Spain
Eva Navas, University of the Basque Country, UPV- EHU, Spain
Carlos Segura, Telefónica Research, Spain

Publication Chairs:
Francesc Alías Pujol, La Salle – Universitat Ramon LLull, Spain
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain

Special Session and Awards Chair:
Ascensión Gallardo-Antolín, Universidad Carlos III, Spain

Plenary Talks & Round Table Chairs:


Joan Serrà, Telefónica Research, Spain
Marta Ruiz Costa-Jussà, Universitat Politècnica de Catalunya, Spain

Evaluations Chairs:
Alfonso Ortega, Universidad de Zaragoza, Spain
Eduardo Lleida, Universidad de Zaragoza, Spain
Luis Javier Rodríguez Fuentes, Universidad del País Vasco, Spain

Local Committee:
Francesc Alías Pujol, La Salle – Universitat Ramon LLull, Spain
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain
Jordi Luque, Telefónica Research, Spain
Jordi Pons, Universitat Pompeu Fabra, Spain
Bardia Rafieian, Universitat Politècnica de Catalunya, Spain
Marta Ruiz Costa-Jussà, Universitat Politècnica de Catalunya, Spain
Carlos Segura, Telefónica Research, Spain
Joan Serrà, Telefónica Research, Spain

Scientific Review Committee

Alberto Abad, INESC-IST, Portugal


Francesc Alías, La Salle – Universitat Ramon Llull, Spain
Aitor Alvarez, Vicomtech-IK4, Spain
Xavier Anguera Miró, ELSA Corp., Portugal
Jorge Baptista, INESC-ID Lisboa, Portugal
Plinio Barbosa, University of Campinas, Brazil
Fernando Batista, INESC-ID & ISCTE-IUL, Portugal
José Miguel Benedí, Universitat Politècnica de València, Spain
Antonio Bonafonte, Universitat Politècnica de Catalunya, Spain
Joao Cabral, Trinity College Dublin, Ireland
Francisco Casacuberta, Universitat Politècnica de València, Spain
María José Castro-Bleda, Universitat Politècnica de València, Spain
Ricardo Córdoba, Universidad Politécnica de Madrid, Spain
Conceicao Cunha, IPS Munich, Germany
Carme de-la-Mota, Universitat Autònoma de Barcelona, Spain
Arantza del Pozo, Vicomtech, Spain
Laura Docío-Fernández, University of Vigo, Spain
Daniel Erro, Cirrus Logic, Madrid, Spain
David Escudero, University of Valladolid, Spain
Nicholas Evans, EURECOM, France
Mireia Farrús, Universitat Pompeu Fabra, Spain
Rubén Fernández, Universidad Politécnica de Madrid, Spain
Javier Ferreiros, Universidad Politécnica de Madrid, Spain
Julian Fierrez, Universidad Autonoma de Madrid, Spain
Ascensión Gallardo, Universidad Carlos III de Madrid, Spain
Fernando García Granada, Universitat Politècnica de València, Spain
Carmen García Mateo, University of Vigo, Spain
George Kafentzis, University of Crete, Greece
Omid Ghahabi, EML European Media Laboratory GmbH, Germany
Juan Ignacio Godino-Llorente, Universidad Politécnica de Madrid, Spain

Jon Ander Gómez, Universitat Politècnica de València, Spain
Emilio Granell, Universitat Politècnica de València, Spain
Inma Hernaez, University of the Basque Country (UPV/EHU), Spain
Javier Hernando, Universitat Politècnica de Catalunya, Spain
Lluís-F. Hurtado, Universitat Politècnica de València, Spain
Oliver Jokisch, Leipzig University of Telecommunications (HfTL), Germany
Oscar Koller, Microsoft Germany GmbH, Germany
Eduardo Lleida, University of Zaragoza, Spain
José David Lopes, Heriot-Watt University, UK
Paula López Otero, Universidade da Coruña, Spain
Jordi Luque, Telefónica Research, Spain
Carlos David Martínez Hinarejos, Universitat Politècnica de València, Spain
Helena Moniz, INESC/FLUL, Portugal
Juan Montero, Universidad Politécnica de Madrid, Spain
Nicolás Morales, Nuance Communications GmbH, Germany
Climent Nadeu, Universitat Politècnica de Catalunya, Spain
Juan L. Navarro-Mesa, Universidad de Las Palmas de Gran Canaria, Spain
Eva Navas, University of the Basque Country, Spain
Géza Németh, Budapest University of Technology & Economics, Hungary
Nelson Neto, Universidade Federal do Pará, Brazil
Hermann Ney, RWTH Aachen University, Germany
Alfonso Ortega, University of Zaragoza, Spain
Yannis Pantazis, Foundation for Research and Technology – Hellas (FORTH), Greece
Carmen Peláez-Moreno, Universidad Carlos III de Madrid, Spain
Thomas Pellegrini, Université de Toulouse; IRIT, France
Mikel Penagarikano, University of the Basque Country, Spain
Fernando Perdigao, Institute of Telecommunications (IT), Lisbon, Portugal
José L. Pérez-Córdoba, University of Granada, Spain
Ferran Pla, Universitat Politècnica de València, Spain
Jiri Pribil, Slovak Academy of Sciences, Slovakia
Jorge Proenca, IT – Coimbra, Portugal
Michael Pucher, Acoustics Research Institute, Austria
Paulo Quaresma, Universidade de Evora, Portugal
Ganna Raboshchuk, ELSA Corp., Portugal
Sam Ribeiro, The University of Edinburgh, UK
Eduardo Rodriguez Banga, University of Vigo, Spain
Marta Ruiz Costa-Jussà, Universitat Politècnica de Catalunya, Spain
Luis Javier Rodríguez-Fuentes, Univ. of the Basque Country UPV/EHU, Spain
Rubén San-Segundo, Universidad Politécnica de Madrid, Spain

Jon Sánchez, Aholab – EHU/UPV, Spain
Joan Andreu Sánchez, Universitat Politècnica de València, Spain
Emilio Sanchis, Universitat Politècnica de València, Spain
Diana Santos, University of Oslo, Norway
Ibon Saratxaga, University of the Basque Country, Spain
Encarna Segarra, Universitat Politècnica de València, Spain
Carlos Segura Perales, Telefónica Research, Spain
Joan Serrà, Telefónica Research, Spain
Alberto Simões, 2Ai Lab – IPCA, Portugal
Rubén Solera-Ureña, INESC-ID Lisboa, Portugal
António Teixeira, University of Aveiro, Portugal
Javier Tejedor, Universidad CEU San Pablo, Spain
Doroteo Toledano, Universidad Autónoma de Madrid, Spain
Isabel Trancoso, INESC ID Lisboa / IST, Portugal
Cassia Valentini-Botinhao, The University of Edinburgh, UK
Amparo Varona, University of the Basque Country, Spain
Andrej Zgank, University of Maribor, Slovenia
Catalin Zorila, Toshiba Cambridge Research Laboratory, UK

Organizing Institutions

This conference has been organized by:

with the collaboration of:

IberSPEECH2018 has been partially funded by the project Red Temática en Tecnologías del Habla 2017
(TEC2017-90829-REDT), funded by the Ministerio de Ciencia, Innovación y Universidades.

Awards

Best Paper Award

All regular papers are candidates for this award. The award, given based on the review reports
and the presentation at the conference, grants the authors the publication of an extended
version of their work within the Special Issue of Applied Sciences journal (MDPI) entitled
”IberSPEECH 2018: Speech and Language Technologies for Iberian Languages”.

Best Albayzin evaluation system

Papers submitted to Albayzin evaluation tasks are candidates for these awards. The awards
will be given to the winners of the Albayzin evaluation challenges, in accordance with the
evaluation plan and rules defined for each task.

Best PhD Thesis award

Papers submitted to the PhD Thesis special session are candidates for this award. The award is given based on the decision of a committee formed by the General Chair, the Technical Program Chair and the Special Session and Awards Chair, considering different criteria, including the quality of the document, the impact of the thesis and the clarity of the presentation at the conference.

Professional Career Prize in Speech Technologies

This is an honorary prize awarded by the Spanish Thematic Network on Speech Technology
(RTTH) that recognizes experienced individuals who have made outstanding contributions
related to speech technology research in Spain.

IberSPEECH2018 Edition

• José B. Mariño Acebal, Universitat Politècnica de Catalunya

• Antonio Rubio Ayuso, Universidad de Granada.

Venue

Main conference venue


Torre Telefónica
Address:
Torre Telefónica – Diagonal 00
Plaza de Ernest Lluch i Martín, 5
08019 Barcelona – Spain

WI-FI access instructions


Connect to the IberSPEECH2018 network with the following credentials:

• Network: Iberspeech 2018

• Password: *******

The main body of the conference will be held at Torre Telefónica, on floor 0 (see figure), and in the Auditorium on floor 2, which is reached via the elevators depicted in the figure. The following diagram outlines the main conference areas and services:

Lunch: 21 and 22 November, 13:30.
D’Ins Escola, Restaurant i Càtering
A gastronomic offer adapted to the occasion and a catering service with an added value that enriches it: the social value of the people who work in this service, who take part in a training and job placement program developed by the Fundación Formació i Treball.
Address: Carrer de Ramon Llull, 240, 08930, Sant Adrià del Besòs.
Lunch on Wednesday 21st and Thursday 22nd will be served very close to Torre Telefónica (350 m).

Social Program

Welcome Reception: Wednesday 21 November, 20:00.


Escola d’Enginyeria Barcelona Est (UPC Diagonal Besòs)
Address: Av. Eduard Maristany, 16, 08019, Barcelona
The welcome reception will be held at the Rambla de Colors space in UPC Diagonal Besòs (230 m from Torre Telefónica). The space is located between buildings C and I and can be accessed from the main entrance of either building. Use the number 37400 at the entrance phone to connect with reception and be granted access to the building, then go down to the underground level.

Gala dinner: Thursday 22 November, 20:30.
Restaurant Marítim
Address: Moll d’Espanya, 08039, Barcelona
Telephone: +34 93 221 17 75
Web: www.maritimrestaurant.es
The gala dinner will be held downtown, next to the sea (close to Cristobal Colon statue).

How do I get there?


Take L4 from El Maresme – Fòrum metro station to Barceloneta.
You can also reach the restaurant by walking along the seaside (1 h 15 min).

Invited Speakers

Tanja Schultz received her diploma and doctoral degree in Infor-


matics from the University of Karlsruhe, Germany, in 1995 and 2000, respectively.
Prior to these degrees she completed the state exam in Mathe-
matics, Sports, Physical and Educational Science from Heidelberg
University, Germany, in 1989. She is currently Professor of Cognitive Systems at the University of Bremen, Germany, and adjunct Research Professor at the Language Technologies Institute of Carnegie Mellon, PA, USA. Since 2007, she has directed the Cognitive Systems Lab, where her research activities include multilin-
gual speech recognition and the processing, recognition, and in-
terpretation of biosignals for human-centered technologies and
applications. Prior to joining University of Bremen, she was a Re-
search Scientist at Carnegie Mellon (2000-2007) and a Full Profes-
sor at Karlsruhe Institute of Technology in Germany (2007-2015).
Dr. Schultz is an Associate Editor of ACM Transactions on Asian
Language Information Processing (since 2010), serves on the Ed-
itorial Board of Speech Communication (since 2004), and was
Associate Editor of IEEE Transactions on Speech and Audio Pro-
cessing (2002-2004). She was President (2014-2015) and elected
Board Member (2006-2013) of ISCA, and a General Co-Chair of
Interspeech 2006. She was elevated to Fellow of ISCA (2016) and
to member of the European Academy of Sciences and Arts (2017).
Dr. Schultz was the recipient of the Otto Haxel Award in 2013, the
Alcatel Lucent Award for Technical Communication in 2012, the
PLUX Wireless Biosignals Award in 2011, and the Allen Newell
Medal for Research Excellence in 2002, and received the ISCA /
EURASIP Speech Communication Best paper awards in 2001 and
2015.

Abstract.- Speech is a complex process emitting a wide range of biosignals, including, but
not limited to, acoustics. These biosignals – stemming from the articulators, the articulator
muscle activities, the neural pathways, and the brain itself – can be used to circumvent limi-
tations of conventional speech processing in particular, and to gain insights into the process
of speech production in general. In my talk I will present ongoing research at the Cognitive
Systems Lab (CSL), where we explore a variety of speech-related muscle and brain activities based on machine learning methods with the goal of creating biosignal-based speech
processing devices for communication applications in everyday situations and for speech
rehabilitation, as well as gaining a deeper understanding of spoken communication. Several
applications will be described such as Silent Speech Interfaces that rely on articulatory mus-
cle movement captured by electromyography to recognize and synthesize silently produced
speech, Brain-to-text interfaces that recognize continuously spoken speech from brain activ-
ity captured by electrocorticography to transform it into text, and Brain-to-Speech interfaces
that directly synthesize audible speech from brain signals.

Rob Clark received his PhD from the University of Edinburgh in
2003. His primary interest is in producing engaging synthetic
speech. Before joining Google Rob was at the University of Ed-
inburgh for many years involved in both teaching and research
relating to text-to-speech synthesis. Rob was one of the primary
developers and maintainers of the open source Festival text-to-
speech synthesis system. In 2015 he joined Google where he is
working on text-to-speech synthesis and prosody.

Abstract.- This talk addresses the issue of producing appropriate and engaging text-to-
speech. The speech produced by modern text-to-speech systems is sufficiently intelligible and natural sounding that we are now seeing it widely used in an increasing number of real-world applications. While the speech generated can sound very natural, we
are still a long way from ensuring it always sounds appropriate and engaging in the context
of a particular discourse or dialogue. We present recent work at Google which begins to
address this issue by looking at techniques to generate variation in prosody and speaking
style using latent representations and discuss the problems and challenges that we face in
going further.

Lluís Màrquez is a Principal Applied Scientist at Amazon Re-
search in Barcelona. From 2013 to 2017 he had a Principal Sci-
entist role at the Arabic Language Technologies group from the
Qatar Computing Research Institute (QCRI), and previously, he
was Associate Professor at the Technical University of Catalonia
(UPC, 2000-2013). He holds a university award-winning PhD in
Computer Science from UPC (1999). His research focuses on nat-
ural language understanding by using statistical machine learn-
ing models. He has 150+ papers in Natural Language Processing
and Machine Learning journals and conferences. He has been
General and Program Co-chair of major conferences in the area
(ACL, EMNLP, EACL, CoNLL, *SEM, EAMT, etc.), and held several
organizational roles in ACL and EMNLP too. He was co-organizer
of various international evaluation tasks at Senseval/SemEval
(2004, 2007, 2010, 2015-2017) and CoNLL shared tasks (2004-
2005, 2008-2009). He was Secretary and President of the ACL
special interest group on Natural Language Learning (SIGNLL) in
the period 2007-2011. More recently, he was President-elect and
President of the European Chapter of the ACL (EACL; 2013-2016)
and member of the ACL Executive Committee (2015-2016). Lluís Màrquez has been Guest Editor of special issues of Computational Linguistics, LRE, JNLE, and JAIR in the period 2007-2015.
He has participated in 16 national and EU research projects, and
2 projects with technology transfer to the industry, acting as the
principal site researcher in 10 of them, helping companies embed
AI in their business.

Abstract.- Automatic Question Answering (Q&A), i.e., the task of building computer pro-
grams that are able to answer questions posed in natural language, has a long tradition in the
fields of Natural Language Processing and Information Retrieval. In recent years, Q&A appli-
cations have had a tremendous impact in industry and they are ubiquitous (e.g., embedded
in any of the personal assistants that are in the market, Siri, Alexa, Cortana, Google Assistant,
etc.). At the same time, we have witnessed a renewed interest in the scientific community, as
Q&A has become one of the paradigmatic tasks for assessing the ability of machines to com-
prehend text. A plethora of corpora, resources and systems have blossomed and flooded the
community in the last three years. These systems can do very impressive things, for instance,
finding answers to open ended questions in long text contexts with super-human accuracy,
or answering complex questions about images, by mixing the two modalities. As in many
other fields, these state-of-the-art systems are implemented using machine learning in the
form of neural networks (deep learning). The new AI, of course. But do these Q&A systems
really understand what they read? In simpler words, do they provide the right answers
for the right reasons? Several recent studies have shown that QA systems are actually very
brittle. They generalize badly and they fail miserably when presented with simple adversar-
ial examples. The machine learning algorithms are very good at picking up all the biases and
artefacts in the corpora, and they learn to find answers based on shallow text properties and
pattern matching. But they do not show many understanding or reasoning abilities, after all.
Following this serious setback, there is a new push in the community for carefully designing
more complex and bias-free datasets, and more robust and explainable systems. Hopefully,
this will lead to a new generation of smarter and more useful Q&A engines in the near future.
In this talk, I will overview the present and the future of Question Answering by going over
all the aforementioned topics.

Technical Program

Speaker Recognition
Wednesday, 21 November 2018, 09:20 – 10:40
Chair: Xavier Anguera, ELSA Corp.

Differentiable Supervector Extraction for Encoding Speaker and Phrase Informa-


tion in Text Dependent Speaker Verification 1
Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

Phonetic Variability Influence on Short Utterances in Speaker Verification 6


Ignacio Viñals, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Restricted Boltzmann Machine Vectors for Speaker Clustering 10


Umair Khan, Pooyan Safari, Javier Hernando

Speaker Recognition under Stress Conditions 15


Esther Rituerto-González, Ascensión Gallardo-Antolín, Carmen Peláez-Moreno

Topics on Speech Technologies


Wednesday, 21 November 2018 12:00 – 13:30
Chair: Alberto Abad, INESC-ID/IST

Bilingual Prosodic Dataset Compilation for Spoken Language Translation 20


Alp Öktem, Mireia Farrús, Antonio Bonafonte

Building an Open Source Automatic Speech Recognition System for Catalan 25


Baybars Külebi, Alp Öktem

Multi-Speaker Neural Vocoder 30


Oriol Barbany, Antonio Bonafonte, Santiago Pascual

Improving the Automatic Speech Recognition through the improvement of


Language Models 35
Andrés Piñeiro-Martín, Carmen García-Mateo, Laura Docío-Fernández

Towards expressive prosody generation in TTS for reading aloud applications 40
Monica Dominguez, Alicia Burga, Mireia Farrús, Leo Wanner

Performance evaluation of front- and back-end techniques for ASV spoofing de-
tection systems based on deep features 45
Alejandro Gomez-Alanis, Antonio M. Peinado, José Andrés González López, Angel
M. Gomez

The observation likelihood of silence: analysis and prospects for VAD applications 50
Igor Odriozola, Inma Hernaez, Eva Navas, Luis Serrano, Jon Sanchez

On the use of Phone-based Embeddings for Language Recognition 55


Christian Salamea, Ricardo de Córdoba, Luis Fernando D’Haro, Rubén San-
Segundo, Javier Ferreiros

End-to-End Speech Translation with the Transformer 60


Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, Marta R. Costa-Jussà

Audio event detection on Google’s Audio Set database: Preliminary results using
different types of DNNs 64
Javier Darna-Sequeiros, Doroteo T. Toledano

Emotion Detection from Speech and Text 68


Mikel de Velasco, Raquel Justo, Josu Antón, Mikel Carrilero, M. Inés Torres

Experimental Framework Design for Sign Language Automatic Recognition 72


Darío Tilves Santiago, Ian Benderitter, Carmen García-Mateo

Baseline Acoustic Models for Brazilian Portuguese Using Kaldi Tools 77


Cassio Batista, Ana Larissa Dias, Nelson Sampaio Neto

ASR & Speech Applications
Wednesday, 21 November 2018, 15:00 – 16:40
Chair: Carmen García Mateo, University of Vigo

Converted Mel-Cepstral Coefficients for Gender Variability Reduction in Query-


by-Example Spoken Document Retrieval 82
Paula López Otero, Laura Docío-Fernández

A Recurrent Neural Network Approach to Audio Segmentation for Broadcast Do-


main Data 87
Pablo Gimeno, Ignacio Viñals, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Improving Transcription of Manuscripts with Multimodality and Interaction 92


Emilio Granell, Carlos David Martinez Hinarejos, Verónica Romero

Improving Pronunciation of Spanish as a Foreign Language for L1 Japanese


Speakers with Japañol CAPT Tool 97
Cristian Tejedor-García, Valentín Cardeñoso-Payo, María J. Machuca, David
Escudero-Mancebo, Antonio Ríos, Takuya Kimura

Exploring E2E speech recognition systems for new languages 102


Conrad Bernath, Aitor Alvarez, Haritz Arzelus, Carlos David Martínez

Speech & Language Technologies Applied to Health
Wednesday, 21 November 2018, 17:00 – 18:40
Chair: Mireia Farrús, Universitat Pompeu Fabra

Listening to Laryngectomees: A study of Intelligibility and Self-reported Listening


Effort of Spanish Oesophageal Speech 107
Sneha Raman, Inma Hernaez, Eva Navas, Luis Serrano

Towards an automatic evaluation of the prosody of people with Down syndrome 112
Mario Corrales-Astorgano, Pastora Martínez-Castilla, David Escudero-Mancebo,
Lourdes Aguilar, César González-Ferreras, Valentín Cardeñoso-Payo

Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial


Networks 117
Santiago Pascual, Antonio Bonafonte, Joan Serrà, José Andrés González López

LSTM based voice conversion for laryngectomees 122


Luis Serrano, David Tavarez, Xabier Sarasola, Sneha Raman, Ibon Saratxaga, Eva
Navas, Inma Hernaez

Sign Language Gesture Classification using Neural Networks 127


Zuzanna Parcheta, Carlos David Martinez Hinarejos

Synthesis, Production & Analysis


Thursday, 22 November 2018, 09:00 – 10:40
Chair: Francesc Alías Pujol, La Salle - Universitat Ramon Llull

Influence of tense, modal and lax phonation on the three-dimensional finite ele-
ment synthesis of vowel [A] 132
Marc Freixes, Marc Arnela, Joan Claudi Socoró, Francesc Alías Pujol, Oriol Guasch

Exploring Advances in Real-time MRI for Speech Production Studies of European


Portuguese 137
Conceicao Cunha, Samuel Silva, António Teixeira, Catarina Oliveira, Paula Martins,
Arun Joseph, Jens Frahm

A postfiltering approach for dual-microphone smartphones 142


Juan M. Martín-Doñas, Iván López-Espejo, Angel M. Gomez, Antonio M. Peinado

Speech and monophonic singing segmentation using pitch parameters 147


Xabier Sarasola, Eva Navas, David Tavarez, Luis Serrano, Ibon Saratxaga

Self-Attention Linguistic-Acoustic Decoder 152


Santiago Pascual, Antonio Bonafonte, Joan Serrà

Special Session
Thursday, 22 November 2018 12:00 – 13:30
Chair: Ricardo de Córdoba, Universidad Politécnica de Madrid

Show & Tell

Japañol: a mobile application to help improving Spanish pronunciation by


Japanese native speakers 157
Cristian Tejedor-García, Valentín Cardeñoso-Payo, David Escudero-Mancebo

Ongoing Research Projects

Towards the Application of Global Quality-of-Service Metrics in Biometric Sys-


tems 159
Juan Manuel Espín, Roberto Font, Juan Francisco Inglés-Romero, Cristina Vicente-
Chicote

Incorporation of a Module for Automatic Prediction of Oral Productions Quality


in a Learning Video Game 161
David Escudero-Mancebo, Valentín Cardeñoso-Payo

Silent Speech: Restoring the Power of Speech to People whose Larynx has been
Removed 163
José Andrés González López, Phil D. Green, Damian Murphy, Amelia Gully, James
M. Gilbert

RESTORE Project: REpair, STOrage and REhabilitation of speech 166


Inma Hernaez, Eva Navas, Jose Antonio Municio Martín, Javier Gomez Suárez

Corpus for Cyberbullying Prevention 170


Asuncion Moreno, Antonio Bonafonte, Igor Jauk, Laia Tarrés, Victor Pereira

EMPATHIC, Expressive, Advanced Virtual Coach to Improve Independent Healthy-


Life-Years of the Elderly 172
M. Inés Torres, Gérard Chollet, César Montenegro, Jofre Tenorio-Laranga, Olga
Gordeeveva, Anna Esposito, Cornelius Glackin, Stephan Schlögl, Olivier Deroo,
Begoña Fernández-Ruanova, Riberto Santana, Maria S. Kornes, Fred Lindner, Daria
Kyslitska, Miriam Reiner, Gennaro Cordasco, Mari Aksnes, Raquel Justo

PhD Thesis

Advances on the Transcription of Historical Manuscripts based on Multimodality,


Interactivity and Crowdsourcing 174
Emilio Granell, Carlos David Martinez Hinarejos, Verónica Romero

Bottleneck and Embedding Representation of Speech for DNN-based Language
and Speaker Recognition 179
Alicia Lozano-Diez, Joaquin Gonzalez-Rodriguez, Javier Gonzalez-Dominguez

Deep Learning for i-Vector Speaker and Language Recognition: A Ph.D. Thesis
Overview 184
Omid Ghahabi

Unsupervised Learning for Expressive Speech Synthesis 189


Igor Jauk

Albayzin Evaluation
Thursday, 22 November 2018 15:00 – 16:40
Chair: Alfonso Ortega & Eduardo Lleida, Universidad de Zaragoza

Multimodal Diarization Challenge

ODESSA/PLUMCOT at Albayzin Multimodal Diarization Challenge 2018 194


Benjamin Maurice, Hervé Bredin, Ruiqing Yin, Jose Patino, Héctor Delgado, Claude
Barras, Nicholas Evans, Camille Guinaudeau

UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge 199
Miquel Angel India Massana, Itziar Sagastiberri, Ponç Palau, Elisa Sayrol, Josep
Ramon Morros, Javier Hernando

The GTM-UVIGO System for Audiovisual Diarization 204


Eduardo Ramos-Muguerza, Laura Docío-Fernández, José Luis Alba-Castro

Speaker Diarization Challenge

The SRI International STAR-LAB System Description for IberSPEECH-RTVE 2018


Speaker Diarization Challenge 208
Diego Castan, Mitchell McLaren, Mahesh Kumar Nandwana

ODESSA at Albayzin Speaker Diarization Challenge 2018 211


Jose Patino, Héctor Delgado, Ruiqing Yin, Hervé Bredin, Claude Barras, Nicholas
Evans

EML Submission to Albayzin 2018 Speaker Diarization Challenge 216


Omid Ghahabi, Volker Fischer

In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge 220
Ignacio Viñals, Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

DNN-based Embeddings for Speaker Diarization in the AuDIaS-UAM System for
the Albayzin 2018 IberSPEECH-RTVE Evaluation 224
Alicia Lozano-Diez, Beltran Labrador, Diego de Benito, Pablo Ramirez, Doroteo T.
Toledano

CENATAV Voice-Group Systems for Albayzin 2018 Speaker Diarization Evaluation


Campaign 227
Edward L. Campbell, Gabriel Hernandez, José R. Calvo de Lara

The Intelligent Voice System for the IberSPEECH-RTVE 2018 Speaker Diarization
Challenge 231
Abbas Khosravani, Cornelius Glackin, Nazim Dugan, Gérard Chollet, Nigel Can-
nings

JHU Diarization System Description 236


Zili Huang, L. Paola García-Perera, Jesús Villalba, Daniel Povey, Najim Dehak

Search on Speech Challenge

GTM-IRLab Systems for Albayzin 2018 Search on Speech Evaluation 240


Paula López Otero, Laura Docío-Fernández

AUDIAS-CEU: A Language-independent approach for the Query-by-Example


Spoken Term Detection task of the Search on Speech ALBAYZIN 2018 evaluation 245
Maria Cabello, Doroteo T. Toledano, Javier Tejedor

GTTS-EHU Systems for the Albayzin 2018 Search on Speech Evaluation 249
Luis J. Rodríguez-Fuentes, Mikel Peñagarikano, Amparo Varona, Germán Bordel

Cenatav Voice Group System for Albayzin 2018 Search on Speech Evaluation 254
Ana R. Montalvo, Jose M. Ramirez, Alejandro Roble, Jose R. Calvo

Speech to Text Challenge

MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE
2018 Speech-to-Text Transcription Challenge 257
Javier Jorge, Adrià Martínez-Villaronga, Pavel Golik, Adrià Giménez, Joan Albert
Silvestre-Cerdà, Patrick Doetsch, Vicent Andreu Císcar, Hermann Ney, Alfons Juan,
Albert Sanchis

Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program tran-


scription 262
Juan M. Perero-Codosero, Javier Antón-Martín, Daniel Tapias Merino, Eduardo
López-Gonzalo, Luis A. Hernández-Gómez

The Vicomtech-PRHLT Speech Transcription Systems for the IberSPEECH-RTVE
2018 Speech to Text Transcription Challenge 267
Haritz Arzelus, Aitor Alvarez, Conrad Bernath, Eneritz García, Emilio Granell, Carlos
David Martinez Hinarejos

Intelligent Voice ASR system for Iberspeech 2018 Speech to Text Transcription
Challenge 272
Nazim Dugan, Cornelius Glackin, Gérard Chollet, Nigel Cannings

The GTM-UVIGO System for Albayzin 2018 Speech-to-Text Evaluation 277


Laura Docío-Fernández, Carmen García-Mateo

Text & NLP Applications


Friday, 23 November 2018, 09:00 – 10:40
Chair: José F. Quesada, Universidad de Sevilla

Topic coherence analysis for the classification of Alzheimer’s disease 281


Anna Pompili, Alberto Abad, David Martins de Matos, Isabel Pavão Martins

Building a global dictionary for semantic technologies 286


Iklódi Eszter, Gábor Recski, Gábor Borbély, Maria Jose Castro-Bleda

TransDic, a public domain tool for the generation of phonetic dictionaries in stan-
dard and dialectal Spanish and Catalan 291
Juan-María Garrido, Marta Codina, Kimber Fodge

Wide Residual Networks 1D for Automatic Text Punctuation 296


Jorge Llombart, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

End-to-End Multi-Level Dialog Act Recognition 301


Eugénio Ribeiro, Ricardo Ribeiro, David Martins de Matos

Papers

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Differentiable Supervector Extraction for Encoding Speaker and Phrase


Information in Text Dependent Speaker Verification
Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{vmingote,amiguel,ortega,lleida}@unizar.es

DOI: 10.21437/IberSPEECH.2018-1

Abstract

In this paper, we propose a new differentiable neural network alignment mechanism for text-dependent speaker verification which uses alignment models to produce a supervector representation of an utterance. Unlike previous works with similar approaches, we do not extract the embedding of an utterance from the mean reduction of the temporal dimension. Our system replaces the mean by a phrase alignment model to keep the temporal structure of each phrase, which is relevant in this application since the phonetic information is part of the identity in the verification task. Moreover, we can apply a convolutional neural network as front-end, and, thanks to the alignment process being differentiable, we can train the whole network to produce a supervector for each utterance which will be discriminative with respect to the speaker and the phrase simultaneously. As we show, this choice has the advantage that the supervector encodes the phrase and speaker information, providing good performance in text-dependent speaker verification tasks. In this work, the process of verification is performed using a basic similarity metric, due to simplicity, compared to other more elaborate models that are commonly used. The new model using alignment to produce supervectors was tested on the RSR2015-Part I database for text-dependent speaker verification, providing competitive results compared to similar size networks using the mean to extract embeddings.

Index Terms: Text-Dependent Speaker Verification, HMM Alignment, Deep Neural Networks, Supervectors

1. Introduction

Recently, techniques based on discriminative deep neural networks (DNN) have achieved substantial success in many speaker verification tasks. These techniques follow the philosophy of the state-of-the-art face verification systems [1][2], where embeddings are usually extracted by reduction mechanisms and the decision process is based on a similarity metric [3]. Unfortunately, in text-dependent tasks this approach does not work efficiently, since the pronounced phrase is part of the identity information [4][5]. A possible cause of the imprecision in text-dependent tasks could be derived from using the mean as a representation of the utterance, as we show in the experimental section. To solve this problem, this paper shows a new architecture which combines a deep neural network with a phrase alignment method used as a new internal layer to maintain the temporal structure of the utterance. As we will show, it is a more natural solution for text-dependent speaker verification, since the speaker and phrase information can be encoded in the supervector thanks to the neural network and the specific states of the supervector.

In the context of text-independent speaker verification tasks, the baseline system based on i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA) [6][7] is still among the best results of the state of the art. The i-vector extractor represents each utterance in a low-dimensional subspace called the total variability subspace as a fixed-length feature vector, and the PLDA model produces the verification scores. However, as we previously mentioned, many improvements on this baseline system have been achieved in recent years by progressively substituting components of the systems by DNNs, thanks to their larger expressiveness and the availability of bigger databases. Examples of this are the use of DNN bottleneck representations as features, replacing or combined with the spectral parametrization [8], training DNN acoustic models to use their outputs as posteriors for alignment instead of GMMs in i-vector extractors [9], or replacing PLDA by a DNN [10]. Other proposals, similar to face verification architectures, have been more ambitious and have trained a discriminative DNN for multiclass classification and then extracted embeddings by reduction mechanisms [11][12], for example taking the mean of an intermediate layer, usually named the bottleneck layer. After that embedding extraction, the verification score is obtained by a similarity metric such as cosine similarity [11].

The application of DNNs and the same techniques as in text-independent models to text-dependent speaker verification tasks has produced mixed results. On the one hand, specific modifications of the traditional techniques have been shown successful for text-dependent tasks, such as i-vector+PLDA [13], DNN bottlenecks as features for i-vector extractors [14] or posterior probabilities for i-vector extractors [14][15]. On the other hand, speaker embeddings obtained directly from a DNN have provided good results in tasks with large amounts of data and a single phrase [16], but they have not been as effective in tasks with more than one pass phrase and smaller database sizes [4][5]. The lack of data in this last scenario may lead to problems with deep architectures due to overfitting of the models.

Another reason that we explore in the paper for the lack of effectiveness of these techniques in general text-dependent tasks is that the phonetic content of the uttered phrase is relevant for the identification. State-of-the-art text-independent approaches to obtain speaker embeddings from an utterance usually reduce the temporal information by pooling and by calculating the mean across frames of the internal representations of the network. This approach may neglect the order of the phonetic information, because in the same phrase the beginning of the sentence may be totally different from what is said at the end. An example of this is the case when the system asks the speaker to utter digits in some random order. In that case, a mean vector would fail to capture the combination of phrase and speaker. Therefore, one of the objectives of the paper is to show that it is important to keep this phrase information for the identification process, not just the information of who is speaking.

In previous works we have developed systems that need to store a model per user, adapted from a universal background model, where the evaluation of the trial was based on a likelihood ratio [17][18]. One of the drawbacks of this approach is the need to store a large amount of data per user and the speed of evaluation of trials, since the likelihood expressions were dependent on the frame length. In this paper, we focus on systems using a vector representation of a trial or a speaker model. We propose a new approach that includes alignment as a key component of the mechanism to obtain the vector representation from a deep neural network. Unlike previous works, we substitute the mean of the internal representations across time, which is used in other neural network architectures [4][5], by a frame-to-state alignment to keep the temporal structure of each utterance. We show how the alignment can be applied in combination with a DNN acting as a front-end to create a supervector for each utterance. As we will show, the application of both sources of information in the process of defining the supervector provides better results in the experiments performed on RSR2015 compared to previous approaches.

This paper is organized as follows. In Section 2 we present our system and especially the alignment strategy developed. Section 3 presents the experimental data. Section 4 explains the results achieved. Conclusions are presented in Section 5.

2. Deep neural network based on alignment

In view of the aforementioned imprecisions in the results achieved in previous works for this task with only DNNs and a basic similarity metric, we decided to apply an alignment mechanism due to the importance of the phrases and their temporal structure in this kind of task. Since the same person does not always pronounce a phrase at the same speed or in the same way, due to differences in the phonetic information, it is usual that there exists an articulation and pronunciation mismatch between two compared speech utterances, even from the same person.

In Fig. 1 we show the overall architecture of our system, where the mean reduction to obtain the vector embedding before the backend is substituted by the alignment process to finally create a supervector per audio file. This supervector can be seen as a mapping between an utterance and the state components of the alignment, which allows encoding the phrase information. For the verification process, once our system is trained, one supervector is extracted for each enroll and test file, and then a cosine metric is applied over them to obtain the verification scores.

Figure 1: Differentiable neural network alignment mechanism based on alignment models. The supervector is composed of Q vectors s_q, one for each state.

2.1. Alignment mechanism

In this work, we select a Hidden Markov Model (HMM) as the alignment technique in all the experiments, but other possibilities could be to select Gaussian Mixture Model (GMM) or DNN posteriors. In text-dependent tasks we know the phrase transcription, which allows us to construct a specific left-to-right HMM model for each phrase of the data and obtain a Viterbi alignment per utterance.

One reason to employ a phrase HMM alignment was its simplicity: independent HMM models can be trained for the different phrases used in our experiments without the need of phonetic information for training. Another reason was that, using the decoded sequence provided by the Viterbi algorithm in a left-to-right architecture, it is ensured that each state of the HMM corresponds to at least one frame of the utterance, so no state is empty.

The process followed to add this alignment to our system is detailed below. Once the models for alignment are trained, a sequence of decoded states γ = (q_1, ..., q_T), where q_t indicates the decoded state at time t with q_t ∈ {1, ..., Q}, is obtained. Before adding these vectors to the neural network, they are preprocessed and converted into a matrix of ones and zeros according to their correspondence with the states, which makes it possible to use them directly inside the neural network. In this way, we put a one at each state for the frames that belong to that state; as a result of this process, we have the alignment matrix A ∈ R^(T×Q) with components a_{t,q_t} = 1 and Σ_q a_{tq} = 1, which means that only one state is active at each time.

For example, if we train an HMM model with 4 states, obtain a vector γ and apply the previous transformation, the resultant matrix A would be:

    γ = [1, 1, 1, 2, 2, 3, 3, 4]  →  A = [ 1 0 0 0
                                           1 0 0 0
                                           1 0 0 0
                                           0 1 0 0
                                           0 1 0 0
                                           0 0 1 0
                                           0 0 1 0
                                           0 0 0 1 ]        (1)

Figure 2: Process of alignment: the input signal x is multiplied by an alignment matrix A to produce a matrix with vectors s_q which are then concatenated to obtain the supervector.

After this process, as we show in Fig. 2, we add this matrix to the network as a matrix multiplication, like one more layer; thanks to its expression as a matrix product, it is easy to differentiate, which enables backpropagating gradients to train the neural network as usual. This matrix multiplication assigns the corresponding frames to each state, resulting in a supervector. Then, the speaker verification is performed with this supervector. The alignment as a matrix multiplication can be expressed as a function of the input signal to this layer, x_{ct}, with dimensions (c × t), and the alignment matrix of each utterance, A, with dimensions (t × q):

    s_{cq} = ( Σ_t x_{ct} · a_{tq} ) / ( Σ_t a_{tq} )        (2)

where s_{cq} is the supervector with dimensions (c × q): there are q state vectors of dimension c, and we normalize by the number of times state q is activated.
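As an illustration for readers, the alignment layer of Eqs. (1) and (2) can be written in a few lines of PyTorch, the toolkit the authors report using in Section 4. The sketch below is ours, not the original implementation; the function and variable names are hypothetical and the states are 0-indexed:

import torch
import torch.nn.functional as F

def alignment_supervector(x: torch.Tensor, states: torch.Tensor, num_states: int) -> torch.Tensor:
    """x: (c, t) front-end output; states: (t,) decoded HMM states in {0, ..., Q-1}."""
    # Eq. (1): one-hot alignment matrix A of shape (t, Q), one active state per frame.
    A = F.one_hot(states, num_classes=num_states).to(x.dtype)
    # Eq. (2): sum the frames assigned to each state and normalize by the state counts.
    s = (x @ A) / A.sum(dim=0).clamp(min=1.0)          # (c, Q)
    return s.flatten()                                 # supervector of size c*Q

# Toy usage mirroring the example of Eq. (1): 8 frames, 4 states (0-indexed here).
gamma = torch.tensor([0, 0, 0, 1, 1, 2, 2, 3])
x = torch.randn(60, 8, requires_grad=True)             # e.g. 60-dim MFCC+delta features
supervector = alignment_supervector(x, gamma, num_states=4)
supervector.sum().backward()                           # gradients flow back to the front-end

Because A is built from the decoded states and treated as a constant, gradients only flow through x, which is what allows the alignment to act as an internal layer while the whole network is trained end-to-end, as described above.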
2.2. Deep neural network architecture

As a first approximation, to check that the previous alignment layer works better than extracting the embedding from the mean reduction, we apply this layer directly on the input signal, over the acoustic features; thus we obtain the traditional supervector. However, we expect to improve this baseline result, so we propose to add some layers as a front-end previous to the alignment layer and train them in combination with the alignment mechanism.

For deep speaker verification some simple architectures with only dense layers [4] have been proposed. Lately, deeper neural networks such as Residual CNN Networks [5] have also been tried, but in the text-dependent task they have not achieved the same good results as the previous simple approaches.

In our network we propose a straightforward architecture with only a few layers which includes the use of 1-dimensional convolution (1D convolution) layers instead of dense layers or 2D convolution layers as in other works. Our proposal is to operate in the temporal dimension to add context information to the process while, at the same time, the channels are combined at each layer. The context information which is added depends on the size of the kernel used in the convolution layer.

To use this type of layer, it is convenient that the input signals have the same size so that they can be concatenated and passed to the network. For this reason, we apply a transformation to interpolate or zero-pad the input signals so that all of them have the same dimensions.

Figure 3: Operation with 1D Convolution layers. (a) Operation with 1D Convolution: general pipeline of this operation. (b) Example of the convolution operation: k context frames from the input are multiplied by the weight matrix W and the output is equivalent to a linear combination of convolutions.

The operation of the 1D convolution layers is depicted in Fig. 3: the signal used as layer input and its context, the previous frames and the subsequent frames, are multiplied frame by frame with the corresponding weights. The result of this operation for each frame is linearly combined to create the output signal.
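To make this concrete, the following is a minimal sketch, assuming PyTorch, of a front-end in this spirit: a stack of 1D convolutions over time with kernel size 3 that mixes the channels at every layer, followed by the alignment operation of Eq. (2). The class name, hidden sizes, activation and padding choice are our assumptions, not details taken from the paper:

import torch
import torch.nn as nn

class Conv1dFrontEnd(nn.Module):
    def __init__(self, in_channels=60, hidden_channels=256, num_layers=3, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers += [
                # "same" padding keeps the number of frames unchanged, so the frame-to-state
                # alignment matrix A (t x Q) can still be applied after the front-end
                nn.Conv1d(in_channels if i == 0 else hidden_channels,
                          hidden_channels, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x, A):
        """x: (batch, c, t) features; A: (batch, t, Q) one-hot alignment matrices."""
        h = self.net(x)                                                   # (batch, hidden, t)
        s = torch.bmm(h, A) / A.sum(dim=1, keepdim=True).clamp(min=1.0)   # (batch, hidden, Q)
        return s.flatten(start_dim=1)                                     # (batch, hidden*Q)

# Shape check: 60-dim features, 200 frames (illustrative), 40 HMM states per phrase.
frontend = Conv1dFrontEnd()
x = torch.randn(8, 60, 200)
A = torch.zeros(8, 200, 40).scatter_(2, torch.randint(0, 40, (8, 200, 1)), 1.0)  # random one-hot, demo only
print(frontend(x, A).shape)   # torch.Size([8, 10240])

With a kernel of dimension 1, each convolution reduces to a per-frame dense layer, which corresponds to the first front-end configuration compared in Section 4.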
3. Experimental Data

In all the experiments in this paper, we used the RSR2015 text-dependent speaker verification dataset [19]. This dataset consists of recordings from 157 male and 143 female speakers. There are 9 sessions for each speaker, pronouncing 30 different phrases. Furthermore, this data is divided into three speaker subsets: background (bkg), development (dev) and evaluation (eval). We develop our experiments on Part I of this dataset and we employ the bkg and dev data (194 speakers, 94 female/100 male) for training. The evaluation part is used for enrollment and trial evaluation.

4. Results

In our experiments, we do not need the phrase transcription to obtain the corresponding alignment, because one phrase-dependent HMM model has been trained with the background partition using a left-to-right model of 40 states for each phrase. With these models we can extract statistics from each utterance of the database and use this alignment information inside our DNN architecture. As input to the DNN, we employ 20-dimensional Mel-Frequency Cepstral Coefficients (MFCC) with their first and second derivatives as features, obtaining a final input dimension of 60. On these input features we apply a data augmentation method called Random Erasing [20], which helps us to avoid overfitting in our models due to the lack of data in this database.
In our experiments, we do not need the phrase transcrip- se both simultaneously. Furthermore, we show how changing
tion to obtain the corresponding alignment, because one phrase the typical mean reduction for a new alignment layer inside the
dependent HMM model has been trained with the background DNN achieves a relative improvement of 91.62 % in terms of
partition using a left-to-right model of 40 states for each phrase. the EER %.
With these models we can extract statistics from each utterance Nevertheless, these EER results were still quite high, so we
of the database and use this alignment information inside our decided that the results can be improved training with back-
DNN architecture. As input to the DNN, we employ 20 dimen- ground and develop subsets together. In Table 2, we can see that
sional Mel-Frequency Cepstral Coefficients (MFCC) with their if we use more data for training our systems, we achieve bet-
first and second derivatives as features for obtaining a final in- ter performance especially in deep architectures with more than

3
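A minimal sketch of how such a masking-based augmentation could be applied to a (frames x coefficients) feature matrix is given below. The erasing probability, region sizes and noise fill are illustrative assumptions rather than the exact settings of [20] or of this system.

import numpy as np

def random_erasing(feats, p=0.5, max_time_frac=0.3, max_feat_frac=0.3, rng=None):
    # With probability p, overwrite a random rectangular region of the
    # (frames x coefficients) matrix with noise drawn from the feature statistics.
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return feats
    feats = feats.copy()
    n_frames, n_coefs = feats.shape
    t = rng.integers(1, max(2, int(n_frames * max_time_frac)))
    f = rng.integers(1, max(2, int(n_coefs * max_feat_frac)))
    t0 = rng.integers(0, n_frames - t + 1)
    f0 = rng.integers(0, n_coefs - f + 1)
    feats[t0:t0 + t, f0:f0 + f] = rng.normal(feats.mean(), feats.std(), size=(t, f))
    return feats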
On the other hand, the DNN architecture consists of a front-end part, in which several different layer configurations have been tested as we will detail in the experiments, and a second part, which is an alignment based on HMM models. Finally, we have extracted supervectors as a combination of front-end and alignment with a flatten layer, and with them we have obtained speaker verification scores by using a cosine similarity metric without any normalization technique.
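The scoring step can be summarized with a short sketch like the following, which simply compares enrollment and test supervectors with plain cosine similarity; the tensor names and shapes are assumptions for illustration.

import torch
import torch.nn.functional as F

def cosine_scores(enroll_supervectors, test_supervectors):
    # enroll_supervectors: (n_enroll, dim), test_supervectors: (n_test, dim).
    # Returns an (n_enroll, n_test) matrix of cosine scores, with no
    # score normalization applied.
    e = F.normalize(enroll_supervectors, dim=1)
    t = F.normalize(test_supervectors, dim=1)
    return e @ t.T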
A set of experiments was performed using PyTorch [21] to evaluate our system. We compare a front-end with mean reduction, with a similar philosophy to [4][5], against the feature input fed directly to the HMM alignment and against a front-end followed by the HMM alignment. In the front-end part, we implemented 3 different layer configurations: one convolutional layer with a kernel of dimension 1, equivalent to a dense layer but keeping the temporal structure and without adding context information; one convolutional layer with a kernel of dimension 3; and three convolutional layers with a kernel of dimension 3.

In Table 1 we show equal error rate (EER) results for the different architectures trained on the background subset, for the female, male and both partitions together. We have found that, as we expected, the first approach, with a mean reduction mechanism for extracting embeddings, does not perform well for this text-dependent speaker verification task. It seems that this type of embedding does not correctly represent the information needed to discriminate the correct speaker and the correct phrase simultaneously. Furthermore, we show how replacing the typical mean reduction with a new alignment layer inside the DNN achieves a relative improvement of 91.62% in terms of EER.

Table 1: Experimental results on RSR2015 Part I [19] eval subset, where EER (%) is shown. These results were obtained by training only with the bkg subset.

Architecture      Kernel   Fem      Male     Fem+Male
FE: 3C + mean     3        11.20%   12.13%   11.70%
Signal + alig.    -        1.43%    1.37%    1.54%
FE: 1C + alig.    1        1.16%    0.98%    1.56%
FE: 1C + alig.    3        1.04%    0.77%    1.20%
FE: 3C + alig.    3        0.86%    1.00%    0.98%

Nevertheless, these EER results were still quite high, so we decided that the results could be improved by training with the background and development subsets together. In Table 2, we can see that if we use more data for training our systems, we achieve better performance, especially in deep architectures with more than one layer; this improvement is observed for both architectures. This fact remarks the importance of having a large amount of data to be able to train deep architectures. In addition, we performed an experiment to illustrate this effect in Fig. 4, where we show how, if we increase little by little the amount of data used to train, the results progressively improve, although we can see that the alignment mechanism makes the system more robust to the training data size.

Table 2: Experimental results on RSR2015 Part I [19] eval subset, showing EER (%). These results were obtained by training with the bkg+dev subsets.

Architecture      Kernel   Fem      Male     Fem+Male
FE: 3C + mean     3        9.11%    8.66%    8.87%
Signal + alig.    -        1.43%    1.37%    1.54%
FE: 1C + alig.    1        1.17%    0.98%    1.55%
FE: 1C + alig.    3        1.07%    0.78%    1.24%
FE: 3C + alig.    3        0.58%    0.70%    0.72%

Figure 4: Results of EER (%) varying the training data percentage, where the standard deviation is shown only for the gender-independent results.

For illustrative purposes, we also represent our high-dimensional supervectors in a two-dimensional space using t-SNE [22], which preserves distances in the low-dimensional space. In Fig. 5(a), we show this representation for the architecture which uses the mean to extract the embeddings, while in Fig. 5(b) we represent the supervectors of our best system. As we can see, the second system's representation is able to cluster the examples from the same person, whereas the first method is not able to cluster together examples from the same person. On the other hand, in both representations the data are self-organized to show on one side examples from female identities and on the other side examples from male identities.

Figure 5: Visualizing mean embeddings (a) vs. supervectors (b) for 1 phrase from male+female speakers using t-SNE, where female is marked by a cold color scale and male is marked by a hot color scale.

Furthermore, we illustrate in Fig. 6 the same representation as in the previous figure, but in this case we represent the embeddings and the supervectors of the thirty phrases from female identities. With this depiction we confirmed something that we had already observed in the previous verification experiments, since the embeddings from the mean architecture are not able to separate the same identity with a different phrase from the same identity with the same phrase, which is the basis of the text-dependent speaker verification task.

Figure 6: Visualizing mean embeddings (a) vs. supervectors (b) for 30 phrases from female speakers using t-SNE. Each phrase is marked by a different color scale.
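A figure of this kind can be reproduced with a few lines of scikit-learn and matplotlib, as in the hedged sketch below; the input arrays, perplexity and colors are placeholders, not values taken from the paper.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(supervectors, genders):
    # supervectors: (n_utterances, dim) array; genders: array of "F"/"M" labels
    # (both hypothetical inputs).
    embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(supervectors)
    for label, color in [("F", "tab:blue"), ("M", "tab:red")]:
        idx = genders == label
        plt.scatter(embedded[idx, 0], embedded[idx, 1], s=5, c=color, label=label)
    plt.legend()
    plt.show()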

5. Conclusions

In this paper we present a new method that adds a new layer acting as an alignment inside the DNN architecture for encoding meaningful information from each utterance in a supervector, which allows us to preserve the relevant information that we use to verify the speaker identity and the correspondence with the correct phrase. We have evaluated the models on the text-dependent speaker verification database RSR2015 Part I. Results confirm that the alignment as a layer within the DNN architecture is an interesting line, since we have obtained competitive results with a straightforward and simple alignment technique which has a low computational cost, so we could achieve better results with other, more powerful techniques.

6. Acknowledgements

This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the project TIN2017-85854-C4-1-R, by Gobierno de Aragón/FEDER (research group T36 17R) and by Nuance Communications, Inc. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
7. References

[1] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708, 2014.

[2] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 815–823.

[3] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Asian Conference on Computer Vision. Springer, 2010, pp. 709–720.

[4] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.specom.2015.07.003
[5] E. Malykh, S. Novoselov, and O. Kudashev, “On residual cnn in
text-dependent speaker verification task,” Lecture Notes in Com-
puter Science (including subseries Lecture Notes in Artificial Inte-
lligence and Lecture Notes in Bioinformatics), vol. 10458 LNAI,
pp. 593–601, 2017.
[6] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel,
“A study of interspeaker variability in speaker verification,”
IEEE Transactions on Audio, Speech, and Language Processing,
vol. 16, no. 5, pp. 980–988, 2008.
[7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Oue-
llet, “Front-end factor analysis for speaker verification,” IEEE
Transactions on Audio, Speech, and Language Processing,
vol. 19, no. 4, pp. 788–798, 2011.
[8] A. Lozano-Diez, A. Silnova, P. Matejka, O. Glembek, O. Plchot,
J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and op-
timization of bottleneck features for speaker recognition,” in Pro-
ceedings of Odyssey, vol. 2016, 2016, pp. 352–357.
[9] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme
for speaker recognition using a phonetically-aware deep neural
network,” in Acoustics, Speech and Signal Processing (ICASSP),
2014 IEEE International Conference on. IEEE, 2014, pp. 1695–
1699.
[10] O. Ghahabi and J. Hernando, “Deep belief networks for i-vector
based speaker recognition,” in Acoustics, Speech and Signal
Processing (ICASSP), 2014 IEEE International Conference on.
IEEE, 2014, pp. 1700–1704.
[11] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker embed-
dings for short-duration speaker verification,” in Proc. Inters-
peech, 2017, pp. 1517–1521.
[12] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur,
“Deep neural network embeddings for text-independent speaker
verification,” in Proc. Interspeech, 2017, pp. 999–1003.
[13] H. Zeinali, H. Sameti, and L. Burget, “Hmm-based phrase-
independent i-vector extractor for text-dependent speaker verifica-
tion,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 25, no. 7, pp. 1421–1435, 2017.
[14] H. Zeinali, L. Burget, H. Sameti, O. Glembek, and O. Plchot,
“Deep neural networks and hidden markov models in i-vector-
based text-dependent speaker verification,” in Odyssey-The Spea-
ker and Language Recognition Workshop, 2016, pp. 24–30.
[15] S. Dey, S. Madikeri, M. Ferras, and P. Motlicek, “Deep neural
network based posteriors for text-dependent speaker verification,”
in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE
International Conference on. IEEE, 2016, pp. 5050–5054.
[16] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end
text-dependent speaker verification,” ICASSP, IEEE International
Conference on Acoustics, Speech and Signal Processing - Procee-
dings, vol. 2016-May, no. Section 3, pp. 5115–5119, 2016.
[17] A. Miguel, J. Villalba, A. Ortega, E. Lleida, and C. Vaquero, “Fac-
tor Analysis with Sampling Methods for Text Dependent Spea-
ker Recognition,” Proceedings of the 15th Annual Conference
of the International Speech Communication Association, Inters-
peech 2014, no. September, pp. 1342–1346, 2014.

[18] A. Miguel, J. Llombart, A. Ortega, and E. Lleida, “Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition,” pp. 2819–2823, 2017.

[19] A. Larcher, K. A. Lee, B. Ma, and H. Li, “Text-dependent speaker verification: Classifiers, databases and RSR2015,” Speech Communication, vol. 60, pp. 56–77, 2014. [Online]. Available: http://dx.doi.org/10.1016/j.specom.2014.03.001

[20] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” arXiv preprint arXiv:1708.04896, 2017.

[21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS-W, 2017.

[22] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Phonetic Variability Influence on Short Utterances in Speaker Verification
Ignacio Viñals, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

ViVoLAB, Aragón Institute for Engineering Research (I3A),


University of Zaragoza, Spain
{ivinalsb,ortega,amiguel,lleida}@unizar.es

Abstract

This work presents an analysis of i-vectors for speaker recognition working with short utterances and methods to alleviate the loss of performance these utterances imply. Our research reveals that this degradation is strongly influenced by the phonetic mismatch between enrollment and test utterances. However, this mismatch is unused in the standard i-vector PLDA framework. We propose a metric to measure this phonetic mismatch and a simple yet effective compensation for the standard i-vector PLDA speaker verification system. Our results, carried out in NIST SRE10 coreext-coreext female det. 5, evidence relative improvements up to 6.65% in short utterances, and up to 9.84% in long utterances as well.

Index Terms: Speaker verification, i-vector, short utterance, vocal content.

1. Introduction

Speaker identification is the area of knowledge focused on characterizing a speaker, so any acoustic utterance can be undoubtedly assigned to the active speaker in it. If the range of hypothetical speakers is limited (the target speaker is one out of N possible speakers), we refer to the problem as speaker recognition. Otherwise, we normally talk about speaker verification: given utterances generated by two speakers, enrollment and test, the system must decide whether both speakers are the same or not.

Multiple possibilities have been proposed to best characterize the speaker on an utterance. Some of the first approaches were based on GMMs [1]. These approaches evolved into subspace techniques such as JFA [2] and the i-vector [3]. Complemented by PLDA [4], i-vectors have been the state of the art for the last decade, receiving small modifications to their original formulation. Some of them imply the use of DNNs in the form of GMM posteriors [5] or bottlenecks [6]. Only in the last years have new approaches based solely on DNNs emerged [7], with the intention of substituting the i-vector as utterance embedding.

Originally designed for long utterances with a unique speaker, the referred techniques applied to short acoustic audios have demonstrated a significant loss of performance. This situation is critical when considering other tasks such as diarization, in which long utterances from a single speaker are uncommon. Hence any improvement in the understanding of short utterances can provide a boost in many speaker-related techniques. Multiple works have succeeded in mitigating the degradation of short utterances. Some works attempt to work on the extraction models [8]. Others handle the situation by compensating the i-vector [9][10]. Some contributions have also worked on the following PLDA [11][12]. Finally, some authors also deal with the problem at the score step [13].

The paper is organized as follows: In Section 2 a review about short utterances in speaker verification is presented. Section 3 contains an analysis of i-vectors with short utterances, including our proposal to mitigate their degradation. The experiments are explained in Section 4. Finally, Section 5 contains our conclusions.

2. Speaker Verification and Short Utterances

Speaker identification is the task focused on the search of patterns to best recognize an individual by means of his voice. Deeply studied for the telephone channel, its usefulness has provoked its adaptation to other environments, such as meetings, broadcast, etc.

But, how can we differentiate two speakers by means of their voice, especially when their speech does not have to be the same, i.e. text-independent speaker identification? The answer lies in the different pronunciation of phonemes by two different people. The state of the art, past and present, has been governed by generative models, and specifically the GMM. The GMM-UBM generative model is supposed to represent the average pronunciation for the whole phonetic content. This average pronunciation can assume the role of reference, making possible the estimation of the deviations for each speaker and, consequently, speaker identification. This idea has evolved with subspace techniques [2][3], restricting the possible deviations to a limited subspace. Moreover, back-end models such as PLDA [4] have post-processed the obtained deviations, lowering more and more the error rate. Fig. 1 illustrates the standard i-vector PLDA framework, which has obtained some of the best results in speaker verification, working with long utterances containing a unique speaker.

However, most of these techniques strongly suffer when applied to short utterances. These audios, because of their limited length, do not contain the totality, or at least the majority, of the vocal space. Moreover, the remaining phonemes are poorly represented due to limited information. Unfortunately, state-of-the-art techniques are built to process the totality of the phonetic space, assuming the phonetic variability as pronunciation deviations. Consequently, state-of-the-art models invent unreal phonetic content when necessary. Therefore, when comparing short utterances, decisions are made taking into account unreal speech, with no data to back it up. In conclusion, the comparison of short utterances can depend more on the speech rather than on the speaker.

This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the 2015 FPI fellowship, the project TIN2017-85854-C4-1-R and Gobierno de Aragón/FEDER (research group T36 17R).
Figure 1: Standard i-vector PLDA framework. Given some input utterances Strn and Stst, the system generates llr(wtrn, wtst).

Figure 2: Proposed system. The zeroth order Baum-Welch statistics from both enrollment and test utterances are compared by means of the KL distance, which is then fused with the original final score.
3. Phonetic Mismatch Compensation

Short utterances are an already known problem in speaker verification, with several contributions addressing them [8][9][10][11][12][13]. Most of these solutions assume a sort of uncertainty term because of the missing information, which must be compensated. This uncertainty term summarizes how limited the information in the utterances is, but does not pay attention to the detailed missing phonemes. Therefore, this term is used as a sort of quality measure of the utterance representations. Consequently, scores are only compensated by these representation quality approximations, without any concern about the conditional dependencies when comparing enrollment and test. Comparing utterances with similarly limited information is not as harmful as comparing utterances with totally mismatched phonetic content.

According to our understanding, the detailed phonetic information is too valuable a piece of side information to ignore, and its quality keeps increasing as ASR systems evolve. This sort of knowledge allows the identification of the missing acoustic content, making possible some sort of compensation for the missing acoustic content and a fair comparison of short utterances with only the available audio.

Therefore, our proposal is a proof of concept, a first attempt to include the phonetic information in the evaluation of the trial. In this work we focus on the phonetic mismatch between enrollment and test utterances in a speaker verification system. For this reason we have defined a distance between the enrollment and the test utterance of a trial. This distance is defined to measure how different the acoustic content of the enrollment and test utterances is, hence measuring how fair the trials are in terms of acoustic similarity. The higher the number of matching phonemes, the more reliable the score is. Similarly, the lower the phoneme similarity, the less restrictive the score should be, in order to gain robustness against mismatches.

Considering the standard i-vector PLDA framework, we adopt the KL distance as the metric between enrollment and test utterances. This metric is formulated as follows:

KLdist(p, q) = KL(p||q) + KL(q||p)    (1)

KL(p||q) = ∫ p(x) ln( p(x) / q(x) ) dx    (2)

where KL represents the Kullback-Leibler divergence between distributions p and q. The KL divergence is not symmetric, hence not a distance, so we make use of the symmetric version instead. This distance compares our phonetic information, in this work the zeroth order Baum-Welch statistics, extracted from the GMM-UBM step in the pipeline. This information, related to the acoustic content of the utterance, has a strong relationship with the desired phonetic content. Nevertheless, no side information is required for its extraction.
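One way to instantiate this distance is to normalize the per-component zeroth order statistics into discrete distributions over the UBM components, so that the integral in Eq. (2) reduces to a sum. The sketch below follows that assumption (the exact discretization is not spelled out in the text), with a small epsilon added for numerical safety.

import numpy as np

def symmetric_kl_from_zeroth_stats(n_trn, n_tst, eps=1e-10):
    # n_trn, n_tst: per-component occupancy counts (zeroth order Baum-Welch
    # statistics) of the enrollment and test utterances, shape (n_gaussians,).
    # The counts are treated as discrete distributions over the components.
    p = n_trn / (n_trn.sum() + eps) + eps
    q = n_tst / (n_tst.sum() + eps) + eps
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return kl_pq + kl_qp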
The obtained distance can be taken into account at any posterior point of the speaker verification system (i-vector extractor, PLDA, etc.). In this work, a fusion of the PLDA score with the distance is considered, carried out by means of logistic regression. The schematic of the tested framework is illustrated in Fig. 2.

With this fusion, the new score is able to compensate the phonetic mismatch in the trials, providing at the same time some sort of quality measure. However, this distance just analyzes the interaction between enrollment and test utterances in the scoring process; it does not analyze the utterance representation itself.
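A hedged sketch of such a fusion is given below: a logistic regression is trained on development trials to combine the PLDA log-likelihood ratio and the KL distance into a single score. The array names are hypothetical placeholders, not artifacts of the actual system.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(plda_llr_dev, kl_dist_dev, labels_dev):
    # Development trials: PLDA log-likelihood ratios, KL distances and
    # target (1) / non-target (0) labels (hypothetical arrays).
    X = np.column_stack([plda_llr_dev, kl_dist_dev])
    return LogisticRegression().fit(X, labels_dev)

def fused_score(fusion, plda_llr, kl_dist):
    # The learned linear combination (the decision function of the logistic
    # regression) plays the role of the new, compensated trial score.
    return fusion.decision_function(np.array([[plda_llr, kl_dist]]))[0]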
Table 1: EER (%) and minDCF metrics for the original long utterances, the chopped short utterances and the phonetically balanced short utterances.

Utterance                            EER (%)   MinDCF
Baseline Long Utterance              3.25      0.16
Chopped Short Utterance              8.57      0.40
Phoneme Balanced Short Utterance     4.11      0.20

Table 2: KL distance and error (%) for both target and non-target trials depending on the trial length: long utterances (Long), chopped short utterances (Short) and phonetically balanced short utterances (Phon. Balanced). Error estimated at the NIST operation point.

Utterance        Long    Short   Phon. Balanced
Distance
  Target         1.06    3.62    2.55
  Non-target     1.74    4.61    3.47
Error (%)
  Target         28.43   80.40   40.79
  Non-target     0.06    0.01    0.03

4. Experiments

Our experiments try to analyze the relevance of the phonetic mismatch in short utterances with limited acoustic information. We have opted for a speaker verification task with NIST SRE datasets, with available long utterances (around 5 minutes of speech with a unique speaker). Cohorts from SRE04, SRE05, SRE06 and SRE08 are used to construct the speaker verification system, the standard i-vector PLDA framework. This system consists of a 2048-Gaussian GMM-UBM and a 400-dimension i-vector extractor. I-vectors are centered, whitened and length-normalized [14] before being evaluated in a 400-dimension PLDA. Gaussianized MFCCs with first and second derivatives are the input to the system.

The described system has evaluated three subsets based on the NIST SRE10 det. 5 core-extended core-extended female experiment. The original utterances constitute the reference system with long utterances. Short utterances are created by chopping random segments from the original ones, only ensuring that the short utterance contains between 3 and 60 seconds of speech. Another short utterance subset is also extracted, selecting the frames so that the original and the extracted utterances share the same phoneme distribution. This subset is referred to as phonetically balanced in the paper. This latest subset is impossible to find in real life, but it allows the analysis of short utterances with (short utterances) and without (phonetically balanced) phonetic variability, making comparisons possible.

The first analysis compares the performance of the three subsets, evaluated with the reference speaker verification system. Both sorts of short utterances are expected to yield degraded performance with regard to the original long ones due to the limited information. The relevance of the phonetic variability is checked by direct comparison between these two results. In Table 1 we present the obtained performances for the long original utterances with respect to their shorter versions, either chopped or phonetically balanced.

The results indicate that both types of variability imply a loss of performance, as expected. However, the level of degradation is far from being the same. Whilst the standard chopped short segments obtain a 163.69% relative degradation in terms of EER, the phonetically balanced short utterances only get degraded by a relative 26.46%. Therefore, the estimation variability, i.e. how robust our i-vector is given limited information, is not nearly as influential as the phonetic variability present in real-life short utterances. It is important to bear in mind that the amount of data evaluated per short utterance is significantly smaller than in the original long utterance, sometimes ruling out up to 95% of the original audio. The obtained results for long and short utterances are considered the baseline results for the experiments onwards.
as phonetically balanced in the paper. This latest subset is im- ever, this extra mismatch does not have the same effect in the
possible to find in real life, but allows the analysis of short ut- error term. Whereas non-target populations are not affected in
terances with (short utterances) and without (phonetically bal- terms of error, target trials do, explaining the degradation of
anced) phonetic variability, making comparisons possible. short utterances. Trials with short utterances fail because the
The first analysis compares the performance of the three speaker verification system considers the phonetic variability as
subsets, evaluated with the reference speaker verification sys- speaker variability, not differentiating between them.
tem. Both sorts of short utterances are expected to yield de- The proposed solution is the compensation of the original
graded performance with regard to the original long ones due to scores by means of the phonetic distance between enrollment
the limited information. The relevance of the phonetic variabil- and test utterances. As a first approach, we propose a sim-
ity is checked by direct comparison between these two results. ple yet effective linear regression fusing two systems, the i-
In Table 1 we present the obtained performances for the long vector PLDA and the KL distance. This first approach helps
original utterances with respect to their shorter versions, either the speaker verification system to notice whether the acoustic
chopped or phonetically balanced. mismatch can be degrading the score or not. The results with
The results indicate that both types of variability imply a this score are shown in Table 3.
loss of performance, as expected. However, the level of degra- According to the results, consistent improvements have
dation is far from being the same. Whilst the standard chopped been obtained, reaching up to 10% relative improvements. Sig-
short segments obtain 163.69% relative degradation in terms of nificantly enough, not only short utterances get improved but so
EER, the phonetically balanced short utterances only gets de- do long utterances.
graded a relative 26.46%. Therefore, the estimation variability, Finally, it is possible to analyze the benefits of the phonetic
i.e. how robust is our i-vector due to limited information, is not compensation and its impact with the different populations (tar-
nearly as influential as the the phonetic variability, present in get and non-target) in our trial subsets. The comparison be-
real life short utterances. It is important to bear in mind that tween our baseline system and our compensated version is in-
the amount of data evaluated per short utterance is significantly cluded in Table 4
smaller than the original long utterance, sometimes ruling out The results indicate a significant reduction of the target tri-
up to 95% of the original audio. The obtained results for long als error (False Negative cases) with both short and long utter-

8
Table 3: EER (%) and MinDCF metrics for trials with original long utterances and short utterances, evaluating with the standard i-vector PLDA system (Baseline) and our proposed compensated version (Compensated).

Utterance          EER (%)   MinDCF
Long Utterance
  Baseline         3.25      0.15
  Compensated      2.93      0.15
Short Utterance
  Baseline         8.57      0.39
  Compensated      8.00      0.39

Table 4: Error (%) at the NIST 2010 evaluation point estimated for the baseline and the compensated score with both target and non-target trials. The error is expressed for both long utterances (Long) and chopped short utterances (Short).

Utterance            Long    Short
Baseline score
  Target             28.43   80.40
  Non-target         0.06    0.01
Compensated score
  Target             8.18    32.18
  Non-target         0.76    0.80

5. Conclusions

In this work we have successfully analyzed the impact of the different variability factors that short utterances carry, phonetic and estimation, providing a simple solution to start dealing with them. This solution is able to generate up to 10% relative improvements in terms of error ratios.

We have identified two main sources of variability in short utterances that degrade the performance of speaker verification systems: phonetic variability (what is said) and estimation variability (how reliable our representation is). According to the experiments, both sources of variability imply a decrease of performance, with phonetic variability being significantly more harmful than the estimation one. This is due to the fact that state-of-the-art technologies invent acoustic content when it is missing, a common situation in short utterances.

Besides, our proposed distance metric between utterances has revealed that the loss in performance in short utterances is due to the mismatch in target trials. Phonetic variability is not differentiated from pronunciation variability, the basis of speaker variability. Therefore, the speaker verification system does not differentiate between speech and speaker mismatch, hence significantly increasing the number of False Negative trials.

Finally, our proposal to take advantage of this mismatch distance has obtained limited but consistent improvements. The fusion of the original PLDA log-likelihood ratio score with the KL distance has obtained improvements up to 10% for short and long utterances. This result is especially satisfactory thanks to its simplicity, leaving the remaining framework (i-vector extractor, PLDA, etc.) unaltered. Further work should be done in order to determine better and more efficient ways to make use of this phonetic information.

6. References

[1] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.

[2] P. Kenny, “Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms,” CRIM, Montreal, (Report) CRIM-06/08-13, pp. 1–17, 2005.

[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[4] S. J. D. Prince and J. H. Elder, “Probabilistic Linear Discriminant Analysis for Inferences About Identity,” in Proceedings of the IEEE International Conference on Computer Vision, 2007.

[5] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A Novel Scheme for Speaker Recognition Using a Phonetically-aware Deep Neural Network,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1714–1718, 2014.

[6] M. McLaren, Y. Lei, and L. Ferrer, “Advances in Deep Neural Network Approaches to Speaker Recognition,” International Conference on Acoustics, Speech and Signal Processing, pp. 4814–4818, 2015.

[7] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep Neural Network-based Speaker Embeddings for End-to-end Speaker Verification,” IEEE Spoken Language Technology Workshop (SLT), pp. 165–170, 2016.

[8] R. Vogt, B. Baker, and S. Sridharan, “Factor analysis subspace estimation for speaker verification with short utterances,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 853–856, 2008.

[9] A. Kanagasundaram, D. Dean, J. Gonzalez-Dominguez, S. Sridharan, D. Ramos, and J. Gonzalez-Rodriguez, “Improving short utterance based I-vector speaker recognition using source and utterance-duration normalization techniques,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 2465–2469, 2013.

[10] A. Kanagasundaram, D. Dean, S. Sridharan, J. Gonzalez-Dominguez, J. Gonzalez-Rodriguez, and D. Ramos, “Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques,” Speech Communication, vol. 59, pp. 69–82, 2014. [Online]. Available: http://dx.doi.org/10.1016/j.specom.2014.01.004

[11] S. Cumani, O. Plchot, and P. Laface, “Probabilistic linear discriminant analysis of i-vector posterior distributions,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 7644–7648, 2013.

[12] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel, “PLDA for Speaker Verification with Utterances of Arbitrary Duration,” Journal of Chemical Information and Modeling, vol. 53, no. 9, pp. 1689–1699, 2013.

[13] T. Hasan, R. Saeidi, J. H. L. Hansen, and D. A. van Leeuwen, “Duration mismatch compensation for i-vector based speaker recognition systems,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 7663–7667, 2013.

[14] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of I-vector Length Normalization in Speaker Recognition Systems,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2011, pp. 249–252.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Restricted Boltzmann Machine Vectors for Speaker Clustering


Umair Khan, Pooyan Safari and Javier Hernando

TALP Research Center, Department of Signal Theory and Communications


Universitat Politecnica de Catalunya - BarcelonaTech, Spain
{umair.khan, javier.hernando}@upc.edu, pooyan.safari@tsc.upc.edu

Abstract

Restricted Boltzmann Machines (RBMs) have been used both in the front-end and backend of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. In this work, we perform the traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on the audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% compared to i-vectors for the single linkage algorithm.

Index Terms: Speaker Clustering, Restricted Boltzmann Machine Adaptation, Agglomerative Hierarchical Clustering.

1. Introduction

In recent years, deep learning architectures have shown their success in various areas of image processing, computer vision, speech recognition, machine translation and natural language processing. This fact has inspired the research community to make use of these techniques in speaker recognition tasks [1, 2, 3, 4, 5]. In speaker recognition tasks, deep learning architectures are used to extract bottleneck (BN) features and to compute GMM posterior probabilities in a hybrid HMM-DNN model [6, 7].

Generative or unsupervised deep learning architectures like Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and Deep Autoencoders have representational learning power. A first attempt to use RBMs at the backend in a speaker verification task was made in [8]. Efforts have been made by the authors to learn a compact and fixed-dimensional speaker representation in the form of a speaker vector by using RBMs as a front-end [9]. They also make use of DBNs at the backend in the i-vector framework for speaker verification [10]. As a continuation of these works, a successful attempt was made in our previous work to apply RBMs as a front-end for learning a fixed-dimensional speaker representation [11]. This vector representation of the speaker was referred to as the RBM vector. In [11] it has been shown that the RBM vector preserves speaker-specific information and has shown competitive results as compared to conventional i-vector based speaker verification systems. This has led us to apply the RBM vector for learning speaker representations in the task of speaker clustering.

Speaker clustering refers to the task of grouping speech segments in order to have segments from the same speaker in the same group. Ideally each group or cluster must contain speech segments that belong to the same speaker. On the other hand, utterances from the same speakers must not be distributed among multiple clusters. Several approaches to the speaker clustering task exist, for example cost optimization, sequential and Agglomerative Hierarchical Clustering [12, 13, 14, 15]. Some approaches rely on commonly used statistical speaker modeling like Gaussian Mixture Models (GMMs) while others use features extracted using Deep Neural Networks (DNNs). For example, in [16] BN features extracted from different DNNs are used for speaker clustering. In this work we consider the use of RBMs for the vector representation of speakers. We extend the use of the RBM vector [11] in the context of speaker clustering. First, we extract RBM vectors for all the speaker utterances in the same way as in [11]. Then, we perform a bottom-up AHC clustering for all the RBM vectors using cosine or Probabilistic Linear Discriminant Analysis (PLDA) scores. We have found that the RBM vector representation of speakers is as successful in the task of speaker clustering as in speaker verification. The experimental results show that the RBM vector outperforms conventional i-vector based speaker clustering using both the cosine and PLDA scoring methods.

The rest of the paper is organized as follows. Section 2 explains the training of a global model referred to as Universal RBM (URBM), the RBM adaptation for speaker utterances and the RBM vector extraction followed by a Principal Components Analysis (PCA) whitening and dimensionality reduction. Section 3 contains a brief description of the speaker clustering system using RBM vectors. Section 4 is about the experimental setup, database, evaluation metrics and the experiments carried out. In Section 5 the obtained results are depicted and finally some conclusions are drawn in Section 6.

2. RBM Vector Representation

In this work, we propose the use of a new speaker representation using RBMs in the context of speaker clustering. Fig. 1 shows the basic block diagram of the different stages of the RBM vector extraction process and its input to the clustering module. First of all a global model, which is referred to as the Universal RBM (URBM), is trained using the features extracted from a large amount of background speakers. Then the URBM is adapted to the features extracted from each speaker's segments that are to be clustered. These models are referred to as adapted RBMs. The visible-to-hidden connection weights of these adapted models are used to generate the RBM vector for the corresponding speaker segment. Finally, the RBM vectors are further used for the speaker clustering task using cosine and PLDA scoring. The whole process has three main stages, namely URBM training, RBM adaptation and RBM vector extraction using PCA whitening with dimensionality reduction.

This work has been developed in the framework of the Camomile Project (PCIN-2013-067) and the Spanish Project DeepVoice (TEC2015-69266-P).
Figure 1: Block diagram showing the different stages of the RBM vector extraction and its input to the bottom-up AHC.

2.1. URBM Training

In order to generate the RBM vector, the first step is to train a global model with a large amount of background data. This is the URBM, which is supposed to convey speaker-independent information. The URBM is trained as a single model with the features extracted from all the background speakers. As these features are real valued, we have used Gaussian real-valued units for the visible layer of the RBM. The RBM is trained using the CD-1 algorithm [17], which assumes that the inputs have zero mean and unit variance. Thus the features are Mean Variance Normalized (MVN) prior to the URBM training. Finally, we trained the URBM with a large amount of training samples generated from the background speakers' features. The URBM is supposed to learn both speaker and session variabilities from the background data [11].
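For intuition, a stripped-down CD-1 update for an RBM with Gaussian visible units and binary hidden units could look like the sketch below. This is an illustrative implementation and not the authors' code; the default learning rate and weight decay echo the URBM values quoted in Section 4, and speaker adaptation would simply initialize W and the biases from the URBM before running the same updates on one segment's features.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianBernoulliRBM:
    # Minimal CD-1 sketch: Gaussian (unit-variance) visible units, binary hidden units.
    def __init__(self, n_visible=80, n_hidden=400, lr=0.0005, weight_decay=0.0002, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr, self.weight_decay = lr, weight_decay
        self.rng = rng

    def cd1_step(self, v0):
        # v0: mini-batch of MVN features, shape (batch, n_visible)
        h0_prob = sigmoid(v0 @ self.W + self.b_h)
        h0 = (self.rng.random(h0_prob.shape) < h0_prob).astype(float)
        v1 = h0 @ self.W.T + self.b_v              # mean reconstruction of Gaussian units
        h1_prob = sigmoid(v1 @ self.W + self.b_h)
        batch = v0.shape[0]
        dW = (v0.T @ h0_prob - v1.T @ h1_prob) / batch
        self.W += self.lr * (dW - self.weight_decay * self.W)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0_prob - h1_prob).mean(axis=0)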

2.2. RBM Adaptation

Once the URBM is trained, we perform speaker adaptation for every speaker segment that has to be used for the clustering task. The adapted RBM model is trained only with the data of the corresponding speaker segment, in order to capture speaker-specific information. During adaptation the RBM model of the speaker segment is initialized with the parameters (weights and biases) of the URBM. This kind of adaptation technique has been successfully applied in [18, 19]. The adaptation is also carried out using the CD-1 algorithm. In other words, the adaptation step drives the URBM model in a speaker-specific direction. The weight matrices of the adapted models are supposed to convey speaker-specific information of the corresponding speaker. Fig. 2 shows the visualization of the connection weights of the URBM (at the top of the figure) and of two randomly selected speakers (Speaker 1 and Speaker 2, at the bottom of the figure). From the figure, it is clear that the URBM weights are driven in a speaker-specific direction which can be discriminative.

Figure 2: Comparison of the URBM and adapted RBM visible-hidden weight matrices. The URBM weights are quite different from the adapted RBM ones, which shows that these parameters convey speaker-specific information.

2.3. RBM Vector Extraction

After the adaptation step, an RBM model is assigned to each speaker segment. We concatenate the visible-hidden connection weights along with the bias vectors of the adapted speaker RBMs in order to generate a higher dimensional speaker vector, referred to as the RBM supervector. As in our previous work [11], we apply a PCA whitening with dimensionality reduction to the RBM supervector in order to generate the lower dimensional RBM vector. The PCA whitening transforms the original data to the principal component space. This results in reducing the correlation between the data components. The PCA is trained with the background RBM supervectors and applied to the test RBM supervectors (used in clustering). All the RBM supervectors are mean-normalized prior to applying PCA. In our previous work [11], it has been shown that the RBM vector is successful in learning speaker-specific information in the speaker verification task. Thus, we make use of the RBM vector in the context of speaker clustering.
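The extraction step can be summarized with the hedged sketch below, where background_rbms and test_rbms are hypothetical lists of adapted RBM models and the output dimensionality is only an example; scikit-learn's PCA with whiten=True stands in for the whitening described above.

import numpy as np
from sklearn.decomposition import PCA

def rbm_supervector(rbm):
    # Concatenate the adapted RBM's visible-hidden weights and bias vectors
    # into one high-dimensional supervector.
    return np.concatenate([rbm.W.ravel(), rbm.b_v, rbm.b_h])

def extract_rbm_vectors(background_rbms, test_rbms, dim=2000):
    # background_rbms / test_rbms: lists of adapted RBM models (hypothetical
    # objects exposing W, b_v and b_h, e.g. the sketch from Section 2.1).
    background = np.stack([rbm_supervector(r) for r in background_rbms])
    tests = np.stack([rbm_supervector(r) for r in test_rbms])
    mean = background.mean(axis=0)                      # explicit mean-normalization
    pca = PCA(n_components=dim, whiten=True).fit(background - mean)
    return pca.transform(tests - mean)                  # lower-dimensional RBM vectors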
3. Speaker Clustering

We have considered the conventional bottom-up AHC clustering system with the options of single and average linkage. We did not consider the model retraining approach because it is costly in terms of computation as compared to the linkage approaches to clustering [14]. The system starts with an initial number of clusters equal to the total number of speaker segments. Iteratively, the segments that are more likely to be from the same speaker are clustered together until a stopping criterion is reached. The stopping criterion can be thresholding the score in order to decide whether to merge clusters, or it can be a desired (known) number of clusters being achieved.
The clustering algorithm is based on computing a distance/similarity matrix M(X) between all the speakers' segments, where X is the set of segments to be clustered. Once the RBM vectors of all the segments are extracted, the matrix M(X) is computed by scoring all the RBM vectors against all. Thus, for N RBM vectors the matrix M(X) has dimensions N x N. In every iteration, the segments with the minimum/maximum distance/similarity scores are clustered together and the matrix M(X) is updated. The corresponding rows and columns of the clustered segments are removed from M(X) and a new row and column are added. The new row and column contain the distance scores between the new and the old clusters. The new scores are computed according to the linkage algorithm used. For example, segments Sa and Sb are clustered into Sab. Then the scores between the new cluster (Sab) and an old segment (Sn) are computed as follows:

(a) Average Linkage:

s(Sab, Sn) = 1/2 {s(Sa, Sn) + s(Sb, Sn)}    (1)

(b) Single Linkage:

s(Sab, Sn) = max{s(Sa, Sn), s(Sb, Sn)}    (2)

where s(Sab, Sn) is the score between the new cluster Sab and the old segment Sn, while s(Sa, Sn) is the score between the old segments Sa and Sn.

In this way, the process is iterated until a stopping criterion is met. There are two methods to control the iterations: (1) fix a threshold, or (2) give the system additional information about the desired (known) number of clusters, so that it stops when this number is reached. In this work, we did not let the system know any desired number of clusters and we have used the thresholding method. We have tuned the threshold in order to see the performance of the system at different possible working points. The system performance is measured with respect to ground truth cluster labels. We will discuss the evaluation metrics in Section 4.
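A compact sketch of this bottom-up procedure over a precomputed score matrix, with the linkage updates of Eqs. (1) and (2) and a threshold-based stop, is shown below. It is an illustrative implementation, not the exact one used for the reported experiments.

import numpy as np

def ahc(score_matrix, threshold, linkage="average"):
    # score_matrix: symmetric N x N similarity matrix M(X), e.g. cosine or
    # PLDA scores between RBM vectors. Pairs are merged while the best score
    # stays above the threshold.
    clusters = [[i] for i in range(score_matrix.shape[0])]
    S = score_matrix.astype(float).copy()
    np.fill_diagonal(S, -np.inf)
    while len(clusters) > 1:
        a, b = np.unravel_index(np.argmax(S), S.shape)
        if S[a, b] < threshold:
            break
        if a > b:
            a, b = b, a
        if linkage == "average":
            new_scores = 0.5 * (S[a] + S[b])          # Eq. (1)
        else:
            new_scores = np.maximum(S[a], S[b])       # Eq. (2)
        S[a, :], S[:, a] = new_scores, new_scores
        S[a, a] = -np.inf
        S = np.delete(np.delete(S, b, axis=0), b, axis=1)
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters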
4. Experimental Setup and Database

The experiments were performed using the audios from the AGORA database, which contains audio recordings of 34 TV shows of the Catalan broadcast channel TV3 [20]. Each show comprises two parts, i.e., a and b, so there are 68 audio files in total, of an approximate length of 38 minutes each. These files contain segments from 871 adult Catalan and 157 adult Spanish speakers. For the clustering experiments in this work, we have selected 38 audio files for testing and the remaining 30 audios are used as background data. The background data is used to train the Universal Background Model (UBM), the Total Variability (T) matrix, the URBM and the PCA. From the testing audio files, we have manually extracted 2631 speaker segments according to the ground truth rich transcription. These segments belong to 414 speakers that appear in the audios.

For both the baseline and proposed systems, 20 dimensional Mel-Frequency Cepstral Coefficient (MFCC) features are extracted using a Hamming window of 25 ms with a 10 ms shift. For the baseline, a 512-component UBM is trained to extract i-vectors, and the PLDA is trained with the background i-vectors, using the Alize toolkit [21]. For the proposed system, more than 3000 speaker segments are extracted from the background shows according to the ground truth rich transcription. For each segment, we concatenate the features of 4 neighboring frames in order to generate 80-dimensional feature inputs to the RBMs. With a shift of one frame, we generate almost 10 million samples for the URBM training. The large amount of training samples favors more efficient learning, which leads to a more accurate URBM. All the RBMs used in this paper comprise 80 visible and 400 hidden units. The URBM was trained for 200 epochs with a learning rate of 0.0005, a weight decay of 0.0002 and a batch size of 100. All the adapted RBM models for the test speaker segments are trained for 200 epochs with a learning rate of 0.005, a weight decay of 0.000002 and a batch size of 64. The PCA is trained with the background RBM supervectors as discussed in Section 2.3. Finally, fixed dimensional RBM vectors are extracted for the speakers' segments and are used in the speaker clustering experiments. Different dimensions of the RBM vectors are evaluated, which will be discussed in the results section.

There are several metrics to measure the performance of speaker clustering, for example cluster impurity (or, conversely, cluster purity), rand index, normalized mutual information (NMI) and F-measure, as described in [22]. We have considered the Cluster Impurity (CI) measure in this work. CI measures the quality of a cluster, i.e., to what extent a cluster contains segments from different speakers. However, this metric has a trivial solution when there is only one segment per cluster. To deal with this, Speaker Impurity (SI) is measured at the same time. SI measures to what extent a speaker is distributed among clusters. There is a trade-off between CI and SI [23]. CI and SI are plotted against each other in an Impurity Trade-off (IT) curve and an Equal Impurity (EI) point is marked as the working point.
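One common way to compute these two impurities from the clustering output and the ground-truth speaker labels is sketched below; the variable names are placeholders and the exact definitions used in [22, 23] may differ in detail.

import numpy as np
from collections import Counter

def impurity(assignments, references):
    # Fraction of segments that do not belong to the majority reference class
    # of their assigned group (one common way to instantiate impurity).
    assignments = np.asarray(assignments)
    references = np.asarray(references)
    wrong = 0
    for g in np.unique(assignments):
        members = references[assignments == g]
        wrong += len(members) - Counter(members.tolist()).most_common(1)[0][1]
    return wrong / len(assignments)

# With cluster_ids (cluster label per segment) and speaker_ids (ground-truth
# speaker per segment), both hypothetical arrays:
#   cluster_impurity = impurity(cluster_ids, speaker_ids)
#   speaker_impurity = impurity(speaker_ids, cluster_ids)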
5. Results

Different lengths of the RBM vectors as well as of the i-vectors are evaluated using cosine scoring and the average linkage clustering algorithm. The results are shown in the second column of Table 1. From the table, it can be observed that if the dimension is increased, the performance is improved, both in the case of i-vectors and RBM vectors, in terms of Equal Impurity (EI). However, in the case of i-vectors, the best choice is 800 dimensions. In the case of RBM vectors, the 2000 dimensional RBM vectors perform better than the others. In this case, a relative improvement of 11% is achieved compared to 800 dimensional i-vectors. A further increase in the length of RBM vectors beyond 2000 degrades the performance in terms of EI. The third column of Table 1 compares the performance of RBM vectors with the baseline i-vectors in the case of the single linkage algorithm for clustering using cosine scoring.

Table 1: Comparison of speaker clustering results for the proposed RBM vectors with i-vectors. The dimensions of the vectors are given in parentheses. Each column shows Equal Impurity (EI) in % for a different scoring and linkage combination.

Approach             EI% (Cosine Average)   EI% (Cosine Single)   EI% (PLDA Single)
i-vector (400)       49.19                  46.26                 36.16
i-vector (800)       46.66                  42.19                 35.91
i-vector (2000)      46.79                  42.83                 35.89
RBM vector (400)     51.36                  39.66                 37.36
RBM vector (800)     47.20                  40.02                 32.36
RBM vector (2000)    41.53                  37.14                 31.68
From the table it is seen that single linkage is a better choice for our experiments. In this case a minimum EI of 37.14% is obtained with 2000 dimensional RBM vectors, which is a relative improvement of 12% over 800 dimensional i-vectors. Finally, we evaluated the proposed system using PLDA scoring as well. The PLDA is trained using background RBM vectors for 15 iterations. The number of eigenvoices is set to 250, 450 and 500 for RBM vectors of dimensions 400, 800 and 2000 respectively. All the RBM vectors are subjected to length normalization prior to PLDA training. As per the previous results, we performed this experiment with the single linkage algorithm only. The results are compared with i-vectors in the fourth column of Table 1. It is observed that the 800 and 2000 dimensional RBM vectors have a better EI compared to the respective similar dimensional i-vectors. In this case, the RBM vectors of dimension 2000 have the minimum EI of 31.68%, which results in a relative improvement of 11% over the 800 dimensional i-vectors. However, in the case of 400 dimensions, the i-vectors outperform the RBM vectors.

The Impurity Trade-off (IT) curves for the baseline as well as the proposed system are shown in Figures 3 and 4. In Figure 3 we have shown the evaluation of different dimensions of i-vectors and RBM vectors in the average linkage clustering using cosine scoring. It can be seen that RBM vectors of length 2000 give better performance than 800 dimensional i-vectors at all working points. On the other hand, RBM vectors of dimensions 400, 800, 2400 and 3000 perform worse than i-vectors. It is observed that 400 and 800 dimensional RBM vectors could not capture enough information about the speaker, while 2400 and 3000 dimensional RBM vectors include unnecessary information which degrades the performance. In Figure 4 we have shown a comparison of 2000 dimensional RBM vectors with 800 dimensional i-vectors using both cosine and PLDA scoring with the single linkage algorithm for clustering. The choices of dimensions are based on the previous experiments, as 2000 dimensional RBM vectors and 800 dimensional i-vectors give the best results with cosine scoring and average linkage. From Figure 4, it can be seen that the RBM vectors perform better at all working points as compared to i-vectors using their respective cosine and PLDA scoring. However, at low Speaker Impurity regions, the RBM vector with cosine scoring outperforms the baseline i-vector with PLDA scoring. Overall, the 2000 dimensional RBM vector has a consistently improved performance compared to i-vectors.

Figure 3: Comparison of Impurity Trade-off (IT) curves (Speaker Impurity (%) vs. Cluster Impurity (%)) for the proposed RBM vectors with i-vectors. Different dimensions of RBM vectors are evaluated using cosine scoring with the average linkage algorithm for clustering. The dimensions of the i-vectors and RBM vectors are given in parentheses.

Figure 4: Comparison of Impurity Trade-off (IT) curves (Speaker Impurity (%) vs. Cluster Impurity (%)) for the proposed RBM vectors with i-vectors. Different dimensions of RBM vectors are evaluated using cosine and PLDA scoring with the single linkage algorithm for clustering. The dimensions of the i-vectors and RBM vectors are given in parentheses.

6. Conclusions

In this paper we proposed the use of Restricted Boltzmann Machines (RBMs) for the speaker clustering task. RBMs are used to learn a fixed dimensional vector representation of the speaker, referred to as the RBM vector. First, a Universal RBM is trained with a large amount of background data. Then an adapted RBM model per test speaker is trained. The visible-hidden weight matrices along with their bias vectors of these adapted RBMs are concatenated to generate RBM supervectors. These RBM supervectors are subjected to a PCA whitening and dimensionality reduction to extract RBM vectors. Two linkage algorithms for Agglomerative Hierarchical Clustering are explored with RBM vectors scored using cosine and PLDA. Using cosine scoring, the performance of the proposed system is better for both linkage algorithms as compared to i-vector based clustering. Overall, the single linkage algorithm with 2000 dimensional RBM vectors is the best choice for our experiments, using both cosine and PLDA scoring. We conclude that the RBM vectors can be successfully used as a speaker representation in a speaker clustering task. The best dimension for the RBM vectors is found to be 2000, which gives better performance than i-vectors as well as than RBM vectors of other dimensions.
7. References

[1] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 10 2015.

[2] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.

[3] P. Kenny, V. Gupta, T. Stafylakis, P. Ouellet, and J. Alam, “Deep neural networks for extracting baum-welch statistics for speaker recognition,” in Proc. Odyssey, 2014, pp. 293–298.

[4] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 10 2015.
[5] K. Chen and A. Salman, “Learning speaker-specific characteris-
tics with a deep neural architecture,” IEEE Transactions on Neural
Networks, vol. 22, no. 11, pp. 1744–1756, 11 2011.
[6] L. Deng, D. Yu et al., “Deep learning: methods and applications,”
Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp.
197–387, 2014.
[7] T. Yamada, L. Wang, and A. Kai, “Improvement of distant-talking
speaker identification using bottleneck features of DNN.” in Inter-
speech, 2013, pp. 3661–3664.
[8] M. Senoussaoui, N. Dehak, P. Kenny, R. Dehak, and P. Du-
mouchel, “First attempt of boltzmann machines for speaker verifi-
cation,” in Odyssey 2012-The Speaker and Language Recognition
Workshop, 2012.
[9] O. Ghahabi and J. Hernando, “Restricted boltzmann machines for
vector representation of speech in speaker recognition,” Computer
Speech & Language, vol. 47, pp. 16–29, 1 2018.
[10] ——, “Deep learning backend for single and multisession i-vector
speaker recognition,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 25, no. 4, pp. 807–817, 4 2017.
[11] P. Safari, O. Ghahabi, and J. Hernando, “From features to speaker
vectors by means of restricted boltzmann machine adaptation,” in
ODYSSEY 2016-The Speaker and Language Recognition Work-
shop, 2016, pp. 366–371.
[12] H. Sayoud and S. Ouamour, “Speaker clustering of stereo audio
documents based on sequential gathering process,” Journal of In-
formation Hiding and Multimedia Signal Processing, vol. 4, pp.
344–360, 10 2010.
[13] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, “Automatic seg-
mentation, classification and clustering of broadcast news audio,”
in Proc. DARPA speech recognition workshop, 1997, pp. 97–99.
[14] H. Ghaemmaghami, D. Dean, S. Sridharan, and D. A. van
Leeuwen, “A study of speaker clustering for speaker attribution in
large telephone conversation datasets,” Computer Speech & Lan-
guage, vol. 40, pp. 23–45, 11 2016.
[15] S. E. Tranter and D. A. Reynolds, “An overview of auto-
matic speaker diarization systems,” IEEE Transactions on audio,
speech, and language processing, vol. 14, no. 5, pp. 1557–1565,
9 2006.
[16] J. Jorrı́n, P. Garcı́a, and L. Buera, “DNN bottleneck features
for speaker clustering,” Proc. Interspeech 2017, pp. 1024–1028,
2017.
[17] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algo-
rithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp.
1527–1554, 6 2006.
[18] P. Safari, O. Ghahabi, and J. Hernando, “Feature classification by
means of deep belief networks for speaker recognition,” in Signal
Processing Conference (EUSIPCO), 2015 23rd European. IEEE,
2015, pp. 2117–2121.
[19] O. Ghahabi and J. Hernando, “Deep belief networks for i-vector
based speaker recognition,” in Acoustics, Speech and Signal
Processing (ICASSP), 2014 IEEE International Conference on.
IEEE, 2014, pp. 1700–1704.

14
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Speaker Recognition under Stress Conditions


Esther Rituerto-González, Ascensión Gallardo-Antolı́n, Carmen Peláez-Moreno

Signal Theory and Communications Department


University Carlos III Madrid
erituert@ing.uc3m.es, gallardo@tsc.uc3m.es, carmen@tsc.uc3m.es

Abstract or recorded under real conditions, along with the difficulty


involved in the labelling process.
Speaker Recognition systems exhibit a decrease in performance
when the input speech is not in optimal circumstances, for The research performed on this paper is part of a
example when the user is under emotional or stress conditions. project called ’BINDI: Smart solution for Women’s safety
The objective of this paper is measuring the effects of stress XPRIZE’ by UC3M4Safety group [4]. The UC3M4Safety is a
on speech to ultimately try to mitigate its consequences on multidisciplinary team for detecting, preventing and combating
a speaker recognition task. On this paper, we develop a violence against women from a technological point of view.
stress-robust speaker identification system using data selection The goal of this project is to develop a wearable solution that
and augmentation by means of the manipulation of the original will detect a user’s panic, fear and stress through physiological
speech utterances. An extensive experimentation has been sensor data, speech and audio analysis and machine-learning
carried out for assessing the effectiveness of the proposed algorithms. The ability to detect whether the voice belongs to
techniques. First, we concluded that the best performance is the user or to anyone else, even under stress conditions is where
always obtained when naturally stressed samples are included in this research comes in.
the training set, and second, when these are not available, their In this paper we want to analyze how does stress in speech
substitution and augmentation with synthetically generated affect speaker recognition rates. We aim to find techniques for
stress-like samples, improves the performance of the system. strengthening speaker recognition systems, either neutralizing
Index Terms: speaker recognition, speaker identification, the effects of stress or being able to model and synthesize
emotions, stress conditions, data augmentation, synthetic stress it from neutral speech, to create synthetically stressed speech
using data augmentation techniques.
1. Introduction The rest of the paper is organized as follows: in Section
2 we describe the state of the art in speaker recognition and
In recent years the interest to detect and interpret emotions discuss features and classifiers used in literature. In Section 3,
in speech as well as to generate certain emotions in speech we explain the methodology followed for the feature extraction
synthesis have grown in parallel. It is well-known that speech and the data augmentation techniques. Section 4 refers to the
recognition systems function less efficiently when the speaker experimental set-up and results, and finally in Section 5 we
is under an emotional state, and in fact, some studies consider discuss the conclusions and future work.
emotions in speech as a distortion [1].
To be able to synthesize an emotion in speech, it is
necessary to analyze what are the characteristics that make 2. Speaker Recognition Related Work
it different from neutral speech. The work done about
emotions in speech is very extensive, analysis are performed Speaker Recognition is the automatic detection of a person from
to study what features or combinations of them carry more the characteristics of their voices (voice biometrics) [5]. We can
information about emotions improving speech recognition rates distinguish two tasks, Speaker Identification and Verification.
[2], and some works aim to model emotions in speech by The first refers to the recognition of a particular user among a
manipulating systematically some of the parameters of human known number of users (a multiclass setting), and the second
speech, generating synthetic speech that simulates emotions [3]. aims at identifying one user versus the rest (binary setting).
Moreover, the record-keeping of databases with emotional
and neutral speech is difficult as they are either recorded 2.1. Features
by actors simulating speech under those emotions, or by
people under actual emotions, which could be complicated to In the literature, many features are usually used for
induce. Nevertheless, stress is not considered a proper emotion, Speaker Recognition, for example: Mel-Frequency Cepstral
although it is intimately related to anxiety and nervousness, it is Coefficients (MFCC) -due to their low complexity and
a state of mental or emotional tension resulting from adverse or high performance in controlled environments-, Phonetic and
demanding circumstances. Prosodic features [6] or the Linear Prediction coefficients (LP)
There is plenty of work about the effects of emotions [7]. All of these features exhibit good performance in the task
in Automatic Speech Recognition (ASR) or classification of when used in neutral or emotionless speech.
emotions in speech, but there is few work of the effects of For speaker recognition under stress conditions, however,
emotions in Speaker Recognition (SR), not to mention about there is hardly any previous work, even though, MFCCs, along
stressed speech on SR. Stressed speech is hard to simulate as with Linear Frequency Cepstral Coefficients (LFCC) and Linear
it appears together with physical changes such as the increase Prediction Cepstral Coefficients (LPCC) are cited as important
of heart rate and skin perspiration. There are also hardly features [8], together with the Pitch, Energy and Duration,
any databases in which stressed speech is either simulated which are features that seem to differ between speakers.

15 10.21437/IberSPEECH.2018-4
2.2. Data augmentation second, measuring those differences between stressed and
neutral frames for the same speakers.
Data augmentation (DA) is a commonly used strategy adopted
As a first outcome, we realized that locution speed reflects
to increase the quantity of training data. It is a key ingredient
the stress of a person, we tend to pronounce more words per
of the state of the art systems for image and speech recognition
second and produce longer pauses when stressed. In these same
[9]. It can act as a regularizer in preventing overfitting [10]
conditions, there is a tendency to rise the frequency of our
and improving performance in imbalanced class problems [11],
voices. Thus, the speed and pitch from audio signals are two
making the whole process more robust and achieving a better
variables that we aim to modify by using the SOX library [18]
performance. It is also very useful for small data sets, as it is
in order to artificially simulate speech under stress conditions.
our case, to augment the speech database and as a consequence
improve accuracy [12].
4. Experiments
2.3. Classifiers In this section we present the construction of our system in a
Methods such as Gaussian Mixture Models (GMM) are block by block basis: we introduce the database, the labelling
generally used for speaker recognition, Support Vector strategy, the preprocessing of the data, and the experiments
Machines are widely applied as well [13], [14]. However, carried out.
several studies suggest the use of Deep Learning for speaker
recognition [15] and others prove the improvement in SR 4.1. Corpus database
performance using Convolutional Neural Networks [16]. We used the so-called VOCE Corpus Database [19], a
In recent years Deep Learning algorithms have skyrocketed 45-speaker recordings database in neutral and stress conditions.
in many scientific fields specially when using a large number of For each of the users, speech was recorded on 3 different
data. But for this research, we aim to keep a balance between scenarios: recording, prebaseline and baseline, which were
computational complexity and accuracy, due to the constraints acquired respectively, in a public speaking setting where the
that our targeted device hardware imposes and the reduced speaker is supposed to be under stress conditions, the speaker is
amount of data originally available. Also, preliminary tests to reading a paper 24 hours before the speech, and again reading
compare GMM, SVM and Multi-Layer Perceptron (MLP) led the same paper 30 minutes but before the public speaking
us to chose the later, a precursor of Deep Neutral Networks, setting. The heart rate (HR) was also acquired every second
due its better performance. for the three recordings.

3. Methodology Table 1: Number of samples


In this section we describe the theoretical part of the proposed
Samples Neutral Stressed Total
system, the extraction of the features and the data augmentation
techniques. The block diagram for the training stage is Set 1 1389 3989 5378
represented in Figure 1. The test stage is identical with the Set 2 1716 4858 6574
exception of the Data Augmentation block. Total 3105 8847 11952

3.1. Feature Extraction However we only used 21 speakers out of the 45 due to
the lack of properly recorded HR information, noisy audios
The acoustic features of speech extracted from audio signals
or absence of recordings. We divided these 21 speakers into
should reflect both anatomy (e.g., size and shape of the throat
two sets, Set 1 was composed of 10 speakers whose HR were
and mouth) and learned behavioral patterns (e.g., voice pitch,
coherent with the recordings in the sense that, when a speaker
speaking style).
was reading the heart rate remained stable, but on the public
We worked with the features extracted in the work done by speaking setting the HR rose. Set 2 was made out of the other
Alba Mı́nguez [17] within BINDI for stress detection since the 11 remaining speakers. In Table 1 the number of samples per
database employed is the same. These are the pitch, first three setting are specified, each sample representing 1s audio frames.
formants, twelve Mel-Frequency Cepstral Coefficients and the
energy of the signals. The short-term features were computed
4.2. Preprocessing
every 10 ms of audio and then a temporal integration was
performed over 1s length segments, calculating the mean and For simplicity, we begin with a conversion from stereo to
standard deviation, and resulting in one feature vector per 1s mono of the audio recordings, followed by a downsampling
audio frames, which is the rate at which the accompanying heart from 44100Hz to 16000Hz to reduce the computational cost
rate measures used for labelling stress were taken. of the problem without loosing too much precision. Then, a
normalization of the signals in amplitude is achieved to be able
3.2. Data augmentation to compare between them, and finally the signals go through
a voice activity detector (VAD) [20] that removes silent audio
As for our device, we would hypothetically have neutral
frames as those don’t include valuable information to our task.
speech for the learning step and we may find stressed speech
for testing. For those reasons and regarding to the low
4.3. Labelling
number of samples we have, we considered the generation of
a synthetically stressed database performing data augmentation Labelling an audio signal to determine stress presence is a
for the particular case of stress conditions. To be able to delicate matter since there is not a prescribed way to do so given
produce stressed speech out of neutral utterances we carried stress is non binary and very subjective. Taking a pragmatical
out an analysis, first listening to the audio signals and detecting perspective, once more we relied on the work done by Alba
what differences could be appreciated between them, and Mı́nguez [17] where the recordings of this corpus were labeled

16
Figure 1: Block Diagram of the system
according to each user’s heart rate (HR). Every 1s audio frame 4.6. Synthetic Stress
is labelled as stressed or neutral using two different heart rate
We performed an analysis to measure the differences between
thresholds. We selected the binarization threshold that gave
the mean pitch from neutral to stressed audio frames for each
better results in their report, which was the 75% percentile of
speaker using VOICEBOX [20], and we also estimated the
the HR of the user.
average elocution speed for each user. To do this, we obtained
an automatic transcription of each of the recordings by using
4.4. Balancing the data Google Speech Recognition [21] and computed afterwards the
mean number of words per second.
Soon we realized that the data instances were not balanced for The differences of pitch from neutral to stressed speech
each speaker. An adjustment needs to be made for each set were between -2% and +7%, increasing an average of 2.2%.
and conditions to get consistent estimates as all classes have the As regards to the elocution speed, subjectively, it seems to
same importance. Nevertheless, the use of an over-sampling increase in stressed speech, but our analysis gave us the opposite
technique would have a big drawback in our case because conclusion. The number of words per second was higher when
some users have significantly more samples than others, and the user was reading a text, 2.2 words/s in mean, than when
this would create too many artificial samples. To cope with the speaker was performing an oral presentation, 1.85 words/s.
this problem we cropped randomly the neutral samples by a By listening to the signals, we determined that the words were
threshold of 120 samples for both sets, and stressed samples pronounced faster but there were more pauses in between them,
over a threshold of 300 samples. Applying an over-sampling leading to a lower elocution rate in overall.
technique (in particular, SMOTE) [11] to the new cropped data Thus, we have changed the locution speed and the pitch
culminated in new samples resulting in a balanced data set. from the original database, to produce synthetically stressed
samples of speech. The pitch was modified in steps of
4.5. Preliminary Experiments [-6%, -3%, +3%, +6%], and the signals were reproduced at
the following speeds [-20%, -15%, -10%, -5%]. All these
Originally, for an initial experimental set-up we used the data modifications are applied to the original sets and result in an
available for Sets 1 and 2 (21 speakers). This preliminary augmentation of data, one new synthetic set per modification.
experiment is made to observe the behaviour of mismatch
conditions’ experiments on the speaker recognition rate. First
of all, we divided the data in neutral (NS) and stressed speech
(S) and experimented training with one type of speech and
testing with the other, and then mixing both types. In order to
get reliable results, these experiments were repeated 50 times
where, in each repetition, at least 50% data was randomly
chosen for testing. The results in terms of accuracy (percentage
of audio segments correctly classified) are in Table 2.

Table 2: Results for match and mismatch settings


Figure 2: Original and Synthetic databases scheme
Train Test Mean (%) Std (%)
Figure 2 presents an schematic of the original data and
NS 96.73 0.33
NS its counterpart synthesized one. SS and SSS represent the
S 79.21 0.90
synthetically stressed collection from NS and S respectively.
S 95.87 0.28
S For the experimental set-up we always use the same test set,
NS 90.89 0.49
a 30% of the samples of original stressed speech, which is
MIX MIX 96.05 0.12
represented in green. Additionally, the same 30% in the
synthetic super-stressed set was removed to achieve a more
As a first conclusion, match settings are beneficial and accurate comparison between experiments. In Figure 3 we
mismatch are not, as was expected. When training with neutral enumerate the data used for training for each of the experiments.
and testing with stress, accuracy decreases, so it seems that The results of the pitch modifications experiments for Set 1
stressed speech does have different characteristics compared to are presented in Figure 4, and the ones for speed modifications
neutral speech. On the contrary, when training with stress and for the same set in Figure 5. From these figures, we can see that
testing with neutral utterances, the decrease in accuracy is not the changes that improve the accuracy the most are Pitch -3%
that important, leading us to think that stressed speech could and +3%, and although in speed the results are very similar, the
be sparse data in which neutral speech could be contained but one that in general works worse is Speed -20%.
not vice versa. About the mixed conditions experiments, the From the experiments carried out for Set 1, we decided to
accuracy reached a 96.05%, achieving a good result for this perform the following modifications to Set 2 pitch [-3%, +3%]
particular task. and signal speed [-15%, -10%, -5%].

17
Figure 3: Equivalence between training data and number of experiment.

The data augmentation experiments are 3, 4, 5, 8, 9 and 10.


The outcome is indeed positive, the best results are achieved in
experiment 8 with a 99.45% of accuracy for Sets 1 + 2. These
results show us that augmenting the data boosts the SR rate.
One of our objectives with these experiments was that
experiment 4 could outperform experiment 2, meaning that we
accomplished the task of generating appropriate synthetically
stressed speech out of neutral. That goal has not been achieved,
but at least in Table 3 experiment 4 is better than 6 which, in turn
outperforms 1, for Set 1 and for Set 1+2, settling that stressing
speech synthetically and using it as training data alongside with
Figure 4: Results for Synthetic datasets, Set 1 Pitch the original data, increases the performance of the SR system.
Modifications
5. Conclusions and future work
In this research our goal was to analyze how stressed speech
affects Speaker Recognition systems. We have identified a
problem, which is that stressed speech affects negatively when
SR systems are trained only with neutral speech.
In the experiments for data substitution, depending on the
difference between the synthetic data and the original one,
the substitution outperforms the original data. Besides, the
modifications over the speed of the audio signals work better
for substituting audio utterances than the modifications in pitch.
As regards to the experiments for augmenting the database
Figure 5: Results for Synthetic datasets, Set 1 Speed with artificial stress, we can conclude that the generation of
Modifications different synthetically stressed utterances of speech and its
addition to the database improves substantially the SR results,
reaching a 99.45% of accuracy rate in experiment 8 for Sets
After that, we joined Sets 1 and 2, transforming the problem 1+2.
in a 21-speaker SR task, and combined all the synthetic stress Due to limited time and computational power, several
data, multiplying by 5 the original dataset. We repeated the 10 experiments and methods remained unexplored and left for
experiments 20 times in order to obtain reliable results. The future work:
outcome is available in Table 3, and the equivalence between • Our target in this research is a Speaker Identification task, a
Training Set and Case number is shown in Figure 3. multiclass problem. However, the objective of the device to
be built in BINDI is a Speaker Verification system. These
Table 3: Results for extensive experimentation two approaches are not straightforwardly comparable but we
believe that the problems and solutions can be translated to
one another.
Case Set 1 Mean Set 1 Std Set 1+2 Mean Set 1+2 Std
• To simulate a real environment in which the recorded voice is
1 89.71 0.56 78.55 0.6 not clean, we could add noise to the same database used and
2 98.59 0.16 97.37 0.21 analyze its effect.
3 98.48 0.23 97.21 0.26 • Further analyzing the differences between neutral and
4 89.97 0.39 80.46 0.53 stressed speech could lead us to finding new modifications to
5 99.93 0.05 99.16 0.11 perform to neutral speech to transform it into an appropriate
6 89.72 0.53 78.19 0.71 synthetically stressed speech.
7 99.88 0.07 99.21 0.13 • Implementing new methods for recording stressed speech
8 99.91 0.07 99.45 0.08 using BINDI such as Stroop Effect games [22] in which the
9 99.94 0.06 99.22 0.11 speaker should experiment stress.
10 99.91 0.07 98.97 0.14 • With the use of data augmentation techniques we have
collected a much larger database and we could therefore
There are two types of experiments in Table 3: those where employ more powerful Deep Learning algorithms in the
we substitute data and the ones where we augment data. As future.
for substituting the original set by a synthetically stressed one,
we have experiments 6 and 7, to be compared with experiments
1 and 2 respectively. Data substitution achieves similar results 6. Acknowledgements
to the experiments with original data when using synthetic data This work is partially supported by the Spanish
obtained from NS speech for training (case 1 vs. case 6) and Government-MinECo projects TEC2014-53390-P and
better recognition rates when using synthetic data obtained from TEC2017-84395-P.
S speech for training (case 2 vs. case 7).

18
7. References [20] M. Brookes, “Voicebox: Speech processing toolbox for matlab
[software],” 01 2011. [Online]. Available: http://www.ee.ic.ac.
[1] J. H. Hansen and S. Patil, “Speaker classification,” uk/hp/staff/dmb/voicebox/voicebox.html
C. Müller, Ed. Berlin, Heidelberg: Springer-Verlag,
2007, ch. Speech Under Stress: Analysis, Modeling [21] A. Zhang, “Speech recognition (version 3.8) [software],”
and Recognition, pp. 108–137. [Online]. Available: http: 2017. [Online]. Available: ”https://github.com/Uberi/speech
//dx.doi.org/10.1007/978-3-540-74200-5\ 6 recognition”
[2] F. Dellaert, T. Polzin, and A. Waibel, “Recognizing emotion in [22] J. Ridley Stroop, “Studies of interference in serial verbal
speech,” vol. 3, 12 1996. reactions,” in Journal of Experimental Psychology: General, vol.
[3] I. Murray and J. L. Arnott, “Toward the simulation of emotion 121, 03 1992, pp. 15–23.
in synthetic speech: A review of the literature on human vocal
emotion,” vol. 93, pp. 1097–108, 03 1993.
[4] “UC3M4SAFETY - Multidisciplinary team for detecting,
preventing and combating violence against women,” 2017.
[Online]. Available: http://portal.uc3m.es/portal/page/portal/
inst estudios genero/proyectos/UC3M4Safety
[5] A. Poddar, M. Sahidullah, and G. Saha, “Speaker verification
with short utterances: a review of challenges, trends and
opportunities,” IET Biometrics, vol. 7, no. 2, pp. 91–101, 2018.
[6] E. Shriberg, Higher-Level Features in Speaker Recognition.
Berlin, Heidelberg: Springer Berlin Heidelberg, 2007,
pp. 241–259. [Online]. Available: https://doi.org/10.1007/
978-3-540-74200-5\ 14
[7] J. P. Campbell, “Speaker recognition: a tutorial,” Proceedings of
the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep 1997.
[8] D. A. Reynolds, T. Quatieri, and R. B. Dunn, “Speaker verification
using adapted gaussian mixture models.” vol. 10, no. 1, p. 19–41,
2000.
[9] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation
(VTLP) improves speech recognition,” 2013. [Online]. Available:
http://www.cs.toronto.edu/∼ndjaitly/jaitly-icml13.pdf
[10] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices
for convolutional neural networks applied to visual document
analysis,” in ICDAR, 2003.
[11] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer,
“SMOTE: synthetic minority over-sampling technique,” CoRR,
vol. abs/1106.1813, 2011. [Online]. Available: http://arxiv.org/
abs/1106.1813
[12] I. Rebai, Y. BenAyed, W. Mahdi, and J.-P. Lorré, “Improving
speech recognition using data augmentation and acoustic model
fusion,” Procedia Computer Science, vol. 112, pp. 316 – 322,
2017.
[13] W. M. Campbell, D. E. Sturim, and D. A. Reynolds,
“Support vector machines using gmm supervectors for speaker
verification,” IEEE Signal Processing Letters, vol. 13, no. 5, pp.
308–311, May 2006.
[14] K. A. Abdalmalak and A. Gallardo-Antolı́n, “Enhancement of
a text-independent speaker verification system by using feature
combination and parallel structure classifiers,” Neural Computing
and Applications, vol. 29, no. 3, pp. 637–651, Feb 2018.
[15] K. R. Farrell, R. J. Mammone, and K. T. Assaleh, “Speaker
recognition using neural networks and conventional classifiers,”
IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1,
pp. 194–205, Jan 1994.
[16] M. McLaren, Y. Lei, N. Scheffer, and L. Ferrer, “Application
of convolutional neural networks to speaker recognition in noisy
conditions,” pp. 686–690, 01 2014.
[17] A. Mı́nguez-Sánchez, “Detección de estrés en señales de voz
[Stress detection in voiced signals],” p. 86, 06 2017. [Online].
Available: https://github.com/minguezalba/Stress Detection
[18] R. Bittner, E. Humphrey, and J. Bello, PySOX: Leveraging the
Audio Signal Processing Power of SOX in Python. International
Conference on Music Information Retrieval (ISMIR-16), 8 2016.
[19] A. Aguiar, M. Kaiseler, C. M, J. Silva, M. H, and P. Almeida,
“Voce corpus: Ecologically collected speech annotated with
physiological and psychological stress assessments.” 05 2014.
[Online]. Available: https://repositorio-aberto.up.pt/bitstream/
10216/85669/2/133351.pdf

19
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Bilingual Prosodic Dataset Compilation for Spoken Language


Translation
Alp Öktem1 , Mireia Farrús1 , Antonio Bonafonte2

Universitat Pompeu Fabra, Spain


1
2
Universitat Politècnica de Catalunya, Spain
1
{alp.oktem, mireia.farrus}@upf.edu
2
antonio.bonafonte@upc.edu

Abstract works [5, 6] rely on small data that is created in labora-


tory conditions and thus partially reflecting naturalness
This paper builds on a previous methodology that ex- of conversational language.
ploits dubbed media material to build prosodically an- Our previous work [7] outlined a methodology for
notated bilingual corpora. The almost fully-automatized compiling such corpora without the need for preparing
process serves for building data for training spoken lan- bilingual prompts and a recording process. Instead, it
guage models without the need for designing and record- exploits professionally created bilingual material that is
ing bilingual data. The methodology is put into use by available through dubbed movies. Dubbing is a carefully
compiling an English-Spanish parallel corpus using a re- designed process where the movie content is first trans-
cent TV series. The collected corpus contains 7000 par- lated and then acted by professionals to reflect original
allel utterances totaling to about 10 hours of data an- movie lines. A methodology that exploits this parallel
notated with speaker information, word-alignments and nature can be used to create bilingual material for any
word-level acoustic features. Both the extraction scripts language pair where dubbing is performed. Using this
and the dataset are distributed open-source for research framework, we present an English-Spanish bilingual cor-
purposes. pus of 7000 parallel TV series segments annotated with
Index Terms: bilingual corpora, spoken machine trans- prosodic features to be used in spoken language transla-
lation, prosody tion research.
Contributions of this work can be summarized
1. Introduction as follows: (1) We cover the shortcomings of the
movie2parallelDB framework [8] introduced in [7] by im-
Recent approaches in speech-to-speech translation re- proving word-alignment and parallel segment extraction
search gave focus on the transfer of para-linguistic in- processes, (2) we expand the methodology to automati-
formation between the languages involved. These ap- cally include speaker information from movie scripts, (3)
proaches extend on the classic pipeline of S2S translation, we present an English-Spanish parallel corpus (Heroes
which consists of automatic speech recognition (ASR), Corpus) of 10 hours of audio together with transcriptions
machine translation (MT) and text-to-speech synthesis and prosodic features available through this link1 .
(TTS), and introduces models that connect directly the
information in input and target speech signal. Prosody,
which is the linguistic information encoded in cues such 2. Methodology
as stress, intonation and rhythm, directly influences the The methodology introduced by Öktem et al. in [7] con-
communicative value of the source utterance and thus sists of three stages: (1) a monolingual step, where au-
needs to be carried to the output to achieve a complete dio+text segments are extracted from the movie in both
translation. For example, the effect of emphasis trans- languages using transcripts and cues in subtitles, (2)
fer in S2S translation is shown to influence directly the prosodic feature annotation and (3) alignment of mono-
quality of translation [1]. lingual material to extract the bilingual segments. They
Text data to train machine translation models are discuss some shortcomings of their methodology. Firstly,
collected from utterances carrying the same linguistic in- they report that the word-aligner tool they use is not pre-
formation in different languages. Same way, data-driven cise and fast enough. Secondly, they don’t address the
models that deal with prosody transfer necessitate audio problem of subtitle/audio mismatch in the dubbed lan-
data that not only carries the same linguistic information, guage. This problem occurs because translation of the
but also the same para-linguistic content, encoded in each original movie script for subtitling and dubbing are inde-
language’s prosody. For example, Anumanchipalli et. al pendent processes that require care in different aspects.
[2] collected 200 parallel sentences from a flight magazine While text translation for subtitles can be straightfor-
and recorded using a bilingual speaker to achieve intent ward, dubbing requires the translated utterances to be
transfer in S2S translation. Truong et. al [3] recorded in sync with the lip movements of the actors. Because
966 parallel segments with acted emphasis in order to of this reason, many times the dubbed sentences differ
achieve emphasis transfer. A fairly recent project SIWIS substantially from subtitles.
[4] that focuses on translation of Swiss languages had 171 Another problem in their approach is in the bilingual
prompts with emphasis instructions recorded by many
speakers as their training data. These and other similar 1 http://hdl.handle.net/10230/35572

20 10.21437/IberSPEECH.2018-5
L2 prosodic features and for segmenting multi-speaker por-
L1 tions, we have used a speech-to-text aligner software.
audio Word Multi-speaker segments were split from the words fol-
subtitles Segment extraction alignment lowing speech-dashes. For merging incomplete segments,
punctuation information was used.
script Speaker
Prosody annotation annotation 2.2. Speech-to-text alignment

monolingual segments For speech-to-text alignment, we used the open-source


tool Montreal Forced Aligner (MFA) [9]. Forced align-
Bilingual Alignment
ment process is built on an automatic speech recognition
L1 bilingual segments L2 system and requires its own acoustic models and a pro-
nunciation dictionary. Although pre-trained models for
both English and Spanish is provided through the tool’s
Figure 1: Overall corpus extraction pipeline. Audio
website2 , Spanish pronunciation dictionary isn’t openly
excerpts are first processed in each language and then
available. For this reason, we have created a Spanish pro-
aligned to obtain bilingual segments.
nunciation dictionary3 that uses the same phoneme set as
MFA using word list from the open-source spell checker
tool ISpell 4 and obtaining their phonetic transcriptions
segment alignment stage. In order to extract the paral- using TransDic 5 .
lel sentences from two languages, they first translate the
sentences in one language to the other using a machine
2.3. Speaker annotation through scripts
translation system. Then, sentences ordered in both lan-
guages are matched by selecting most similar pairs of Movie scripts, which contain dialogue and scene infor-
sentences in terms of a translation scoring metric. As mation, are valuable pieces of information for determin-
sentence structures can differ between two scripts, this ing the segment speaker labels. Scripts follows approxi-
approach employing an unreliable MT system can lead mately the same format: Actor/actress name is followed
to mismatched segments. by the line they say. And in between, there might be
This section explains how we addressed these short- non-spoken information in brackets. An example excerpt
comings and also ensured the following requirements for from a movie script of TV series Heroes is given below:
the segments to be extracted: (1) They should con-
tain the utterance of only one speaker, (2) transcriptions
Claire: What did you do? What the hell is going
should be exactly what is being spoken in the audio, (3)
on?
it should have f0/intensity and speech rate information
[Caption: Manhattan 16 years ago]
labeled on word level.
Noah: (in Japanese) We think she died in the
The overall scheme of the corpus extraction method- fire.
ology is illustrated in Figure 1. Claire: Dad? (Hiro covers her mouth to be quiet
.)
2.1. Audio segment mining using subtitles Kaito: (in Japanese) Once again. Not a request.
Subtitles are the source for obtaining both (1) audio tran- (Kaito hands baby Claire to Noah.)
scriptions, (2) timing information related to utterances in
a movie. These information are contained in a standard
Unlike subtitles, scripts don’t have timing informa-
srt subtitle file, entry by entry like the structure below:
tion. In order to map subtitle segments with the speaker
1 information we followed an automatic procedure. We
00:00:09,980 --> 00:00:12,256 first removed all non-spoken text, which is included in
Please, tell me who I am, brackets. Then, speaker tags and corresponding lines are
2 extracted with regular expressions depending on the for-
00:00:12,540 --> 00:00:13,974 mat of the script. Next, segments coming from subtitles
and what the future holds. are mapped one by one to lines in the script. If 70%
Where are we? of the words in a subtitle segment is included in a script
3 turn, then the segment is labeled with the speaker of that
00:00:14,740 --> 00:00:16,572 turn. We found this metric to successfully label 95% of
-We’re in New York. the segments.
-Where is everyone? Scripts are usually only available in the original lan-
guage. The dubbed language segments are labeled the
Each subtitle entry is represented by an index, time same as their matches after the alignment step.
cues and the script being spoken at that time in the
movie. The script portion can consist of multiple (#2), 2 https://montreal-forced-aligner.readthedocs.io/
complete (#2,3) or incomplete sentences (#1) and from 3 Resource available in: https://github.com/TalnUPF/
single (#1,2) or multiple speakers (#3). Using only the phonetic_lexica
time cues for extracting audio segments with complete 4 https://www.gnu.org/software/ispell/
sentences of a single speaker does not suffice. 5 https://sites.google.com/site/juanmariagarrido/
For both the objective of obtaining word-level research/resources/tools/transdic

21
2.4. Word-level acoustic feature annotation Although the TSure threshold catches most of the
one-to-one mapping segments, we realized that many of
Each word in the extracted segments is automatically an-
them fall below this threshold even if they map. So,
notated with the following acoustic features: mean fun-
we added another decision step that if one-to-one map-
damental frequency (f0), mean intensity, speech rate and
ping correlation scores higher than merged pairings and
silence intervals (pauses) before and after. The first two
it scores above a TOK threshold, then it is preferred as a
features are extracted with the ProsodyTagger toolkit
matched pair.
[10] built on Praat [11]. Pause information is calculated
from word-boundary information and speech rate is cal-
2.6. Output format
culated using:
We needed to store the corpus segments in a convenient
#syllables in word way to use with machine learning based applications. We
word speech rate = (1)
word duration used the Proscript library [12] for storing the enhanced
To represent speaker independent, perceptual acous- transcripts. This library makes it possible to store and
tic variations in the segments, both f0 and intensity val- manipulate speech transcript related data. The segments
ues are converted into logarithmic semitone scale relative are stored in csv files that keep the information listed in
to the speaker norm value. Thus, speaker mean values Table 1. A csv file containing all the segments is created
were represented by zero values in both cases. Semitone for each episode as well.
values are calculated with the corresponding formula:
x Table 1: Segment information kept in a Proscript format
semitone(x, norm) = 12 ∗ log( ) (2) csv file.
norm
2.5. Cross-lingual segment alignment based on Information Details
subtitle cues
word tokenized
The first three methodologies presented in this section id unique word id
dealt with extraction of segments in each language. This timing start and end times
subsection explains how segments extracted for each lan- pause coming before and after
guage are aligned to create the bilingual segment pairs. punctuation attached to beginning and end
We have developed an aligning process based on tim- f0 in Hertz and log-scale (semitones)
ing information of the extracted segments. Note that intensity in Decibels and log-scale
the segment alignments can be one-to-one, one-to-many, speech rate relative to syllables
many-to-one or many-to-many depending on the sen-
tencing structure in the subtitles. To create our own
alignment algorithm based on time cues, we first defined
a metric that measures the correlation percentage be- 3. Compiling the Heroes corpus
tween two sets of ordered segments S=hs1 , ..., sN i and
We put our methodology into practice by compiling a cor-
E=he1 , ..., eN i:
pus from the science fiction TV series Heroes 6 . Originat-
correlating ing from United States, Heroes ran in TV channels world-
segments correlation = max(0, × 100) (3) wide between the years 2006 and 2010. The whole series
span
consists of 4 seasons and 77 episodes and is dubbed into
correlating = min(esN , seN ) − max(ss1 , ss1 ) (4) many languages including Spanish, Portuguese, French
and Catalan. Each episode runs for a length of 42 min-
span = max(eeN , seN ) − min(es1 , ss1 ) (5) utes.
whereesx andeex denote the starting and ending time We chose this series as we had access to the DVD’s
of the xth segment in set E, ssx and sex denote the starting with Spanish dubbing. Also, we found it to have the
and ending time of the xth segment in set S. Spanish subtitles closest to the Spanish dubbing scripts
The alignment procedure is as follows: Two indexes among other series.
iE , iS are kept which slide through the segments of each
language. First, segments corresponding to each index 3.1. Raw data acquisition
are checked if they correlate more than the TSure thresh-
The DVD’s of the series were obtained from the Pompeu
old. If they do, they are assigned as a one-to-one matched
Fabra University Library. Episodes were extracted using
pair. If not, the possibilities of one-to-many, many-to-
the Handbrake software and were saved as Matroska for-
one or many-to-many matches are considered. This is
mat (mkv) files. Mkv files can hold multiple channels of
done through computing the correlations between com-
audios and subtitles embedded in it like DVDs. In order
binations of the current and two following segments and
to run our scripts we first needed to extract the audio
selecting the most correlating segment set pair. While
and subtitle pairs for both languages. Audio is extracted
considering combinations of the segments it is made sure
using the mkvextract command line tool7 . As subtitles
that two merged segments belong to the same speaker
were embedded as bitmap images in the DVD, we had to
and are not more than 10 seconds far from each other.
If the combined segment set pair with highest correla- 6 Produced by Tailwind Productions, NBC Universal Tele-
tion has a correlation of more than TM erged threshold, vision Studio (2006-2007) and Universal Media Studios (2007-
then the combinations are merged into one segment and 2010)
paired with each other. 7 https://mkvtoolnix.download/

22
run optical character recognition (OCR) in order to get English Spanish
srt format subtitles. As OCR is an error-prone process, Avg. # sentences (subtitles) 647 554
the resulting srt files needed to be spell checked. Avg. # sentences
628 513
We collected English and Spanish audio of 21 episodes (extracted)
totaling to 25 hours of raw audio and their corresponding Avg. # segments 526 459
subtitles. The episode scripts were obtained from a fan- Avg. # parallel segments 334
site in the Internet8 . Table 4: Averages numbers for each episode.

3.2. Manual subtitle correction work


4. Discussion
A speech corpus necessitates properly transcripted
speech segments. Our method is based on obtaining tran- The first version of the Heroes corpus shows that our
scriptions from subtitles. Although subtitles are highly automated method for bilingual corpus building is suc-
reliable sources for obtaining proper transcriptions in the cessful in terms of the quality of the segments extracted.
original language of the movie, this is not the case in the Our manual inspections show that the segments are cor-
dubbed languages. This is due to the fact that dubbing rectly aligned and the transcriptions are correctly stored
transcript needs to satisfy visual alignment such as lip with the audio segments.
movements, whereas subtitles do not. Also, subtitles are The Spanish subtitle correction task was the only
often done in a more concise way to facilitate easy read- time-consuming part of the whole process. However, it
ing. In our case, we observed that the Spanish subtitles showed that it is also useful for obtaining clean paral-
were matching the Spanish audio in approximately 80% lel segments. Subtitle segments that were removed dur-
of the cases. To accommodate this issue, we manually ing the correction process ensured the elimination of un-
corrected the Spanish subtitles to match with the Span- wanted audio portions.
ish audio. Both subtitle transcripts and timestamps had We can interpret Table 4 to show us the amount of
to be corrected. This process was done using a subtitle information loss at various stages. The first one being
editing program Aegisub9 . the word-alignment process where in average 5% of the
An advantage the manual correction process gives is sentences are lost due to the failure of the word aligner
the opportunity to filter out unwanted audio portions in segmenting words. We found out that many of the
that would end up in the corpus. Subtitle segments that segments that are lost this way were either noisy or erro-
contained noise and music, overlapping or unintelligible neous. The biggest loss happens at the stage of alignment
speech and speech in other languages (e.g. Japanese) were where in average 30% of the segments in each language
removed during this process. The spell checking and are left unaligned. This percentage is directly affected by
timestamps and script correction of 21 episodes was done the alignment parameters explained in Section 2.5. For
by two annotators and took 60 hours in total. example, selecting a lower TSure leads to detecting more
aligned segments but also to more mismatches. A simi-
3.3. Heroes corpus in numbers lar logic applies to TOK . Also, choosing a lower TM erged
We present the statistics of the first preparation sprint leads to more coverage of the sentences but more as com-
of The Heroes Corpus. 21 episodes from season 2 and binations with others, leading to fewer and longer seg-
season 3 were processed. The totaled audio durations of ments. After experimenting with a handful of parameter
7000 parallel segments approaches 10 hours (see Table combinations, we decided on this configuration for ob-
2). Counts of several linguistic units in the final parallel taining the corpus presented in this paper: TSure = 70%,
corpus are presented in Table 3. A summary of how much TM erged = 80% and TOK = 30%.
of the content in one episode ended up in the dataset in
average is presented in Table 4. 5. Conclusions
English Spanish We have presented an English-Spanish bilingual corpus
Total duration 4:45:36 4:43:20 of dubbed TV series content. The corpus that consists
Avg. duration/segment 0:00:02.44 0:00:02.42 of 7000 parallel audio segments with transcriptions and
Table 2: Corpus duration information. annotated prosodic features is made openly available. We
hope both our methodology and the corpus we compiled
be useful for the speech-to-speech translation research
English Spanish
community.
# words 56320 48593
# tokens 72565 63014
# sentences 9892 9397 6. Acknowledgements
Avg. # words/sentence 5.69 5.17
Special thanks to annotators Sandra Marcos Bonet and
Avg. # words/segment 8.04 6.94
Laura Gómez Fisas for their work during the Spanish
Avg. # sentences/segment 1.41 1.34
subtitle correction process. The annotation work carried
Table 3: Word, token, sentence counts and average word by the annotators was financed with the 2018 Maria de
count for parallel English and Spanish segments. Maeztu Reproducibility Award from Department of In-
formation and Communication Technologies of Universi-
tat Pompeu Fabra received by the first author. The sec-
8 https://heroes-transcripts.blogspot.com/ ond author is funded by the Spanish Ministry through
9 http://www.aegisub.org/ the Ramón y Cajal program.

23
7. References
[1] A. Tsiartas, P. G. Georgiou, and S. S. Narayanan, “A
study on the effect of prosodic emphasis transfer on over-
all speech translation quality,” in 2013 IEEE Interna-
tional Conference on Acoustics, Speech and Signal Pro-
cessing, May 2013, pp. 8396–8400.
[2] G. K. Anumanchipalli, L. C. Oliveira, and A. W. Black,
“Intent transfer in speech-to-speech machine transla-
tion,” in Spoken Language Technology (SLT) Workshop.
IEEE, 2012, pp. 153–158.
[3] Q. T. Do, S. Sakti, and S. Nakamura, “Toward expres-
sive speech translation: A unified sequence-to-sequence
lstms approach for translating words and emphasis,” in
INTERSPEECH, 2017.
[4] J.-P. Goldman, P.-E. Honnet, R. Clark, P. N. Garner,
M. Ivanova, A. Lazaridis, H. Liang, T. Macedo, B. Pfis-
ter, M. S. Ribeiro, E. Wehrli, and J. Yamagishi, “The si-
wis database: a multilingual speech database with acted
emphasis,” 2016.
[5] P. D. Agüero, J. Adell, and A. Bonafonte, “Prosody
generation for speech-to-speech translation,” in Interna-
tional Conference on Acoustics, Speech, and Signal Pro-
cessing (ICASSP), vol. 1. IEEE, 2006, pp. 557–560.
[6] T. Kano, S. Takamichi, S. Sakti, G. Neubig, T. Toda,
and S. Nakamura, “An end-to-end model for cross-
lingual transformation of paralinguistic information,”
Machine Translation, Apr 2018. [Online]. Available:
https://doi.org/10.1007/s10590-018-9217-7
[7] A. Öktem, M. Farrús, and L. Wanner, “Automatic ex-
traction of parallel speech corpora from dubbed movies,”
in Proceedings of the 10th Workshop on Building and Us-
ing Comparable Corpora (BUCC), Vancouver, Canada,
2017, pp. 31–35.
[8] A. Öktem, “movie2parallelDB: Automatic parallel
speech database extraction from dubbed movies,”
2018. [Online]. Available: https://github.com/alpoktem/
movie2parallelDB
[9] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and
M. Sonderegger, “Montreal Forced Aligner: Trainable
text-speech alignment using Kaldi,” in Proc. Interspeech,
2017, pp. 498–502.
[10] M. Dominguez, M. Farrús, and L. Wanner, “An
automatic prosody tagger for spontaneous speech,” in
Proceedings of COLING 2016, the 26th International
Conference on Computational Linguistics: Technical
Papers. The COLING 2016 Organizing Committee,
2016, pp. 377–386. [Online]. Available: http://www.
aclweb.org/anthology/C16-1037
[11] P. Boersma and D. Weenink, “Praat: Doing phonet-
ics by computer [Computer software], retrieved from
http://www.praat.org/,” 2017.
[12] A. Öktem, “Proscript: Python library for prosodic
annotation of speech segments,” 2018. [Online]. Available:
https://github.com/alpoktem/proscript

24
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Building an Open Source Automatic Speech Recognition System for Catalan


Baybars Külebi1 , Alp Öktem2,1
1
Col·lectivaT SCCL, Barcelona, Spain
2
Universitat Pompeu Fabra, Barcelona, Spain
bkulebi@collectivat.cat, alp@collectivat.cat

Abstract only provided through a centralized server with a fee or a us-


age limit and does not guarantee data privacy. Also, the fact
Catalan is recognized as the largest stateless language in Eu- that they are closed-source clears away any possibility for cus-
rope hence it is a language well studied in the field of speech, tomization to different needs. Regarding these, although there
and there exists various solutions for Automatic Speech Recog- seems to be reasonable research and development in Catalan
nition (ASR) with large vocabulary. However, unlike many of ASR systems, there exists no project with an emphasis on free
the official languages of Europe, it neither has an open acoustic and open sourced access to its resources.
corpus sufficiently large for training ASR models, nor openly In this work, we present our development of an open source
accessible acoustic models for local task execution and personal large vocabulary ASR system for Catalan. To build a free and
use. In order to provide the necessary tools and expertise for the open system, we incorporated a variety of free tools and re-
resource limited languages, in this work we discuss the develop- sources for our use. We chose to use the HMM-based speech
ment of a large speech corpus of broadcast media and building recognition system CMU Sphinx principally for its practicality
of an Catalan ASR system using CMU Sphinx. The resulting in application development. Although state-of-the-art systems
models have a WER of 35,2% on a 4 hour test set of similar are neural network based, CMU Sphinx is still a popular choice
recordings and a 31.95% on an external 4 hour multi-speaker as an ASR toolkit for its computationally low-cost architecture,
test set. This rate is further decreased to 11.68% with a task active community of developers and rich documentation. As for
specific language model. 240 hours of broadcast speech data training data we exploited the broadcast media material pub-
and the resulting models are distributed openly for use. licly available through the Catalan public television TV3. Both
Index Terms: speech recognition, audio corpus, Catalan, open acoustic and text resources were collected automatically and
source processed for model training.
With the motivation of making the language accessi-
1. Introduction ble to speech technology developers, we have made both
Development of technology that involves spoken input necessi- the models and the dataset available online. The ready-to-
tates three components in order to accomplish the conversion of use models, instructions for deployment and a basic script
human speech to machine readable form: (1) Automatic speech for decoding are distributed in https://github.com/
recognition (ASR) system, (2) acoustic model and (3) language collectivat/cmusphinx-models. 240 hours of broad-
model. There exists various open source alternatives for ASR cast speech data collected during this process is also accessible
systems such as CMU Sphinx, Kaldi and DeepSpeech, however, through this repository.
publicly available acoustic and language models to use with
these applications are mainly available for highly-resourced lan- 2. Speech Recognition System
guages. As speech corpora development needed to build the
acoustic models is a high-cost process, it is harder to find open Basic architecture of an ASR system consists of five elements as
source models for lesser-used languages. This puts these lan- can be seen in the Figure 1. Before the central decoder can ac-
guages in major disadvantage in terms of accessibility in tech- cept speech signals, they need to be converted into a sequence of
nologies with a voice interface. fixed size acoustic vectors through a process called feature ex-
Catalan with an estimated 9.1 million speakers among 4 dif- traction. Later the recognizer makes use of the acoustic model,
ferent countries1 , is the ninth most spoken language in Europe the language model and the pronunciation dictionary in order
and second in Spain. There has been development in Cata- to decode the vector sequences into the most likely word se-
lan speech technology both from inside and outside of Cat- quences they represent. In most cases, a language model con-
alonia, and both research and commercial oriented. Research sists of N-grams in which the probability of occurrence of a
projects that involve Catalan speech recognition include large- word depends on its N − 1 predecessors and the pronuncia-
vocabulary TECNOPARLA [1] and an earlier telephone conver- tion dictionary provides the mapping between each word and
sation targeted project [2], both from Universitat Politécnica its phonetically written form.
de Catalunya (UPC). The speech recognition models resulting The core part of the recognition relies on how the speech
from these projects are not available for public use. Only the is modeled, which depends on how acoustic model parameters
corpus used to develop the former one is available on demand are defined and furthermore how they are trained. In many clas-
and is of limited size. On the commercial spectrum, cloud ASR sic ASR systems, Hidden Markov Models (HMMs) are used for
services by companies like Google, Facebook and Speechmat- modeling the phones (or tri-phones) as finite states each with as-
ics provide services in Catalan. However, these services are sociated probability distributions. Within the ASR architecture,
HMMs are used for evaluating the probability of given speech
1 According to the estimate by Secretary of Language Policies of features (forward-backward algorithm). Furthermore, their use
Government of Catalonia permit the estimation of the the best model parameters for pro-

25 10.21437/IberSPEECH.2018-6
Figure 1: General architecture of an ASR system. (Block diagram: speech audio goes through feature extraction (front-end) to produce feature vectors, which the decoder converts into graphemes using the acoustic models, language models and pronunciation dictionary built from a speech corpus.)

For our task at hand we decided to use the CMU Sphinx ASR system, which has been developed at Carnegie Mellon University over several decades, starting from the original Sphinx-I [4] up to its more recent and advanced incarnations, Sphinx-3 [5] and Sphinx-4 [6]. For this work, we specifically used the PocketSphinx package within Sphinx-5prealpha, which incorporates algorithms and code from the previous CMU Sphinx packages [7, 8, 9]. Comparison studies of open-source ASR tools trained on similar datasets show that Kaldi is far superior in terms of recognition accuracy, mostly due to its inclusion of advanced techniques such as Deep Neural Networks (DNNs) [10]. However, a quick look at the standard Kaldi recipes shows that its training tasks are mostly built to be executed on CPU and GPU clusters, whereas CMU Sphinx can train reasonably detailed acoustic models on desktop systems. In addition, although DNN-based models are the state of the art, PocketSphinx holds the advantage of being easily deployable in resource-limited environments such as hand-held devices. Unfortunately, due to its dependence on external libraries such as LAPACK, Kaldi's usage on hand-held devices ends up being problematic due to memory limitations [11].

3. Data collection and preparation

We have decided to exploit readily available data on the Internet for compiling the necessary training data. We found this approach to be less costly than approaches that involve script preparation and recording.

3.1. Raw data collection

In order to gather Catalan audio with transcriptions, we have taken advantage of the fact that Catalan public television makes its programs accessible online to the general public². Firstly, we revised the full list of available programs and eliminated those with inappropriate subtitles, such as live interviews, talk shows and daily news, which require the captioning to be produced spontaneously; these captions are not synchronized and only approximate, hence those programs were not considered. Secondly, we chose programs with modalities of speech that are useful for ASR training. This meant looking for programs with natural conversational language, such as interviews, or programs with clear diction and a minimum amount of background noise or music. Based on these priorities, we downloaded approximately 490 hours of video with their corresponding subtitles in srt format from 17 programs; the distribution of topics and durations for each program can be seen in Table 1.

² http://www.ccma.cat/tv3/
3.2. Acoustic data preparation

For training the acoustic models of an ASR system, audio segments with lengths between 5 and 20 seconds are needed. The video files we downloaded have lengths ranging from approximately 7 minutes to 75 minutes. In order to cut these video files into meaningful segments of the desired lengths, we took advantage of the subtitling.
Before undertaking the segmentation process, we first had to verify that the subtitle files correspond correctly to the video recordings. In this stage we did not assess the quality of the subtitle cues, but rather checked for more general problems, which we determined to be three: (1) the downloaded subtitle file is empty, (2) the subtitles correspond to a different program, or (3) the subtitles correspond to the correct video but with wrong timestamps, possibly due to a wrong frames-per-second assumption made during the export. In order to detect these problems, we first checked whether the subtitle file had cues or not. If it did, we looked at the end timestamp of the last cue and compared it with the duration of the video. In the case of an empty subtitle file, or of subtitle cues extending beyond the video duration, we eliminated the video and its subtitles from further processing.
The training segments are defined as word sequences which start and end with silences. We used a minimum separation duration of 100 ms, and any sequential cues with separations smaller than this amount are grouped together in batches. Our assumption was that longer separations are more likely to signal the beginning and end of sentences, or simply to represent pauses in speech, hence practically defining the start and end of the audio segments. After the cues are grouped into batches, their durations are further evaluated, and batches whose durations did not fall between 5 and 20 seconds were eliminated.
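As an illustration only, the cue-grouping logic just described can be sketched as follows in Python. This is not the authors' actual script; cues are assumed to be (start, end, text) tuples in seconds, already sorted by start time.

MIN_GAP = 0.100          # assumed silence gap (100 ms) marking a segment boundary
MIN_LEN, MAX_LEN = 5.0, 20.0

def batch_cues(cues):
    # Merge consecutive cues separated by less than MIN_GAP, then keep only
    # batches whose total duration is usable for acoustic training.
    batches, current = [], None
    for start, end, text in cues:
        if current is None:
            current = [start, end, [text]]
        elif start - current[1] < MIN_GAP:
            current[1] = end
            current[2].append(text)
        else:
            batches.append(current)
            current = [start, end, [text]]
    if current is not None:
        batches.append(current)
    return [(s, e, " ".join(t)) for s, e, t in batches
            if MIN_LEN <= e - s <= MAX_LEN]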
Before starting the actual segmentation of the video files, the text content is further filtered. This includes removing explanatory captions, i.e. any content within parentheses or brackets, and cues starting with #. Additionally, we eliminated the cues whose content was presumably not in Catalan. As a final step, we normalized the text by eliminating punctuation and symbols. There are exceptions in this stage due to the nature of Catalan morphology, since dashes and apostrophes can be necessary for combining pronouns with verbs.
Finally, the complete audio extracted from each video file is cut according to this list of calculated and filtered segments. During the actual segmentation process, we convert the audio files to a mono channel with a sampling rate of 16 kHz.

3.3. Text and language data preparation

Before starting the actual training process using the acoustic data, a phonetic lexicon needs to be built, matching each word as it appears in the transcriptions with its phonetic description. This was achieved using the grapheme-to-phoneme conversion tool embedded in the rule-based speech synthesis tool espeak. To construct the phonetic lexicon, we first extracted all the unique words in the transcriptions, converted them to their IPA (International Phonetic Alphabet) written form, and finally converted these IPA versions to the CMU Sphinx readable format. In total we used 37 phonemes, consistent with the literature on the Catalan phonetic corpus [12].
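A minimal sketch of this lexicon-building step is given below, assuming an espeak build that accepts "-v ca -q --ipa"; the IPA-to-Sphinx mapping shown is a placeholder with a few illustrative entries, not the 37-phoneme table used in this work, and real IPA output needs more careful tokenization than the character loop used here.

import subprocess

IPA_TO_SPHINX = {"ɛ": "E", "ʒ": "Z", "ʎ": "LL"}   # illustrative entries only

def ipa_for(word):
    out = subprocess.run(["espeak", "-v", "ca", "-q", "--ipa", word],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def lexicon_entry(word):
    phones = [IPA_TO_SPHINX.get(ch, ch) for ch in ipa_for(word)]
    return f"{word} {' '.join(phones)}"

with open("unique_words.txt") as words, open("catalan.dic", "w") as dic:
    for word in words:
        dic.write(lexicon_entry(word.strip()) + "\n")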

Table 1: The programmes used in constructing the ASR system. The table shows their respective themes, total downloaded durations and the final durations used for training.

Programme name                  Theme                      Downloaded Duration   Filtered Duration
Zona UFEC Sports 22:42:22 8:31:20
Art Endins Outreach 6:22:9 2:04:31
El dia de demà Current events 8:55:59 3:22:25
Quan arribin els marcians Culture (Current Events) 17:45:38 7:26:05
Tot un món Current events 5:20:12 1:39:16
Valor afegit Current events 47:45:34 26:20:37
Afers Exteriors Current events, Outreach 53:01:55 30:31:17
A la presò Outreach 5:55:23 0:53:01
Arts i Oficis Outreach 12:42:31 6:00:14
Benvinguts a l’hort Gastronomy, Outreach 11:35:16 6:58:15
Collita pròpia Outreach 10:01:30 6:09:03
Catalunya des del mar Current events, Outreach 5:40:51 2:07:23
De llit en llit Outreach 3:00:33 1:02:05
Detectiu Outreach, Fiction 6:47:56 2:22:11
Economia en colors Entertainment 16:03:59 7:44:17
Els dies clau Outreach 6:19:19 3:23:36
Fotografies Outreach 5:31:49 2:29:39
Generació Selfie Outreach 6:51:19 2:51:46
Històries de Catalunya Outreach 11:28:34 5:37:31
La salut al cistell Gastronomy, Outreach 12:04:46 7:49:46
L’ofici de viure Outreach 4:19:38 1:26:45
Millennium Outreach 29:25:58 16:11:53
Trinxeres Outreach 7:15:55 2:37:20
Programa sindical CCOO Outreach 12:16:42 7:03:45
Quèquicom Outreach 52:23:31 27:10:41
Sota terra Outreach, Entertainment 11:36:03 4:07:41
Veterinaris Outreach 49:42:21 21:20:23
Via llibre Outreach, Entertainment 46:59:25 24:52:51
Total 489:57:23 240:15:52

Additional information on the structure of the Catalan language is necessary for the decoding phase. Within an ASR system, the statistical information on grammar and syntax is represented through the language model, and these models can be prepared using a sufficiently large text corpus. In this work, we took the subtitles of our audio corpus and merged them with the Catalan OpenSubtitles corpus [13] to build the basis for our language models.
The final corpus is cleaned of all symbols and punctuation, and numbers are normalized using the espeak tool. The complete corpus has 5.3 million tokens, with around 100,000 unique tokens, which we used for compiling the phonetic lexicon. Using approximately 58k words (those which appear at least twice in the corpus) we prepared one 3-gram (OT_large_3gram) and one 4-gram (OT_large_4gram) language model in ARPA format using the CMU Language Model toolkit (CMUCLMTK).
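The CMUCLMTK pipeline behind this step can be sketched as follows; the tool names (text2wfreq, wfreq2vocab, text2idngram, idngram2lm) are the standard ones, but the exact options may vary between toolkit versions, and the file names are placeholders.

import subprocess

def run(cmd, **kw):
    subprocess.run(cmd, check=True, **kw)

corpus, vocab, arpa = "catalan_corpus.txt", "ot_large.vocab", "OT_large_3gram.arpa"

with open(corpus) as fin, open("corpus.wfreq", "w") as fout:
    run(["text2wfreq"], stdin=fin, stdout=fout)                    # word frequencies
with open("corpus.wfreq") as fin, open(vocab, "w") as fout:
    run(["wfreq2vocab", "-top", "58000"], stdin=fin, stdout=fout)  # 58k vocabulary
with open(corpus) as fin:
    run(["text2idngram", "-vocab", vocab, "-n", "3",
         "-idngram", "corpus.idngram"], stdin=fin)                 # id n-grams
run(["idngram2lm", "-idngram", "corpus.idngram", "-vocab", vocab,
     "-n", "3", "-arpa", arpa])                                    # ARPA 3-gram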
4. Training

For our training process we used the standard CMU Sphinx training steps with very minor changes. The training starts with the extraction of Mel-frequency cepstral coefficients (MFCC) between 130 and 6800 Hz, i.e. 12 cepstra plus C0 as the energy component, together with their deltas and delta-deltas, adding up to 39 parameters in total (1s_c_d_dd). For the acoustic model training, our Gaussian mixture model contains 32 Gaussian densities and 6000 tied HMM states.
In our process, we started by estimating the transition probabilities of the Context-Independent (CI) HMMs for forced alignment of the acoustic data. For the forced alignment itself we used the sphinx3_align executable, which had to be compiled apart from the Sphinx-5prealpha library. In this step, the audio files are aligned with their respective transcriptions using the CI models; when there is a mismatch between the transcription and the alignment result, the audio file is eliminated from the following steps. After the non-aligning segments were eliminated we were left with 240 hours of audio in total. The final amount of audio per programme used is shown in Table 1.
The transition probabilities of the CI HMMs are re-estimated using the filtered data set, and following this phase a complete list of tri-phones (58,289 in our case) is built and their transition probabilities are estimated in the form of Context-Dependent (CD) HMMs. These tri-phones account for both between-word and within-word contexts; however, since the training data might not account for all the possibilities, the unseen tri-phones are tied to the seen tri-phones using decision trees.

We performed our training in a resource-limited environment. On four threads of an Intel(R) Atom(TM) CPU N2800 at 1.86 GHz, the whole training process took about 120 hours.

4.1. Evaluation

In order to evaluate the word error rate (WER) of the acoustic models, we wanted to make sure that the test voices do not appear in the training corpus. To evaluate the acoustic models on similar recordings, we downloaded different TV3 programmes that were not used in the training. Four hours of new TV3 recordings evaluated with the OT_large_4gram language model resulted in a WER of 35.2%. For the decoding we used the standard decoding script within CMU Sphinx.
In order to evaluate the accuracy of the models in a cleaner environment, and also to guarantee 100% speaker exclusion, we decided to use another test set for a second evaluation round. For this we used the FESTCAT corpus, which was specifically designed for creating a speech synthesis system for Catalan [14, 15] and consists of 28 hours of recordings from 10 different voices (5 female, 5 male). For evaluating our ASR system we used 4 hours in total from 4 female and 4 male voices with their corresponding transcripts. It should be noted that, due to the clean recording environment, the FESTCAT dataset also represents a more ideal audio quality.
Due to our restricted text corpus, we created another set of language models, in addition to the ones explained in subsection 3.3, specifically for the test decoding. This second set of language models uses a corpus of FESTCAT text plus the corpus explained in subsection 3.3. Using this new corpus we created one 3-gram language model with the most frequent 20k words (OTF_20k_3gram) and two other models with the most frequent 58k words (OTF_large_3gram, OTF_large_4gram), similar to the OT_large models. Whereas with the OT_large_4gram language model we ended up with a WER of 31.95%, the best results were attained by the OTF_large_4gram model, at 11.68%. The results for each language model, with the corresponding real-time decoding factor (xRT) on an Intel i7-4510U 3 GHz quad-core architecture, are shown in Table 2. The high-precision OTF results show that, if the acoustic conditions are perfect and the language models are "in-domain," our acoustic models can recognize voices that are not in their training set reasonably well. In addition, in our case the main factor which improved the recognition precision was the amount of pruning that the corpus was subjected to when constructing the language models. Whereas moving from the most frequent 20k words to the most frequent 58k words makes a considerable improvement, the effect of using 4-grams instead of 3-grams seems to be very small, probably due to our specific test conditions. However, one important difference between the 3-gram and 4-gram models is the xRT, for which the 3-gram models are considerably faster than the 4-gram models. Note that we did not undertake any optimization of the decoding parameters, neither for best precision nor for best computational performance.

Table 2: The WER and xRT results for different language models on the FESTCAT test dataset.

Language Model     WER (%)   xRT
OT_large_4gram     31.95     0.952
OTF_20k_3gram      22.50     0.872
OTF_large_3gram    12.11     0.900
OTF_large_4gram    11.68     1.002
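Decoding with the distributed models is, in essence, a matter of pointing PocketSphinx at the acoustic model, language model and dictionary. The sketch below uses the SWIG-based PocketSphinx Python bindings only as an illustration; the module path, file names and raw-audio input are assumptions, and the newer pocketsphinx 5 API differs.

from pocketsphinx.pocketsphinx import Decoder

# Placeholder paths: acoustic model directory, ARPA language model, lexicon.
config = Decoder.default_config()
config.set_string('-hmm', 'models/ca_broadcast')
config.set_string('-lm', 'models/OT_large_3gram.arpa')
config.set_string('-dict', 'models/catalan.dic')
decoder = Decoder(config)

# The input must match the training format: raw 16-bit mono PCM at 16 kHz.
decoder.start_utt()
with open('segment.raw', 'rb') as audio:
    while True:
        buf = audio.read(2048)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()
print(decoder.hyp().hypstr if decoder.hyp() is not None else '')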
5. Future Work

The most important and basic step for improving our ASR system is to use a better pronunciation dictionary built with a better grapheme-to-phoneme conversion system. In this work, for ease of use, we took advantage of espeak; however, the Festival-based FESTCAT speech synthesis system is specifically implemented for Catalan and allows for a more refined grapheme-to-phoneme conversion. Training the acoustic model with this improved pronunciation dictionary should give better results overall.
The acoustic data itself should be clustered into "clean" and "other" subsets depending on sound quality, background music levels and speaker mistakes, similarly to the LibriSpeech dataset [16]. Additionally, we plan to build a gender diarization model, determining whether the voice in each segment is male or female, in order to assess the gender balance of the whole dataset.
With these acoustic models, it will be possible to align an audio recording with its given text. This process will not only be useful for cleaning the dataset itself, but will also allow extending the current set without relying on the cue start and end times within the subtitles. This implies that further Catalan acoustic data could be assembled using only the audio and its corresponding transcription.
Related to this possibility, another tool we would like to develop is a system for automatic punctuation in Catalan. The readability of recognized transcripts depends a lot on sentence segmentation and correct punctuation. The methods for training punctuation engines using recurrent neural networks (RNNs) are very well developed, especially with the use of a large text corpus [17]. Recently it was also demonstrated that acoustic data with word-aligned transcriptions can be used to create prosody-based punctuation models [18]. For now, models of this type have only been trained for English. With the possibility of doing word-level alignments for Catalan, we will be able to train one for Catalan in the near future.

6. Conclusions

In this paper, we have described the building of an ASR system for a new language, using only publicly available resources. Applying our methodology to Catalan, we compiled a dataset of 240 hours of transcribed broadcast speech and used it to develop large-vocabulary speech recognition models, both of which are distributed openly online. The accuracy of the resulting models shows that they can be a base for speech technology developers to reach the Catalan-speaking community. Building a voice input interface for a desktop or mobile application is as easy as installing the CMU Sphinx toolkit³ and placing the models in its installation directory. The ASR system further gives the possibility to adapt the acoustic and language models to more specialized vocabularies and acoustic environments. We believe that the practical and low-cost setup of CMU Sphinx makes it an important player amongst other ASR engines, despite the more modern neural-network-based alternatives. It keeps its relevance especially for minority languages which have few open acoustic data resources available.

³ https://cmusphinx.github.io/wiki/tutorial/

7. Acknowledgements

This project was funded by Softcatalà. The authors would like to thank Antonio Bonafonte for his guidance during the writing of this paper.

8. References

[1] H. Schulz, M. Ruiz, and J. A. R. Fonollosa, “TECNOPARLA - Speech technologies for Catalan and its application to speech-to-speech translation,” Procesamiento del lenguaje natural, vol. 41, pp. 319–320, Sep 2008.
[2] J. Mariño, J. Padrell, A. Moreno, and C. Nadeu, “Workshop on speech recognition based on very large telephone speech databases,” C. Draxler, 2000, pp. 57–61.
[3] M. Gales, S. Young et al., “The application of hidden markov
models in speech recognition,” Foundations and Trends in Signal
Processing, vol. 1, no. 3, pp. 195–304, 2008.
[4] K.-F. Lee, H.-W. Hon, and R. Reddy, “An overview of the sphinx
speech recognition system,” in Readings in speech Recognition.
Elsevier, 1990, pp. 600–610.
[5] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj,
M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler et al.,
“The 1996 hub-4 sphinx-3 system,” in Proc. DARPA Speech
recognition workshop, vol. 97. Citeseer, 1997.
[6] P. Lamere, P. Kwok, E. Gouvea, B. Raj, R. Singh, W. Walker,
M. Warmuth, and P. Wolf, “The CMU sphinx-4 speech recogni-
tion system,” in IEEE Intl. Conf. on Acoustics, Speech and Signal
Processing (ICASSP 2003), Hong Kong, vol. 1, 2003, pp. 2–5.
[7] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Rav-
ishankar, and A. I. Rudnicky, “Pocketsphinx: A free, real-time
continuous speech recognition system for hand-held devices,” in
Acoustics, Speech and Signal Processing, 2006. ICASSP 2006
Proceedings. 2006 IEEE International Conference on, vol. 1.
IEEE, 2006, pp. I–I.
[8] D. Huggins-Daines and A. I. Rudnicky, “Mixture pruning and
roughening for scalable acoustic models,” in Proceedings of the
ACL-08: HLT Workshop on Mobile Language Processing, 2008,
pp. 21–24.
[9] ——, “Combining mixture weight pruning and quantization for
small-footprint speech recognition,” in Acoustics, Speech and Sig-
nal Processing, 2009. ICASSP 2009. IEEE International Confer-
ence on. IEEE, 2009, pp. 4189–4192.
[10] C. Gaida, P. Lange, R. Petrick, P. Proba, A. Malatawy, and
D. Suendermann-Oeft, “Comparing open-source speech recogni-
tion toolkits,” Tech. Rep., DHBW Stuttgart, 2014.
[11] P. Vojtas, J. Stepan, D. Sec, R. Cimler, and O. Krejcar, “Voice
recognition software on embedded devices,” in Asian Conference
on Intelligent Information and Database Systems. Springer,
2018, pp. 642–650.
[12] I. Esquerra, C. N. Camprub, L. Villarrubia, and P. Len, “De-
sign of a phonetic corpus for speech recognition in Catalan,” in
Workshop on Language Resources for European Minority Lan-
guages at the Conference on Language Resources and Evaluation
(LREC), Granada, 1998.
[13] J. Tiedemann, “News from OPUS - A collection of multilingual
parallel corpora with tools and interfaces,” in Recent Advances in
Natural Language Processing, N. Nicolov, K. Bontcheva, G. An-
gelova, and R. Mitkov, Eds. Borovets, Bulgaria: John Benjamins,
Amsterdam/Philadelphia, 2009, vol. V, pp. 237–248.
[14] A. Bonafonte, J. Adell, I. Esquerra, S. Gallego, A. Moreno, and
J. Pérez, “Corpus and voices for Catalan speech synthesis,” in Pro-
ceedings of LREC Conference 2008, 2008, pp. 3325–3329.
[15] A. Bonafonte, L. Aguilar, I. Esquerra, S. Oller, and A. Moreno,
“Recent work on the FESTCAT database for speech synthesis,” in
Proceedings of LREC Conference 2008, 2009, pp. 3325–3329.
[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib-
rispeech: an ASR corpus based on public domain audio books,”
in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE
International Conference on. IEEE, 2015, pp. 5206–5210.
[17] O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Interspeech 2016, 2016.
[18] A. Öktem, M. Farrus, and L. Wanner, “Attentional parallel RNNs for generating punctuation in transcribed speech,” in 5th International Conference on Statistical Language and Speech Processing SLSP 2017, Le Mans, France, 2017.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Multi-Speaker Neural Vocoder


Oriol Barbany1,2 , Antonio Bonafonte1 and Santiago Pascual1
1
Universitat Politècnica de Catalunya, Barcelona, Spain
2
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
oriol.barbanymayor@epfl.ch, antonio.bonafonte@upc.edu, santi.pascual@upc.edu

Abstract of WaveNet [1], we can inject additional controlling features


in the input that span long-term structures of the speech, like
Statistical Parametric Speech Synthesis (SPSS) offers more frames of acoustic information or pitch contours. In [3], a TTS
flexibility than unit-selection based speech synthesis, which system is presented which consist of two modules. The first
was the dominant commercial technology during the 2000s one transforms the linguistic features (phoneme, stress, length
decade. However, classical SPSS systems generate speech with of sentence, etc.) into an acoustic features sequence. Then this
lower naturalness than unit-selection methods. Deep learning conditions the second module, a SampleRNN, which generates
based SPSS, thanks to recurrent architectures, surpasses classi- the waveform, hence acting as a neural vocoder. A joint op-
cal SPSS limits. These architectures offer high quality speech timization of models improves the generated speech. In this
while preserving the desired flexibility in choosing the param- work we focus on the neural vocoder based on SampleRNN.
eters such as the speaker, the intonation, etc. This paper ex- The main motivation is to generalize the system proposed in [3]
poses two proposals conceived to improve deep learning-based to many speakers as a shared deep neural network (DNN) struc-
text-to-speech systems. First a baseline model, obtained by ture, as it achieves better results in generating quality speech
adapting SampleRNN, making it as a speaker-independent neu- than learning the parameters of a single isolated speaker [5, 6].
ral vocoder that generates the speech waveform from acoustic As in our previous work, it transform acoustic features (MFCC,
parameters. Then two approaches are proposed to improve the F0, etc.) into speech waveform. However, the speaker code is
quality, applying speaker dependent normalization of the acous- added as a new feature to control the identity. Our final goal is
tic features, and the look ahead, consisting on feeding acoustic to control the phonetic content using the acoustic features and
features of future frames to the network with the aim of bet- the speaker characteristics by means of the speaker id feature.
ter modeling the present waveform and avoiding possible dis- This would allow to generate different synthetic voices using
continuities. Human listeners prefer the system that combines limited labelled data, as the neural vocoder is trained on unla-
both techniques, which reaches a rate of 4 in the mean opinion belled data. It could also be applied to voice conversion [7].
score scale (MOS) with the balanced dataset and outperforms In this case, the input to the vocoder are the acoustic features
the other models. of one speaker and the speaker id of other speaker. While the
Index Terms: deep learning, speech synthesis, recurrent neural capability of changing the speaker identity influences the archi-
networks, text-to-speech, SampleRNN, time series tecture and the proposals of this work, this paper is focused on
how to apply the acoustic features to get the best quality, with-
out changing the speaker identity with respect to the speaker
1. Introduction associated with the input acoustic features.
Deep learning has revolutionized almost every engineering This work proposes two contributions: speaker dependent nor-
branch over the latest years, also being successfully applied malization of the acoustic features, and an acoustic-features
to text-to-speech (TTS) where it yields state-of-the-art perfor- look ahead mechanism. The first proposal aims to perform
mance and overcomes classical statistical approaches. The voice conversion. The acoustic features fed to the network and
time series problem has been completely leveraged by recur- used to condition the generated speech, such as pitch or Mel-
rent neural networks (RNNs) and their variants, making them frequency cepstral coefficients (MFCC), depend on the speaker,
lead to very good results in the speech synthesis field. More- which means that the input to indicate the speaker identity is
over, deep generative models can generate speech sample by redundant. This redundancy is identified by the network, which
sample as first proposed in WaveNet [1], which achieved very assigns low weights to the speaker identity and make it irrele-
fine-grained waveform amplitudes and outperformed previous vant. The speaker-dependent normalization aims to give impor-
statistical parametric speech synthesis (SPSS) models as a deep tance to the speaker identity by isolating the features from the
auto-regressive system. This paper exposes two of the propos- speaker and thus forcing the network to use the speaker identity
als presented in the bachelor’s thesis of the first author [2] that to generate natural speech for each user.
were applied to better model the speech generated with a multi- The look ahead approach questions the causality of time series
speaker deep learning-based model. modeling, which is not needed unless input features are ex-
In our previous work [3], SampleRNN [4] was adapted to gen- tracted at real time and therefore not known beforehand. In the
erate coherent speech in Spanish. SampleRNN models the joint case of TTS systems, the text which will be uttered is known
distribution of speech samples using a hierarchical RNN, and before hand and therefore, the acoustic features of all the sig-
it can be used to generate speech-like sequences. As the gen- nal are known or can be predicted before generating the speech
eration is unconditioned (only previous waveform samples are waveform. Moreover, in natural speech, the phonemes sound
inputs), the generated sequence is a random speech-like wave- different depending on the context and thus can change depend-
form with short-time coherent structure that emulates speech ing on future phonemes (co-articulation). By giving informa-
but does not resemble proper spoken contents. As in the case tion of the future behavior of the predicted sequence, there are

no discontinuities and artifacts can be reduced. This translates into better quality speech as rated by human listeners.
The proposals were initially conceived to improve the speech obtained with a deep generative network able to model multiple speakers with the same structure. Nevertheless, the speaker-dependent normalization (see section 2.2) could be used as a new pre-processing technique in a variety of problems, and the look ahead approach (section 2.3) can be generalized to time series modeling. Current state-of-the-art TTS models like WaveNet [1], Tacotron [8] or VQ-VAE [9] already model several speakers with a unique model, but do not apply speaker-dependent normalization, which is shown to deteriorate results. The look ahead proposal is not mentioned either, but it outperformed our baseline model and could be applied to other time series modeling systems.
In the next section, the baseline system is presented first. It consists of SampleRNN [4], extended to generate speech conditioned on acoustic features and speaker identity. Then the speaker-dependent normalization and the look ahead are introduced. In section 3, the experimental setup is described. Section 4 presents the experimental results, which show how both proposals outperform the baseline system.

2. Multi-speaker Network

2.1. Baseline
The proposed neural vocoders are based on SampleRNN, an unconditional end-to-end neural audio generation model [4] that consists of two recurrent modules running at different clock rates, which aim to model the short- and long-term dependencies of speech signals, and one module with auto-regressive multi-layer perceptrons (MLPs) that processes speech sample by sample. The authors of SampleRNN reported that gated recurrent unit (GRU) [10] cells worked slightly better than long short-term memory (LSTM) ones, hence this is the recurrent architecture adopted for this work. The three-tier architecture provides flexibility in allocating the amount of computational resources for modeling different levels of abstraction and is very memory-efficient during training. The final output of the SampleRNN model is the probability of the current sample value conditioned on all the previous values of the sequence, which can be expressed following the chain rule of probability as stated in equation (1). This follows a Multinoulli distribution, which could seem unintuitive given the nature of speech signals, which are real-valued, but achieves better results as it does not assume any distribution shape for the data and can thus more easily model arbitrary distributions. In this work, speech samples are quantized with 8 bits, therefore having 256 possible values. Differing from the linear quantization proposed in SampleRNN, we apply a µ-law companding transformation [11] before classifying into the 256 possible classes, to flatten the Laplacian-like distribution of the speech signals.

P(X) = \prod_{t=1}^{T} P(x_t | x_1, . . . , x_{t-1})    (1)
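A minimal NumPy sketch of this companding and 8-bit quantization step is given below, assuming samples already scaled to [-1, 1]; mu = 255 follows the G.711 convention cited above, and this is an illustration rather than the exact implementation used in this work.

import numpy as np

MU = 255  # 8-bit mu-law

def mu_law_encode(x, mu=MU):
    # x: waveform samples in [-1, 1]; returns class indices in {0, ..., mu}
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=MU):
    y = 2.0 * (q.astype(np.float64) / mu) - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu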
In order to generate coherent spoken contents, the model was conditioned as in [3] on acoustic features obtained with Ahocoder [12], a high-quality harmonics-plus-noise vocoder that predicts a set of features that characterize speech signals. The adapted model with its conditioning inputs is depicted in figure 3. This changes the previous equation, as a new dependence factor is included; the new formulation follows equation (2), where l_t ∈ R^43 stands for a 43-dimensional acoustic vector corresponding to the analysis window of the current sample x_t.

P(X|L) = \prod_{t=1}^{T} P(x_t | x_1, . . . , x_{t-1}, l_t)    (2)

Differing from the original SampleRNN model [4], and apart from the previously mentioned addition of the acoustic conditioners that allow coherent speech to be synthesized, the authors also incorporated the blocks on the left of figure 3. These aim to differentiate among all the speakers of the database by means of embedding an identifier, which is also used to condition the model jointly with the aforementioned Ahocoder features. Hence l_t is augmented to include the speaker identity by concatenating the embedding to the acoustic features, resulting in a vector l̂_t ∈ R^49.

Figure 1: Classical speaker-independent normalization (log-F0 contours for four speakers).
Figure 2: Proposed speaker-dependent normalization (log-F0 contours for the same four speakers).

2.2. Speaker-dependent feature normalization

Features fed to a neural network are often normalized beforehand to control the magnitude of both the activations and the gradients during training. Under the hypothesis of having speaker-dependent features, an independent normalization for each of the speakers was proposed to isolate the speech features from their source. Maximum and minimum values for each of the parameters were found within the training partition, so some features of the validation or test partitions may exceed these bounds. The chosen normalization function was a simple feature scaling following equation (3), which bounds each of the features between 0 and 1. This approach could also be applied with other normalization functions such as the z-score, i.e. statistical normalization. This last option was not tested before the writing of this paper due to the low improvement in results of this modification alone (see table 1). Nevertheless, as can be seen in the same table, this approach outperforms the other models when combined with the look ahead approach (explained in section 2.3). Therefore, a statistical normalization could also be tested in future work.

x̂ = (x − x_min) / (x_max − x_min)    (3)
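The per-speaker feature scaling of equation (3) can be sketched as follows; the array layout, the dictionary of per-speaker statistics and the small epsilon guard against constant features are assumptions made for this illustration.

import numpy as np

def fit_speaker_minmax(features, speaker_ids):
    # features: (N, D) acoustic frames from the training partition;
    # speaker_ids: length-N array of speaker labels.
    stats = {}
    for spk in np.unique(speaker_ids):
        frames = features[speaker_ids == spk]
        stats[spk] = (frames.min(axis=0), frames.max(axis=0))
    return stats

def normalize(features, speaker_ids, stats):
    out = np.empty_like(features, dtype=np.float64)
    for spk, (fmin, fmax) in stats.items():
        mask = speaker_ids == spk
        # equation (3); values outside the training range may fall outside [0, 1]
        out[mask] = (features[mask] - fmin) / (fmax - fmin + 1e-12)
    return out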

This proposal aims to give importance to the speaker identity same amount of samples to train, we choose to use all the avail-
to ideally allow voice conversion without the need of a com- able data per speaker to avoid restricting all of them to only 14
plex mapping of features. Inspiration came from the behav- minutes of speech instead of an hour. The total duration of the
ior of the pitch for every speaker, which is depicted for both whole dataset including the six speakers amounts to 5.25 hours,
speaker-independent and speaker-dependent normalizations in which we divide into 80% for training, 10% for validation and
figures 1 and 2 respectively. These plots illustrate the evolu- 10% for testing.
tion of the logarithmic fundamental frequency for four different
speakers including two males and two females that read the ex- 3.2. Feature Design and Hyper-parameters
act same text and thus are very similar once normalized follow-
ing a speaker-dependent approach. Note that there is some time The acoustic parameters are extracted with Ahocoder in frames
shifting due to different duration of phonemes and pauses but of length 15 ms shifted every 5 ms, obtaining 40 Mel-
the signal is yet very similar. frequency cepstral coefficients, the maximum voiced frequency
After the classical speaker-independent normalization (figure (fv), the logarithmic F0 value and the voiced/unvoiced flag
1), it is very easy to distinguish between females (75, 76) and (uv). To tackle the discontinuity in the logF0 statistics in un-
males (79, 80). This means that it would be impossible to per- voiced signals, the extracted pitch is post-processed with a log-
form voice conversion because the network doesn’t need the linear interpolation for the unvoiced segments following previ-
speaker identifier for being this information implicit in the fea- ous strategies [14].
tures. This is why this redundancy is translated into the futil- All these features are thus scaled following either the proposed
ity of this input observed when trying to change the speaker speaker-dependent or the more classical speaker-independent
identity at will. The behavior of the pitch once normalized by normalization. The normalized features are then rearranged to
speaker is very similar if the intonation is comparable. Nev- match the speech samples dimensions used in training and the
ertheless, the other features that are fed to the network (see speaker embedding is added as an independent input to the sys-
next section) resulted in very similar normalizations for both tem, as mentioned earlier.
speaker-independent and speaker-dependent approaches. The learning strategy was to train each of the models derived
from the previous proposals with mini-batch stochastic gradient
2.3. Look ahead descent (SGD) using a mini-batch size of 128 and minimizing
the negative log-likelihood (NLL). The chosen optimizer is the
In the modeling of non-real-time sequences such as the genera- adaptive moment estimation (ADAM) [15] for its effectiveness
tion of speech in a TTS system, the features that will be fed to in many problems and ease of use. It is an SGD algorithm with
the network are known beforehand. This means that, in contrast adaptive learning rate, having an initial value of 10−4 , which
with a possible phone call where both ends are talking at real we enhance for our task with an external rate controller known
time, the features that will condition the sequence at future time as scheduler. This had two milestones at epochs 15 and 35. In
steps are always known and thus can be used to better model the each of these milestones, the learning rate is scaled down by
generated signal. a factor 0.1, which counterattacks the sudden changes in the
With this idea in mind, the causality that speech synthesizers loss curve that shows up at first epochs. Weight normalization
inherited from the vocoders used in decoding is questioned and [16] is also used in the 1D-convolutional layers to speed-up the
both the current and future windows of features are fed to the convergence of the model.
network. This results in a larger model because the number of
features is duplicated at each time step but also achieves better
3.3. Subjective evaluation
quality without the need of more features.
Note that the look ahead approach modifies the architecture be- As this is a generative task that involves synthesized nuances
cause the upper right 1D-convolution block doubles its input in the speech that are difficult to evaluate with any objective
size (the original value of 43 is crossed out in the figure and re- metric, a mean opinion score (MOS) test is conducted. The
placed by 86 to accept both the features of the current and future MOS is a rating of the naturalness of the speech signal with an
frames). integer scale ranging from 1 to 5. The meaning of each scale
value is translated as Excellent (5), Good (4), Fair (3), Poor (2),
and Bad (1).
3. Experimental setup
In total 4 systems are evaluated combining both proposed im-
In this section we characterize the experimental conditions to provements with all possibilities: (1) speaker dependent nor-
evaluate the previous approaches. First we describe the speech malization and look-ahead (Spk-D + LA); (2) speaker indepen-
data used to estimate the models. Then, the acoustic parameters, dent normalization and look-ahead (Spk-Ind + LA); (3) speaker
architecture and learning hyperparameters are outlined. Finally, dependent normalization (Spk-D); and (4) only speaker inde-
the methodology used to evaluate the system is described. pendent normalization (Spk-Ind). Hence to perform the test 25
subjects were asked to rate each of the 4 proposed systems un-
3.1. Dataset der a set of 8 test utterances, one per modelled speaker (4 males
and 4 females). In total 32 systems were prompted to be rated
The speech dataset used in the experiments is formed by six per listener, and they could listen the different systems as many
Spanish voices from the TC-STAR project [13], where half of times as required to compare and rate them. For each sentence,
them are males and the other half are females. The database the transcription of the audio was provided to ease the listening,
was unbalanced with one of the female speakers barely hav- and the audios of each of the different systems synthesizing the
ing a quarter of speech recording time compared to the others. same sentence, were disposed side by side to compare, having a
Notwithstanding some works like [14] recommend balancing random order per utterance (i.e. the system identity was hidden
the data per user so that all speakers have approximately the and mixed among the different utterances).

Figure 3: Conditioned SampleRNN with multi-speaker adaptation. (Block diagram of the three-tier architecture: Tier 3 combines 80-sample waveform frames, the Ahocoder feature windows of the current and future frames (43-dimensional, 86 when the look ahead is used), and a learned speaker embedding (6 embeddings of size 6) through Conv1D, GRU and transposed-convolution blocks of size 1024; Tier 2 processes the last 20 samples through analogous Conv1D/GRU/ConvT1D blocks; Tier 1 is the sample-level module.)

Table 1: MOS results comparing the 4 proposed systems: (Spk-D + LA), (Spk-Ind + LA), (Spk-D) and (Spk-Ind). For each score, higher is better.

System    Spk-D   Spk-Ind   Spk-D + LA   Spk-Ind + LA
Female    3.3     3.3       3.8          3.6
Male      3.6     3.6       4.0          3.8
Total     3.5     3.5       3.9          3.8

4. Results

The test results are shown in Table 1, obtained by averaging the evaluations of the 25 listeners for every system. Some of the comments written by the subjects who took the test highlighted the difference in quality between male and female voices. This is the reason why Table 1 separates the genders, showing that the best results are indeed obtained with male voices. This is attributed to the unbalanced speech database, which contained one female speaker with only 25% of the recordings compared to all other speakers. Seemingly, this affected the modeling of female voices, which were noisier. Overall, we can observe that the best scores are achieved when we combine both proposals in the (Spk-D + LA) system.
The samples belonging to the best system, combining both proposals, can be heard on the GitHub page of the main author¹, where the code is also publicly available.

¹ https://github.com/Barbany/Multi-speaker-Neural-Vocoder

5. Conclusions

Speaker-dependent normalization achieves substantially better results in naturalness for the generation of speech modeled with balanced datasets when combined with the look ahead approach. Regarding the second proposal, it outperforms the previously achieved scores and reaches state-of-the-art results when combined with the speaker-dependent normalization. Speaker-dependent normalization was not enough for voice conversion purposes, so more complex architectures were proposed in [2]. Nevertheless, human listeners preferred the speech modeled with speaker-dependent normalization and, given the similarity of the features normalized for each speaker, a better quantization could be applied for coding applications or for the deployment of neural networks with limited resources.
Whilst the speaker-dependent normalization by itself does not seem to improve the results obtained with the classical speaker-independent feature scaling, when combined with the look ahead approach it achieves the best score with the balanced dataset. In summary, with the combination of the two proposals, a state-of-the-art MOS score has been achieved for a multi-speaker speech synthesis system. Both of these approaches are novelties introduced in this work, and the results show that they can be potentially beneficial to other TTS systems, as well as to a range of other applications involving features from different sources and the modeling of non-real-time sequences.

6. Acknowledgements

This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).

7. References
[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu,
“WaveNet: A Generative Model for Raw Audio,” ICASSP,
IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, pp. 1–15, 2016. [Online]. Available:
http://arxiv.org/abs/1609.03499
[2] O. Barbany Mayor, “Multi-Speaker Neural Vocoder,” Bachelor’s
thesis, Universitat Politècnica de Catalunya, 2018.
[3] A. Bonafonte, S. Pascual, and G. Dorca, “Spanish statisti-
cal parametric speech synthesis using a neural vocoder,” Proc.
Interspeech, pp. 1998–2001, 2018.
[4] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,
A. Courville, and Y. Bengio, “SampleRNN: An Unconditional
End-to-End Neural Audio Generation Model,” ICLR, pp. 1–11,
2017. [Online]. Available: http://arxiv.org/abs/1612.07837
[5] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Unsupervised
speaker adaptation for DNN-based TTS synthesis,” ICASSP,
IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, pp. 4475–4479, 2015.
[6] S. Pascual and A. Bonafonte, “Multi-output RNN-LSTM for
multiple speaker speech synthesis and adaptation,” in 24th
European Signal Processing Conference, EUSIPCO 2016,
Budapest, Hungary, August 29 - September 2, 2016, 2016,
pp. 2325–2329. [Online]. Available: https://doi.org/10.1109/
EUSIPCO.2016.7760664
[7] T. Toda, L. H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu,
and J. Yamagishi, “The voice conversion challenge 2016,” Proc.
Interspeech, vol. 08-12-Sept, pp. 1632–1636, 2016.
[8] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J.
Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio,
Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A.
Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis
model,” CoRR, vol. abs/1703.10135, 2017. [Online]. Available:
http://arxiv.org/abs/1703.10135
[9] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural
Discrete Representation Learning,” in NIPS, 2017. [Online].
Available: http://arxiv.org/abs/1711.00937
[10] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau,
F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase
Representations using RNN Encoder-Decoder for Statistical
Machine Translation,” CoRR, 2014. [Online]. Available: http:
//arxiv.org/abs/1406.1078
[11] ITU-T. Recommendation G. 711, “Pulse Code Modulation (PCM)
of voice frequencies,” 1988.
[12] D. Erro, I. Sainz, E. Navas, and I. Hernaez, “Harmonics plus
Noise Model based Vocoder for Statistical Parametric Speech
Synthesis,” IEEE Journal on Selected Topics in Signal Processing,
vol. 8, no. 2, pp. 184–194, 2014.
[13] A. Bonafonte, H. Höge, I. Kiss, A. Moreno, U. Ziegenhain,
H. V. D. Heuvel, H. Hain, X. S. Wang, and M. N. Garcia, “TC-
STAR : Specifications of Language Resources and Evaluation for
Speech Synthesis,” Proceedings of the Language Resources and
Evaluation Conference LREC06, pp. 311–314, 2006.
[14] S. Pascual de la Puente, “Deep learning applied to speech synthe-
sis,” Master’s thesis, Universitat Politècnica de Catalunya, 2016.
[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” CoRR, vol. abs/1412.6980, 2014. [Online].
Available: http://arxiv.org/abs/1412.6980
[16] T. Salimans and D. P. Kingma, “Weight normalization:
A simple reparameterization to accelerate training of deep
neural networks,” CoRR, vol. abs/1602.07868, 2016. [Online].
Available: http://arxiv.org/abs/1602.07868

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Improving Automatic Speech Recognition through the Improvement of Language Models
Andrés Piñeiro-Martı́n, Carmen Garcia-Mateo, Laura Docio-Fernandez

atlanTTic Research Center, Multimedia Technologies Group, University of Vigo, Spain


apinheiro@gts.uvigo.es, carmen.garcia@uvigo.es, ldocio@gts.uvigo.es

Abstract work topologies such as LSTM (Long Short-Term Memory) [7]


have also been applied. We should note that models of this type
Language models are one of the pillars on which the perfor-
are far more complex than the n-gram model, and hence it takes
mance of automatic speech recognition systems are based. Sta-
more time to create them and they require more training data.
tistical language models that use word sequence probabilities
Furthermore, these models cannot be used straight-
(n-grams) are the most common, although deep neural networks
forwardly in decoding, due to the large amount of computa-
are also now beginning to be applied here. This is possible due
tional resources needed, so the usual approach to their appli-
to the increases in computation power and improvements in al-
cation is to perform a rescoring stage on a previously obtained
gorithms. In this paper, the impact that language models have
word lattice using a n-gram language model. There are sev-
on the results of recognition is addressed in the following sit-
eral algorithms that can implement this rescoring, such as those
uations: 1) when they are adjusted to the work environment of
cited in [8], [9] and [10].
the final application, and 2) when their complexity grows due to
increases in the order of the n-gram models or by the applica- The current paper extends the work described in our previ-
tion of deep neural networks. Specifically, an automatic speech ous study [11]. The impact on ASR performance in that study
recognition system with different language models is applied was analyzed when the quality of the text data corpora used for
to audio recordings, these corresponding to three experimental training the language models was increased and improved. The
frameworks: formal orality, talk on newscasts, and TED talks present paper looks specifically at the effect of enlarging the
in Galician. Experimental results showed that improving the complexity of the models, increasing the order of the statistical
quality of language models yields improvements in recognition models, and applying deep neural networks.
performance. In addition, it should be noted that when working with mi-
Index Terms: automatic speech recognition, language model, nority languages such as Galician, obtaining the necessary data
deep neural networks. to train language models can itself be difficult. Therefore, this
study also analyses how to use a modern system when working
with languages for which the available data is limited.
1. Introduction The paper is organized as it follows: in Section 2 the ASR
Automatic Speech Recognition (ASR) is essential in applica- system is described. Section 3 then presents the experimental
tions where interaction with the user is through spoken com- framework. Section 4 reports the experimental results. Section
munication, so that this can remain natural and effective. Such 5 provides a discussion of these, and finally Section 6 offers
systems base their performance on acoustic models (AM) and some final conclusions and suggestions for further lines of re-
language models (LM) that allow for the representation of the search.
statistical properties of speech and language.
These days, with increases in computing power, the use 2. Description of the ASR system
of deep neural networks has spread to many fields. The cur-
rent ASR systems are predominantly based on acoustic Hybrid The ASR system was built using the Kaldi toolkit [12]. The
Deep Neural Network–Hidden Markov Models (DNN-HMMs) acoustic models use a hybrid DNN-HMM modeling strategy
[1] and the n-gram language model [2] [3]. However, in re- with a neural network based on Dan Povey’s implementa-
cent years, Neural Network Language Models (NNLM) have tion in Kaldi [13]. This implementation uses a multi-spliced
begun to be applied [4] [5] [6]. In these, words are embed- TDNN (Time Delay Neural Network) feed-forward architecture
ded in a continuous space, in an attempt to map the semantic to model long-term temporal dependencies and short-term voice
and grammatical information present in the training data, and in characteristics. The inputs to the network are 40 Mel-frequency
this way to achieve better generalization than n-gram models. cepstral coefficients extracted in the Feature Extraction block
The arrival of GPUs, multi-core GPUs and increases in com- with a sampling frequency of 16 KHz. In each frame, we ag-
puting power have made it possible to apply deep neural net- gregate a 100-dimensional iVector to a 40-dimensional MFCC
works (DNNs) with multiple hidden layers, which are able to input.
capture high-level, discriminating information about input fea- The topology of this network consists of an input layer fol-
tures. The depth of the created network (the number of hidden lowed by 5 hidden layers with 1024 neurons with RELU activa-
layers), together with the ability to model a large number of tion function. Asymmetric input contexts were used, with more
context-dependent states, results in a reduction in Word Error context in the background, which reduces the latency of the neu-
Rate (WER). The type of neural networks most often used in ral network in on-line decoding, and also because it seems to be
language modelling are recurrent neural networks (RNNLMs). more efficient from a WER perspective. Asymmetric contexts
The recurrent connections present in these networks allow the of 13 frames were used in the past, and 9 frames in the future.
modelling of long-range dependencies that improve the results Figure 1 shows the topology used and Table 1 the layerwise
obtained by n-gram models. In more recent work, recurrent net- context specification corresponding to this TDNN.

Figure 1: TDNN used in the acoustic models [13].

Table 1: Context specification of the TDNN in Figure 1.

Layer   Input context
1       [-2,+2]
2       [-1,2]
3       [-3,3]
4       [-7,2]
5       {0}

The network has been trained with material corresponding to TC-STAR [14], with 79 hours of speech in Spanish, and to Transcrigal [15], with 30 hours of speech in Galician.
In terms of the language models, when working with n-gram LMs the models were trained using the SRI Language Modeling Toolkit. N-gram models of order 3 and 4 were used, that is, trigrams and tetragrams. The modified Kneser-Ney discounting of Chen and Goodman was also applied, together with interpolation with the lower-order models [16].
For training the RNNLMs, the Kaldi RNNLM [12] software was used. The neural network language model is based on an RNN with 5 hidden layers and 800 neurons, in which TDNN layers with RELU activation functions and LSTM layers are combined. The training is performed using Stochastic Gradient Descent (SGD) over several epochs (in our case, 20 epochs). All RNNLM models have been trained on the same material as the n-gram statistical models. Figure 2 shows the topology of the network used.

Figure 2: RNNLM topology used in the ASR system (alternating relu-renorm TDNN layers and fast-lstm layers of 800 neurons between the input and the output).
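As an illustration of the n-gram setup described above, and of the model-mixing strategy used for the combined LMs in the next section, the SRILM commands can be driven as follows; file names and the 0.5 mixture weight are placeholders, and the exact flag set may differ between SRILM versions.

import subprocess

def train_ngram(corpus, order, out_lm):
    # Modified Kneser-Ney discounting with interpolation of lower orders.
    subprocess.run(["ngram-count", "-order", str(order), "-kndiscount",
                    "-interpolate", "-text", corpus, "-lm", out_lm], check=True)

def mix_two_lms(lm_a, lm_b, out_lm, order=3, lam=0.5):
    # Interpolate two already-trained models (the CLM2-style combination).
    subprocess.run(["ngram", "-order", str(order), "-lm", lm_a,
                    "-mix-lm", lm_b, "-lambda", str(lam),
                    "-write-lm", out_lm], check=True)

train_ngram("duvi.txt", 3, "duvi_3gram.arpa")
train_ngram("corga.txt", 3, "corga_3gram.arpa")
mix_two_lms("duvi_3gram.arpa", "corga_3gram.arpa", "clm_mix_3gram.arpa")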
3. Experimental framework

This section briefly describes both the text corpora used to train the LMs and the testing data used in the experiments.

3.1. Text corpora

Four text corpora were used (more detailed information in [11]):
• DUVI: texts from the DUVI (Diario da Universidade de Vigo). This is a small corpus, but it has very clean and representative text, including a large number of current words and expressions.
• epub library: a set of 5,000 Spanish novels translated into Galician using an automatic translator [17]. It should be noted that the results obtained using the LM trained with this text may contain systematic errors due to these Spanish-Galician translations. One of the objectives in creating this corpus arises from the problem of finding large text corpora in minority languages: the results obtained illustrate the impact of using an LM with text translated from a language for which large quantities of text can be found.
• Ghoxe, Eroski and Vieiros newspapers: text from the Galicia Hoxe digital newspaper, published in Santiago de Compostela; text from the Eroski news page, which contains news of a varied nature; and Vieiros, another digital newspaper published entirely in Galician. It is an extensive corpus, with clean and representative material, in that it contains news on a variety of themes.
• CORGA: texts from the Corpus de Referencia del Gallego Actual (CORGA), composed of different representative texts from books, newspapers, magazines, plays, audiovisual material and blogs. It is a medium-sized corpus with very carefully prepared and representative kinds of texts.
From the above text corpora, four LMs have been trained through a mixture of single models. The mixtures carried out were the following:
• CLM1: language model trained with the texts from DUVI, the Ghoxe, Eroski and Vieiros newspapers, and CORGA.
• CLM2: directly combining the already-trained language models based on DUVI, the Ghoxe, Eroski and Vieiros newspapers, and CORGA into a new language model. By doing this, we seek to analyze the differences in results between training new models with the whole text and mixing the already-trained models.
• CLM3: combining the single language models based on the epub library and CORGA.
• CLM4: combining the single language models based on DUVI, the epub library, the Ghoxe, Eroski and Vieiros newspapers, and CORGA.
Table 2 shows the main characteristics of these combined models.

3.2. Testing data

Tests were made on three audio corpora with different characteristics.
1. First Corpus: Formal Orality. A corpus with audio recordings of formal orality corresponding to literary

Table 2: Characteristics of the combined language models.

        No. INV words   Training text size (M of words)   OOV words
CLM1    720,000         118                                2.5 %
CLM2    730,000         118                                2.5 %
CLM3    630,000         352                                2.8 %
CLM4    900,000         435                                2.3 %

readings and orally produced and read speeches. It consists of 30 files with an average duration of 3:50 minutes per recording and a total duration of approximately 115 minutes (about 2 hours).
2. Second Corpus: Speech in newscasts. A corpus with audio recordings of television newscasts from TVG (Televisión de Galicia). They present a mixture of spontaneous and planned or read speech, but with more contemporary themes and vocabulary than the first corpus. It consists of 10 files with an average duration of 34 minutes per recording and a total duration of 340 minutes (5 hours and 40 minutes).
3. Third Corpus: Speech in TED Talks. A corpus with audio recordings from TED Talks in Galician [18]. They present planned but not read speech, of a spontaneous nature. It consists of 10 files with an average duration of 16 minutes per recording and a total duration of 163 minutes (2 hours and 43 minutes).

4. Experimental results

Three experiments were carried out to assess the impact of the different language models on ASR performance:
• Experiment 1: recognition using a single-pass decoding strategy with the 3-gram LMs of Table 2.
• Experiment 2: rescoring with 4-gram language models.
• Experiment 3: rescoring with RNNLMs.

4.1. Experiment 1

The mixture of models described in Section 3.1 has been tested on each of the corpora described in Section 3.2. The results can be seen in Table 3. The SLM column shows the best result

In view of these findings, it can be concluded that model combinations do not significantly reduce the averaged WER, but do lead to an improvement in the confidence interval. On average, these combined models present more robust results than any of the single models.
It is interesting to compare the results obtained by CLM1 and CLM2. We recall that CLM1 is trained by combining all the texts, whereas in CLM2 the previously-trained models are mixed. Table 3 shows that the average WER values are lower for CLM2, and therefore it is better to combine previously-trained single language models.

4.2. Experiment 2

In this experiment, a tetragram rescoring of the lattices obtained in the previous experiment is performed. Table 4 shows the average WER together with the 95% confidence interval of the rescoring results.

Table 4: Tetragram language model rescoring results.

                          SLM      CLM1     CLM2     CLM3     CLM4
First Corpus    WER%      17.60    17.51    17.52    16.72    17.01
                CI-95%    ±1.84    ±1.75    ±1.72    ±1.76    ±1.75
Second Corpus   WER%      22.80    21.46    21.40    22.64    21.79
                CI-95%    ±2.65    ±2.63    ±2.61    ±2.68    ±2.74
Third Corpus    WER%      19.45    18.72    18.52    18.45    17.97
                CI-95%    ±2.62    ±2.46    ±2.56    ±2.68    ±2.60
Average WER in
analysis corpora          19.95    19.23    19.14    19.27    18.92

The average WER and the confidence intervals for the three corpora analyzed are reduced. In the first corpus the average WER is reduced by approximately 1% (in absolute terms), reaching a value of 16.72%. The reduction is similar in the second corpus, going from 22.80% to 21.40% WER. In the third corpus it goes from 19.45% to 17.97%.
It is also interesting to compare the average WER for the three corpora shown in Table 4. The lowest average value is obtained by CLM4, the model that combines the greatest number of single LMs and the one with the fewest out-of-vocabulary (OOV) words. It also shows how all the combinations of models obtain lower WER results than single models, that is, when applying the rescoring of tetragrams, the CLMs are clearly supe-
obtained by a single LM, that is, without mixing LMs. rior, being more robust in confidence interval and with a lower
average WER.
Table 3: Combined Language Models Results
4.3. Experiment 3
SLM CLM1 CLM2 CLM3 CLM4
In this last experiment a rescoring with the RNNLM is applied
First WER% 21.02 17.61 17.51 17.55 18.14
Corpus CI-95% to the lattices obtained by the decoding of experiment 2, that is,
± 1.84 ± 1.76 ± 1.71 ± 1.78 ± 1.77
a rescoring RNNLM is applied to the lattice resulting from the
Second WER% 21.39 21.56 21.52 23.46 22.86
rescoring of tetragrams. For this, RNNLMs have been trained
Corpus CI-95% ± 2.63 ± 2.58 ± 2.55 ± 2.78 ± 2.70
using the same text as in the previous experiment. Table 5
Third WER% 19.07 18.77 18.68 19.48 19.18 shows the results.
Corpus CI-95% ± 2.24 ± 2.44 ± 2.57 ± 2.70 ± 2.81 In the first corpus all language models reduce the aver-
age WER obtained. An absolute reduction of up to 1.5% is
achieved, reaching the value of 16.05% of average WER. How-
Table 3 shows that for the first and second corpus it is not ever, in the second and third corpus, applying the rescoring with
possible to reduce the average WER with any of the LM com- RNNLMs does not reduce the WER obtained for all LMs.
binations, but it is possible to reduce the confidence interval of The average WER in the three analyzed corpora (final row
the results. Only for the third corpus was the average WER in Table 5) is slightly reduced compared to the values obtained
obtained slightly reduced, improving the results of single LMs. in experiment 3, achieving a value of 18.83% when using the

37
Table 5: RNNLM Rescoring Results 5.2. Use of limited data in a modern system

SLM CLM1 CLM2 CLM3 CLM4 To train the models that the ASR systems uses, large corpora of
audio (for the acoustic model) and text (for the language model)
First WER% 16.05 16.82 16.63 16.52 16.57
Corpus CI-95% are necessary. In both cases, the type of data that is collected
± 1.69 ± 1.74 ± 1.76 ± 1.77 ± 1.80
must be representative of the speech that one wants to recog-
Second WER% 22.98 21.61 21.37 23.44 21.39
nize.
Corpus CI-95% ± 2.62 ± 2.67 ± 2.58 ± 2.83 ± 2.73
Obtaining a large corpus of audio recordings, with their cor-
Third WER% 19.27 19.07 18.73 18.82 18.52
responding transcriptions, and which are representative of the
Corpus CI-95% ± 2.94 ± 3.01 ± 2.95 ± 3.27 ± 2.95 speech, is not easy when working with minority or less well-
Average WER in
19.43 19.16 18.91 19.59 18.83
resourced languages. Large databases with this information are
analysis corpora simply not available. Although there may be TV or radio sta-
tions that broadcast in the language, it is difficult to obtain such
audio material with accurate transcriptions, and they often lack
variety in terms of speech types.
CLM4. We can conclude that the best strategy to reduce WER This study has shown that one solution to deal with the
has been to combine the language models that have provided shortage of resources in acoustic modeling is to use data from
the best results in the first experiment. The combined models languages with similar phonetics. In our case, looking at Gali-
increase vocabulary size, provide more training texts and there- cian, the acoustic models have been trained using data from
fore reduce the OOV words, while also providing a greater ro- Spanish, multiplying by 4 the amount of information. Spanish,
bustness against variation in speech. being a widely spoken language, has the necessary resources
to obtain correctly transcribed audio corpus. Therefore, in our
5. Discussion acoustic modeling, approximately 70% of data has been used in
Spanish (over 79 hours of Spanish speaking). The other 30%
This section offers a discussion of the WER results and the use corresponds to more than 30 hours of Galician.
of data in a modern system when working with a minority lan- For language models, a text corpus was created using data
guage. obtained from: 1) different magazines and newspapers pub-
lished in the language; 2) downloading the information present
5.1. WER results in the Galician version of Wikipedia; 3) information obtained
The reduction of the average WER achieved in the first corpus from small text corpora. In order to increase the amount of data
is greater than that achieved in the second and third. Such a dif- available, another solution was to obtain a large corpus of data
ference in behavior between the first corpus and the other two in another language, and translate it into Galician using a free
may be due to the different character of the linguistic samples automatic translation tool.
[11]. The first corpus is composed mainly of read texts (writ-
ten language), while the second corpus presents a high num- 6. Conclusions
ber of speakers, a heterogeneous mixture of speech types (read
The results obtained for Galician ASR are promising. Im-
language, statements by different speakers, situations including
proving the training text of the language models and applying
noise, music, a mixture of Galician and Spanish, among others).
RNNLM in decoding resulted in reducing the average WER ob-
To see how such a heterogeneous mixture of speech affects
tained. However, it has also been shown that increasing the
the results, all interviews were removed from the recordings,
complexity of the system leads to more training data. The strate-
leaving only the speech of presenters and reporters. The results
gies applied to work with minority and less well- resourced lan-
show an absolute reduction of more than 7% compared to the
guages have also contributed to the positive results in recogni-
best case for this corpus, obtaining an average WER of 13.88%.
tion.
Therefore, the speech type of this corpus clearly does affect the
results here. As a future line of research, we plan to improve the acoustic
A detailed analysis of the recognition errors can lead to a and language models of the ASR, as well as to use more efficient
further reduction of the average WER. Yet it must be taken into algorithms in the decoding stage.
account that some of the errors, at least in the oral corpus (not
read), must be assumed to be inevitable. These errors reflect 7. Acknowledgements
the doubts and errors in speakers’ pronunciation, deviations in
This work has received financial support from the Spanish
forms, etc. They might appear as errors in the transcription on
Ministerio de Economı́a y Competitividad through project
which the calculation of the WER is based, but in which the
’TraceThem’ (TEC2015-65345-P), from the Xunta de Galicia
recognition is in fact successful.
(Agrupación Estratéxica Consolidada de Galicia accreditation
Finally, in order to check how far we can get with the cur- 2016-2019, Galician Research Network TecAnDaLi ED431D
rent training data, a new RNNLM model was trained, introduc- 2016/011) and the European Union (European Regional Devel-
ing the transcripts of the analyzed corpora into the development opment Fund – ERDF). Our gratitude to the Ramon Piñeiro
text. With this, we sought to model the network so that it was Institute of the Xunta de Galicia for allowing the use of the
specifically prepared to recognize the corpus on which it was CORGA material and for its collaboration in the labeling of the
going to be tested. Of course, applying this technique is only second and third corpora.
possible when the correct transcripts are available. The results
show an absolute WER reduction of 0.2%, that is, a small and
not significant improvement. This result leads us to conclude 8. References
that it is difficult to continue reducing the WER with this train- [1] G. Hinton, L. Deng, D. Yu, and Y. Wang. 2012. Deep Neural Net-
ing data and with these algorithms. works for Acoustic Modeling in Speech Recognition. IEEE Signal

38
Processing Magazine, vol. 9, no. 3, pp. 82-97.
[2] J. T. Goodman. 2001. A bit of progress in language modeling.
Computer Speech and Language, vol. 15, no. 4, pages 403–434.
[3] D. Jurafsky and J.H. Martin. 2008. Speech and Language Pro-
cessing: An Introduction to Language Processing, Computational
Linguistics, and Speech Recognition.
[4] T. Mikilov, S. Kombrink, A. Deoras, L. Bruget, and J. Cer-
nicky. 2011. RNNLM-recurrent neural network language model-
ing toolkit, in Proc. of the 2011 ASRU Workshop, pages 196-201.
[5] E. Arisoy, T.N. Sainath, B. Kingsbury, and B. Ramabhadran.
2012. Deep Neural Network Language Models. In NAACL-HLT
Workshop on the Future of Language Modeling for HLT, pages
20-28, Stroudsburg, PA, USA. Association for Computational
LInguistics.
[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Juavi. 2003. A neural
probabilistic language model. Journal of Machine Learning Re-
search, 3:1137-1155.
[7] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y.
Wu. 2016. Exploring the limits of language modeling. In arXiv
preprint arXiv:1602.02410.
[8] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D.
Povey, and S. Khudanpur. 2018. A pruned rnnlm lattice-rescoring
algorithm for automatic speech recognition, In ICASSP.
[9] M. Sundermeyer, Z. Tuske, R. Schluter, and H. Ney. 2014. Lat-
tice decoding and rescoring with long-span neural network lan-
guage models. En Fifteenth Annual Conference of the Interna-
tional Speech Communication Association.
[10] X. Chen, X. Liu, A. Ragni, Y. Wang and M. Gales. 2017. Future
word contexts in neural network language models. ArXiv preprint
arXiv:170805592.
[11] A. Piñeiro, C. Garcı́a and L. Docı́o. 2018. Estudio sobre el im-
pacto del corpus de entrenamiento del modelo de lenguaje en las
prestaciones de un reconocedor de habla. Sociedad Española para
el Procesamiento del Lenguaje Natural.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlı́cek, Y. Quian, P. Schwarz, J.
Silovský, G. Stemmer, and K. Veselý. 2011. The Kaldi Speech
Recognition Toolkit. In ASRU.
[13] V. Peddinti, D. Povey and S. Khudanpur. 2015. A time delay neu-
ral network architecture for efficient modeling of long temporal
contexts. In Proceedings of INTERSPEECH 2015.
[14] L. Docı́o, A. Cardenal and C. Garcı́a. 2006. TC-STAR 2006
automatic speech recognition evaluation: The uvigo system. In
Proc. Of TC-STAR Workshop on Speech-to-Speech Translation,
ELRA, Parı́s, France.
[15] C. Garcı́a, J. Tirado, L. Docı́o and A. Cardenal. 2004. Transcrigal:
A bilingual system for automatic indexing of broadcast news. In
IV International Conference on Language Resources and Evalua-
tion.
[16] A. Stolcke. 2002. SRILM – An extensible language modeling
toolkit. Proceedings of the International Conference on Statisti-
cal Language Processing, Denver, Colorado.
[17] I. Alegrı́a, I. Arantzabal, M. Forcada, X. Gómez, L. Padró, J.R.
Pichel, and J. Waliño. 2006. OpenTrad: Traducción automática de
código abierto para las lenguas del estado Español. Procesamiento
del Lenguaje Natural.
[18] TEDxGalicia. x=independent organized TED event.
http://www.tedxgalicia.com/

39
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Towards expressive prosody generation in TTS for reading aloud applications


Mónica Domı́nguez1 , Alicia Burga1 , Mireia Farrús1 , Leo Wanner2,1

Universitat Pompeu Fabra, Barcelona, Spain


1
2
Catalan Institue for Research and Advanced Studies (ICREA), Barcelona, Spain
monica.dominguez@upf.edu, alicia.burga@upf.edu, mireia.farrus@upf.edu, leo.wanner@upf.edu

Abstract web-retrieved texts with prosodic cues in the context of the


Knowledge-Based Information Agent with Social Competence
Conversational interfaces involving text-to-speech (TTS) ap- and Human Interaction Capabilities (KRISTINA [3]) in its role
plications have improved expressiveness and overall natural- as reader. The cues are derived from the Information Structure,
ness to a reasonable extent in the last decades. Conversa- which is derived automatically for the discourse that is to be
tional features, such as speech acts, affective states and infor- read – based on studies which show that when prosody reflects
mation structure have been instrumental to derive more expres- the Information Structure of a sentence, the overall comprehen-
sive prosodic contours. However, synthetic speech is still per- sion of the message improves; see, e.g. [5] for German.
ceived as monotonous, when a text that lacks those conversa- In the early 2000ies, there were some attempts to introduce
tional features is read aloud in the interface, i.e. it is fed directly some basic concepts of Information Structure in TTS applica-
to the TTS application. In this paper, we propose a methodol- tions, in particular, thematicity, understood as the partition of a
ogy for pre-processing raw texts before they arrive to the TTS sentence into theme (i.e., what the sentence in about) and rheme
application. The aim is to analyze syntactic and information (or (i.e., what is said about the theme); see [6, 7] among others.
communicative) structure, and then use the high-level linguistic However, a binary flat representation of thematicity of this kind
features derived from the analysis to generate more expressive has been proved to be insufficient to describe long complex sen-
prosody in the synthesized speech. The proposed methodology tences, whereas the hierarchical tripartite approach proposed in
encompasses a pipeline of four modules: (1) a tokenizer, (2) a [8] within the Meaning–Text Theory yields a better correspon-
syntactic parser, (3) a communicative parser, and (3) an SSML dence to prosodic patterns as shown in [9, 10].
prosody tag converter. The implementation has been tested in We introduce an experimental setup that targets reading
an experimental setting for German, using web-retrieved arti- aloud of a newspaper article in German. The setup is based
cles. on the sentence as the linguistic referent unit and on a formal
Index Terms: speech synthesis, thematicity, prosody, commu- representation of thematicity that follows the principles of the
nicative structure, information structure. Meaning–Text Theory. The implementation of the proposed
thematicity-based prosody enrichment encompasses four mod-
1. Introduction ules, one of which is off-the-shelf (i.e., the syntactic parser)
and the rest have been developed by the authors: (1) a tok-
The advances in the development of Embodied Conversational
enizer, which prepares the input needed for the parser; (2) a
Avatars (ECAs) in the last decades have allowed the exploration
communicative parser, which derives thematicity spans from
of increasingly complex communicative strategies in the field
morpho-syntactic features; and (3) an SSML prosody tag con-
natural language interaction. Contextual information (such as
verter, which assigns a variety of SSML prosody tags based on
profile information with respect to culture, age and even emo-
the thematicity structure of each sentence.
tional state) is being used in the process of dialogue manage-
The rest of the paper is structured as follows: Section 2
ment, reasoning and expressive speech generation for the in-
motivates this work, sets the context of the problem and refers
troduction of ECAs in social settings, for instance as social
to previous work in this area. In Section 3, we present the
companions for children or elderly, or as health-care assistants
proposed methodology to automatically derive communicative
[1, 2, 3]. In this context, the expressiveness of the generated
structure from text and, thereupon, to generate prosodic con-
speech plays an important role, which tends to be overlooked.
tours using SSML tags. Section 4 introduces a sample imple-
Such expressiveness is mostly conveyed by means of prosody
mentation using a female German voice in MaryTTS. The out-
and usually involves an intertwined set of linguistic and para-
put of this implementation is evaluated by means of a perception
linguistic functions that are difficult to grasp in a computational
test in Section 5. Finally, conclusions are drawn in Section 6.
model.
Recent studies have demonstrated that high-level linguis-
tic structures such as information (or communicative) structure, 2. Motivation and Background
produced by the natural language generation (NLG) module in Different linguistic schools have long stated that Information
the course of sentence or text generation, are instrumental in the Structure (IS), and, in particular, the dichotomy referred to
derivation of a more varied prosody in text-to-speech (TTS) ap- as theme–rheme [11], given–new [12], or topic–focus [13] is
plications [4]. When such features are not available, synthetic related to intonation.1 Moreover, prosody structure on the
speech is perceived as monotonous, especially in long mono- grounds of thematicity partitions plays a key role in the under-
logue discourse as observed in reading aloud exercises. In this standing of a message [14]. Empirical studies in different lan-
case, raw text material (for instance, web-retrieved information) guages provide evidence that when thematicity and prosody are
is passed as input to the TTS without any contextual or linguis-
tic information. 1 In our work, we use the first denotation, i.e., theme–rheme or the-
In what follows, we propose a methodology for enriching maticity.

40 10.21437/IberSPEECH.2018-9
appropriately correlated with each other, comprehension of the prosodic cues based on the analysis of the available corpus of
message is positively affected (cf., e.g., [5] for German and [15] read speech annotated with hierarchical thematicity (in the NLG
for Catalan). Therefore, there is reason to assume that a conver- module) and prosody has been proved to yield an improvement
sational application considering the notions of content packag- in the perception of expressiveness of the synthesized speech
ing by means of the relation between thematicity and prosody [4]. However, there is still no application that can provide an
will benefit from the same advantages as in natural conversa- automatic derivation of thematicity-based prosody cues for raw
tion environments. Most of all, conversational avatars in ap- texts that arrive to the TTS application, such as the implemen-
plications for children in educational settings [16], applications tation proposed in this paper.
for those with special needs [1] as well as for elderly [2] and,
in particular, for those with cognitive impairments [17], would 3. Methodology
greatly benefit from such a communicatively-oriented improve-
ment. This paper proposes an approach that tests the formal represen-
State-of-the-art conversational applications, in particular tation of information (or communicative) structure proposed by
TTS systems, do not yet include communicative information. Mel’čuk [8] and its correspondence to prosody in the context of
The task is not trivial. It involves, in the first place, a commu- a concept-to-speech (CTS) application, where text coming from
nicative theoretical model, automatic tools to parse the Infor- a web-retrieved service is input to the TTS engine.
mation Structure of a text, and, last but not least, a generative
model of the related prosodic contour. Some preliminary at- 3.1. Objectives
tempts to include thematicity in TTS applications were made
Our work envisages the study of the IS–prosody interface from a
in the past. Consider, for instance, Steedman’s work [6] on
methodological perspective based on a speech synthesis imple-
the correlation of theme and rheme to rising and falling into-
mentation setup. The proposed methodology has the following
nation patterns, which was tested in the Festival speech synthe-
underlying goals:
sizer [18], and the creation of dedicated tags in MaryTTS [19]
for the notions of givenness and contrast [20]. However, these • to provide automatic tools to investigate the effect of
attempts have a major shortcoming in that they use a flat bi- thematicity–prosody correspondence in human-machine
nary thematicity structure, which does not suffice to describe interaction contexts;
the complexity of content packaging, especially in relation to
prosody. Our previous studies (see, e.g. [21]) suggest that hi- • to explore the advantages and limitations of a
erarchical thematicity based on propositions as described by thematicity-based prosody enrichment in speech synthe-
Mel’čuk [8] constructs a more versatile scaffolding for com- sis;
municative modeling of computer interaction with humans. As • to provide a preliminary scaffolding to incrementally
already mentioned above, Mel’čuk’s methodological approach add other communicative dimensions, registers and lan-
has also demonstrated to be instrumental in natural language guages;
generation applications [22, 23].
In contrast to IS models that propose a partition of sentences Such a methodology addresses two main research issues
into a theme and a rheme, Mel’čuk [8] argues in the context of in this field: (i) the lack of implementation settings of the IS–
the Meaning–Text Theory for a tripartite hierarchical division prosody correspondence and (ii) testing of the integration of the
(‘theme’, ‘rheme’, and ‘specifier’ – the element which sets the IS–prosody interface in computational settings.
utterance’s context) within propositions that further permits em-
beddedness of communicative spans; consider (1) for illustra- 3.2. Pipeline
tion of hierarchical thematicity (annotated following the guide-
The proposed pipeline sketched in Figure 1 includes four mod-
lines established in [24]) of the sentence Ever since, the remain-
ules:
ing members have been desperate for the United States to rejoin
this dreadful group. A total of five partitions are identified, in- 1. Tokenizer: it splits the text into sentences and words.
cluding three spans at level 1 (L1), a specifier (SP1), theme (T1) Punctuation marks are also tokenized as required to serve
and rheme (R1), and two embedded spans at level 2 (L2)2 in the as input for the syntactic parser.
rheme, a theme (T1(R1)) and a rheme (R1(R1)).3
2. Syntactic parser: The parser by Bohnet [25, 26] is used.
(1) [Ever since,]SP1 [the remaining members]T1 [have been This parser is trained on the TIGER Penn Treebank [27]
desperate [for the United States]T1(R1) [to rejoin this and outputs a fourteen-columned CONLL file.
dreadful group.]R1(R1)]R1
3. Communicative parser. This rule-based system derives
A hierarchical thematicity structure of this kind has been thematicity labels from syntactic structure. It outputs a
shown to correlate better with ToBI labels than binary flat the- CONLL file with an added column for communicative
maticity [9]. Such a correlation still does not solve the prob- structure (i.e., the output CONLL has fifteen columns).
lem of a one–to-one mapping between a specific intonation la- For now, it only derives hierarchical thematicity labels.
bel (e.g., H*) to a static acoustic parameter (e.g., an increase 4. SSML prosody converter. It converts the thematicity
of 50% in fundamental frequency). A more varied range of spans derived by the communicative parser to SSML
2 Levels are connected to the concept of embeddedness of spans: for spans and assigns a variety of prosody tags to each span.
instance, a main theme (T1 at L1) may be subdivided into further the- This module is based on the tool presented in [28].
maticity spans, which will belong to L2 thematicity.
3 As more than one thematicity span may exist within the same The use of the Speech Synthesis Markup Language (SSML)
proposition, abbreviations include a number (e.g., ‘SP1’) that indicates [29] convention for prosody enrichment, as proposed in [28],
the number of occurrences at each level (e.g., ‘SP2’ would be the second facilitates the integration of the methodology proposed in this
specifier in a specific thematicity level). paper within the context of TTS applications.

41
corpus contains eight texts with a total of 1,418 words.In what
follows, we present the experimental setup with respect to the
prosody enrichment procedure and the IS–prosody correspon-
dence.
The open source software MaryTTS5 [19] was used for
the implementation. The default synthesized speech output has
been enriched using MaryXML prosody specifications6 , which
follow the SSML recommendation7 .
The SSML prosody tags allow control of six optional at-
tributes (overall pitch, pitch contour, pitch range, speech rate,
duration and volume). These attributes can be modified inde-
pendently or in combination. For our implementation, overall
pitch and speech rate were chosen individually and in combina-
tion. Absolute (e.g., ‘+50Hz’ for increasing a specific amount
Figure 1: Communicative generation pipeline.
of hertz (Hz) in F0) and relative values can be used to applied
the modification. An example of a SSML prosody tag for mod-
ification of two prosodic elements is presented below:
Example (1)
<prosody rate=”-10%” pitch=”+20%”>text to be modi-
fied </prosody>
Moreover, the SSML boundary tag that controls the intro-
duction of pauses at a specific location was also used after each
thematicity span. The duration of the break is specified in mil-
liseconds (ms). An example of SSML boundary tag is intro-
duced below:
Example (2)
Text before the break <boundary duration=”100”/>text
after the break.
Figure 2: Example of the output of the communicative parser in
CONLL format annotated with thematicity. The correspondence between thematicity and prosody is
presented as variations from referent prosody tag values in-
volving fundamental frequency (F0) and speech rate (SR) over
3.3. The Communicative Parser thematicity spans (cf. Table 1). We propose testing a varied
range of values generated automatically, against a manual im-
The main contribution of this paper is a rule-based communica- plementation following the findings in [10, 4], where a variety
tive parser for texts in German. In what follows, we sketch the of prosodic cues for each thematicity span is presented based on
core functions and algorithm of the parser.4 corpus analysis.
The parser is implemented as a python script that requires
a CONLL file with the part-of-speech (POS) and dependency F0 SR
syntax analysis per token in each sentence. Clauses are the
T1 +15% -15%
main syntactic cue to detect propositions. Thus, The main al- R1 +10% +10%
gorithm loops over POS and dependency relations columns to SP1 +20% -10%
P +15% -10%
identify complex and coordinated clauses in the first place and
label propositions. Then, thematicity is labeled focusing on the
detection of specifiers and themes, which usually have as syn- Table 1: Referent prosody tag values for L1 thematicity.
tactic correlates frontal modifiers and subjects respectively.
Several functions have been scripted for finding proposi- Table 1 shows the referent modification for theme (T1),
tions, thematicity and annotate them following the guidelines rheme (R1) and specifier (SP1) spans within L1 thematicity.
established in [24]. Those guidelines establish the convention Propositions are defined as clauses that contain a finite verb and
of using square brackets (”[” and ”]”) to establish the beginning they are the referent units for thematicity segmentation. They
and end of a thematicity span and keys (”{” and ”}”) to signal can include L1 and L2 spans and embrace under one commu-
beginning and end of a proposition. The resulting output is a nicative label different types of syntactic relationships, for ex-
CONLL file that has one column at the end with the annotation ample coordination, juxtaposition and subordination. The ref-
of thematicity (cf. Figure 2). erent values assigned to each span are chosen randomly within
a range of plus minus 5 points in each new sentence. Thus,
4. Experimental Setup even though the annotation of thematicity in this experiment
is restricted to the sentence domain, an automatic variation is
A working corpus has been created from web-retrieved text in envisaged to generate a different range of prosodic parameters
German on advice for sleeping routines and local news. The across sentences.
4 The code of both the communicative parser and the thematic- 5 Available at http://mary.dfki.de/
6 http://mary.dfki.de/documentation/maryxml/
ity to SSML module is available in the following repository un-
der a GNU v.3 licence: https://github.com/TalnUPF/ index.html
KRISTthem2prosModule 7 https://www.w3.org/TR/speech-synthesis/

42
5. Evaluation S1 S2 S3 S4 S5 S6 Average
DEF 47% 29% 29% 71% 41% 47% 44%
For the evaluation of automatic assignment of thematicity-based AUT 53% 71% 71% 29% 59% 53% 56%
prosody, a selection of newspaper articles in German has been
done. From those articles, a selection of sentences with differ-
Table 3: Results from the pairwise comparison for the default
ent communicative structures has been made for the perception
and automatic modification
test, as detailed below.
For the evaluation of the thematicity-based prosody en-
richment module, expressiveness was assessed by means of a
perception test using: (1) a Mean Opinion Score (MOS) with 6. Conclusions
a 5-point Likert scale: 1-bad, 2-poor, 3-fair, 4-good, and 5- Given the relevant role of the Information Structure–prosody in-
excellent; and a pairwise comparison. Seventeen participants terface in human communication, it seems reasonable that next
took part in the evaluation, all of them either native speakers of generation conversational agents face new challenges in adopt-
German or proficient speakers. The test was conducted fully in ing communicatively-oriented models. Current speech tech-
German and participants were informed that our goal is to inves- nologies have been oblivious to advances in theoretical fields
tigate if synthesized speech was perceived as better expressing studying this correlation, basically due to the lack of a for-
the communicative content of the sentence taking into account mal representation of the communicative (or information) struc-
prosodic variability. Six sentences were included in the per- ture and limited capabilities of prosody enrichment standards to
ception test representative of different complexity in syntax and achieve variability in implementation settings.
communicative structure: The present study provides a methodology for a more ver-
S1 Warme Fuß und Vollbäder direkt vor dem Schlafengehen satile integration of the IS–prosody interface in TTS for reading
fördern den Nachtschlaf. aloud applications. Such a methodology contributes in several
aspects to the state of the art: (i) a formal description of hi-
S2 Der Begriff der Schlafhygiene bezeichnet Verhal- erarchical thematicity is used; (ii) a communicative parser that
tensweisen, die einen gesunden Schlaf fördern. derives thematicity labels is introduced; and (iii) the prosodic
S3 Dafür sorgen, dass das Schlafzimmer ruhig und dunkel cues are automatically derived and tested in a TTS application.
ist und eine angenehme Temperatur hat. All in all, this study pivots the transition from theoretical work
S4 Landrat Thomas Reumann schlägt vor, den Fi- on the IS–prosody interface to the integration of thematicity-
nanzierungsantrag zu stellen, will aber erst im Haushalt based prosody enrichment to achieve more expressive synthe-
2018 Gelder einstellen. sized speech.
A limitation of the current study is that it only considers
S5 Das funktioniert nur, wenn alle mitmachen.
relative acoustic parameters over rather large text segments.
S6 Im übrigen betonte er, dass der Landkreis nicht allein Key aspects of prosody modeling, like F0 contour generation
sei, sondern Städte und Gemeinden als Partner habe, die in terms of prominence and phrasing remain to be looked into.
den Beschluss mittragen müssten. Future work is, furthermore, aimed at exploring other dimen-
Three samples of each sentences were included in the sions of communicative structure like emphasis and foreground-
MOS test: (1) the default TTS output (DEF); (2) auto- edness within the framework that has been proposed in this pa-
matic thematicity-based modifications (AUT) and (3) manual per.
thematicity-based prosody modifications (MAN). The pairwise
comparison included the default TTS output versus the auto- 7. Acknowledgements
matic thematicity-based prosody modification. A total of fifty-
This work is part of the KRISTINA project, which has received
one answers are considered in the evaluation. Table 2 shows
funding from the European Unions Horizon 2020 Research
results of the MOS test. In all cases, the best scoring sample
and Innovation Programme under the Grant Agreement num-
is the thematicity-based prosody modification (either manual or
ber H2020-RIA-645012. It has been also partly supported by
automatic). This supports the initial hypothesis that thematicity-
the Spanish Ministry of Economy and Competitiveness under
based prosody modifications are perceived as more expressive.
the Maria de Maeztu Units of Excellence Programme (MDM-
In sentences 2, 3, 5 and 6 the best scoring option is the automatic
2015-0502). The third author is partially funded by the Ramón
version, whereas sentences 1 and 4 score best for the manual
y Cajal program.
version of the modification. These results are in line with the
pairwise comparison shown in Table 3, where all choices go for
the thematicity-based modification except for sentence 4.

S1 S2 S3 S4 S5 S6 Average
DEF 2.65 3.35 3.12 2.71 3.18 3.06 3.01
AUT 3.00 3.53 3.41 2.65 3.53 3.71 3.30
MAN 3.12 2.88 3.06 3.00 2.71 2.88 2.94

Table 2: Results from the MOS test for the thematicity-based


prosody enrichment module

A t-test shows that overall (average) results for the auto-


matic prosody modifications (AUT = 3.30) achieve statistical
significance at p <0.05 compared to the default score (DEF =
3.01).

43
8. References [16] D. Prez-Marn and I. Pascual-Nieto, “An exploratory study on
how children interact with pedagogic conversational agents,” Be-
[1] B. L. Mencı́a, D. D. Pardo, A. H. Trapote, and L. A. H. Gómez, haviour & Information Technology, vol. 32, no. 9, pp. 955–964,
“Embodied Conversational Agents in Interactive Applications for 2013.
Children with Special Educational Needs,” in Technologies for In-
clusive Education: Beyond Traditional Integration Approaches, [17] P. Wargnier, G. Carletti, Y. Laurent-Corniquet, S. Benveniste,
D. Griol Barres, Z. Callejas Carrión, and R. L.-C. Delgado, Eds. P. Jouvelot, and A. S. Rigaud, “Field evaluation with cognitively-
Hershey, USA: IGI Global, 2013, pp. 59–88. impaired older adults of attention management in the Embodied
Conversational Agent Louise,” in IEEE International Conference
[2] A. Ortiz, M. del Puy Carretero, D. Oyarzun, J. J. Yanguas, on Serious Games and Applications for Health, SeGAH 2016, Or-
C. Buiza, M. F. Gonzalez, and I. Etxeberria, Elderly Users in lando, Florida, USA, 2016, pp. 1–8.
Ambient Intelligence: Does an Avatar Improve the Interaction?
Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 99– [18] A. W. Black and P. A. Taylor, “The Festival Speech Syn-
114. thesis System: System documentation,” Human Commun-
ciation Research Centre, University of Edinburgh, Scot-
[3] L. Wanner, E. André, J. Blat, S. Dasiopoulou, M. Farrús, land, UK, Tech. Rep. HCRC/TR-83, 1997, avaliable at
T. Fraga, E. Kamateri, F. Lingenfelser, G. Llorach, O. Martı́nez, http://www.cstr.ed.ac.uk/projects/festival.html.
G. Meditskos, S. Mille, W. Minker, L. Pragst, D. Schiller,
A. Stam, L. Stellingwerff, F. Sukno, B. Vieru, and S. Vrochidis, [19] M. Schröder and J. Trouvain, “The German Text-to-Speech
“KRISTINA: A Knowledge-Based Virtual Conversation Agent,” Synthesis System MARY: A Tool for Research, Development and
in Proceedings of the 15th International Conference on Practi- Teaching,” International Journal of Speech Technology, vol. 6,
cal Applications of Agents and Multi-Agent Systems (PAAMS), no. 4, pp. 365–377, 2003. [Online]. Available: http://mary.dfki.de
Oporto, Portugal, 2017. [20] F. Kügler, B. Smolibocki, and M. Stede, “Evaluation of informa-
[4] M. Domı́nguez, M. Farrús, and L. Wanner, “Thematicity-based tion structure in speech synthesis : The case of product recom-
Prosody Enrichment for Text-to-Speech Applications,” in Pro- mender systems perception,” in ITG Conference on Speech Com-
ceedings of the 9th International Conference on Speech Prosody munication, IEEE, 2012, pp. 26–29.
2018 (SP2018), Poznań, Poland, 2018. [21] M. Domı́nguez, M. Farrús, and L. Wanner, “Combining acous-
tic and linguistic features in phrase-oriented prosody prediction,”
[5] D. Meurers, R. Ziai, N. Ott, and J. Kopp, “Evaluating Answers to
in Proceedings of the 8th International Conference on Speech
Reading Comprehension Questions in Context: Results for Ger-
Prosody, Boston, USA, 2016, pp. 796–800.
man and the Role of Information Structure,” in Proceedings of the
TextInfer 2011 Workshop on Textual Entailment, ser. TIWTE ’11. [22] L. Wanner, B. Bohnet, and M. Giereth, “Deriving the Commu-
Stroudsburg, PA, USA: Association for Computational Linguis- nicative Structure in Applied NLG,” in Proceedings of the 9th Eu-
tics, 2011, pp. 1–9. ropean Workshop on Natural Language Generation at the Bian-
nual Meeting of the European Chapter of the Association for
[6] M. Steedman, “Information structure and the syntax-phonology Computational Linguistics, 2003, pp. 100–104.
interface,” Linguistic inquiry, vol. 31, no. 4, pp. 649–689, Fall
2000. [23] M. Ballesteros, B. Bohnet, S. Mille, and L. Wanner, “Data-driven
sentence generation with non-isomorphic trees,” in Proceedings
[7] M. Haji-Abdolhosseini and S. Müller, “Constraint-Based Ap- of the Annual Conference of the North American Association
proach to Information Structure and Prosody Correspondence,” for Computational Linguistics – Human Language Technologies
in Proceedings of the 10th International Conference on Head- (NAACL – HLT), 2015.
Driven Phrase Structure Grammar. CSLI Publications, 2003,
pp. 143–162. [24] B. Bohnet, A. Burga, and L. Wanner, “Towards the annotation of
penn treebank with information structure,” in Proceedings of the
[8] I. A. Mel’čuk, Communicative Organization in Natural Lan- Sixth International Joint Conference on Natural Language Pro-
guage: The semantic-communicative structure of sentences. Am- cessing, Nagoya, Japan, 2013, pp. 1250–1256.
sterdam, Philadephia: Benjamins, 2001.
[25] B. Bohnet and J. Nivre, “A Transition-Based System for Joint
[9] M. Domı́nguez, M. Farrús, A. Burga, and L. Wanner, “Using hier- Part-of-Speech Tagging and Labeled Non-Projective Dependency
archical information structure for prosody prediction in content- Parsing,” in Proceedings of the 2012 Joint Conference on Empiri-
to-speech applications,” in Proceedings of the 8th International cal Methods in Natural Language Processing and Computational
Conference on Speech Prosody, Boston, USA, 2016, pp. 1019– Natural Language Learning (EMNLP-CoNLL ’12), Jeju Island,
1023. Korea, 2012, pp. 1455–1465.
[10] M. Domı́nguez, M. Farrús, and L. Wanner, “Compilation of cor- [26] ——, “he Best of BothWorlds – A Graph-based Completion
pora to study the information structureprosody interface,” in 11th Model for Transition-based Parsers,” in Proceedings of the 13th
edition of the Language Resources and Evaluation Conference Conference of the European Chapter of the Association for Com-
(LREC2018), Mijazaki, Japan, 2018. putational Linguistics (EACL), Avignon, France, 2012, pp. 77–87.
[11] M. Halliday, “Notes on Transitivity and Theme in English, Parts [27] S. Brants, S. Dipper, P. Eisenberg, S. Hansen, E. König, W. Lez-
1-3,” Journal of Linguistics, vol. 3, no. 1, pp. 37–81, 1967. ius, C. Rohrer, G. Smith, and H. Uszkoreit, “TIGER: Linguis-
tic Interpretation of a German Corpus,” Journal of Language and
[12] R. Schwarzschild, “Givenness, avoidf and other constraints on the
Computation, no. 2, pp. 597–620, 2004.
placement of accent,” Natural Language Semantics, vol. 7, no. 1,
pp. 141–177, 1999. [28] M. Domı́nguez, M. Farrús, and L. Wanner, “A thematicity-based
prosody enrichment tool for cts,” in Proceedings of the 18th An-
[13] E. Hajičová, B. Partee, and P. Sgall, Topic-Focus Articulation, Tri- nual Conference of the International Speech Communication As-
partite Structures, and Semantic Content. Kluwer Academic sociation (INTERSPEECH 2017), Stockholm, Sweden, 2017, pp.
Publishers, Dordrecht, 1998. 3421–2.
[14] H. H. Clark and S. E. Haviland, “Comprehension and the given- [29] P. Taylor and A. Isard, “SSML: A Speech Synthesis Markup Lan-
new contract,” Discourse production and comprehension. Dis- guage,” Speech Communication, vol. 21, no. 1-2, pp. 123–133,
course processes: Advances in research and theory, vol. 1, pp. February 1997.
1–40, 1977.
[15] M. Vanrell, I. Mascaró, F. Torres-Tamarit, and P. Prieto, “Intona-
tion as an Encoder of Speaker Certainty: Information and Con-
firmation Yes-No Questions in Catalan,” Language and Speech,
vol. 56, no. 2, pp. 163–190, 2013.

44
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Performance evaluation of front- and back-end techniques for ASV spoofing


detection systems based on deep features
Alejandro Gomez-Alanis1 , Antonio M. Peinado1 , Jose A. Gonzalez2 , and Angel M. Gomez1
1
University of Granada, Granada, Spain
2
University of Malaga, Malaga, Spain
{ agomezalanis,amp,amgg }@ugr.es, j.gonzalez@uma.es

Abstract features for spoofing detection [6, 7, 11]. This technique con-
sists of employing deep neural networks in the front-end of the
As Automatic Speaker Verification (ASV) becomes more anti-spoofing system which are fed by speech features, so that
popular, so do the ways impostors can use to gain illegal ac- the deep features extracted by the neural network are passed to a
cess to speech-based biometric systems. For instance, impos- classifier in order to make the final detection decision (genuine
tors can use Text-to-Speech (TTS) and Voice Conversion (VC) or spoof). The core idea is to take advantage of the nonlin-
techniques to generate speech acoustics resembling the voice of ear modeling and discriminative capabilites of deep neural net-
a genuine user and, hence, gain fraudulent access to the sys- works which have shown to be suitable for feature engineering
tem. To prevent this, a number of anti-spoofing countermea- [3], not only for spoofing detection, but also for speech recog-
sures have been developed for detecting these high technol- nition [4], speaker recognition [3], and speech synthesis [5].
ogy attacks. However, the detection of previously unforeseen In this work, we compare the performance of different fea-
spoofing attacks remains challenging. To address this issue, tures and back-ends in an anti-spoofing system which extracts
in this work we perform an extensive empirical investigation deep features [6] in order to detect VC and TTS attacks. This
on the speech features and back-end classifiers providing the anti-spoofing system employs a convolutional neural network
best overall performance for an antispoofing system based on (CNN) plus a recurrent neural network (RNN) and gets a sin-
a deep learning framework. In this architecture, a deep neural gle spoofing identity representation per utterance. Although a
network is used to extract a single identity spoofing vector per similar comparison has already been studied in [7], our study
utterance from the speech features. Then, the extracted vectors presents three important differences: (1) our anti-spoofing sys-
are passed to a classifier in order to make the final detection tem employs a CNN to extract convolutional features at the
decision. Experimental evaluation is carried out on the stan- speech frame level, (2) we compare the performance of classical
dard ASVSpoof2015 data corpus. The results show that classi- features, such as FBANKs and MFCCs, with the performance
cal FBANK features and Linear Discriminant Analysis (LDA) of the recent popular CQCC features [8], and (3) we combinate
obtain the best performance for the proposed system. different features and classifiers in order to find the combination
Index Terms: Automatic speaker verification, spoofing detec- which offers the best performance.
tion, deep neural networks, deep features, classifier. This paper is organized as follows. Section 2 describes the
features and back-ends we are going to compare in a CNN +
1. Introduction RNN anti-spoofing system. Then, in Section 3, we outline the
speech corpora, the network training, and the performance eval-
Automatic Speaker Verification (ASV) aims to authenticate the uation details. Section 4 discusses the results of the different
identity claimed by a given individual [1]. However, most ASV features and back-ends in the deep neural network based anti-
systems are vulnerable to spoofing attacks, in which an impos- spoofing system. Finally, we present the conclusions derived
tor try to gain fraudulent access to the system by presenting to from this research in Section 5.
the ASV system speech acoustics resembling the voice of a gen-
uine user. Four types of spoofing attacks have been identified
[2]: (i) replay (i.e. using pre-recorded voice of the target user), 2. System description
(ii) impersonation (i.e. mimicking the voice of the target voice), This section is devoted to the description of the anti-spoofing
and also either (iii) text-to-speech synthesis (TTS) or (iv) voice system. First, Section 2.1 describes different voice features:
conversion (VC) systems to generate artificial speech resem- FBANK, MFCC and CQCC. The neural network architecture
bling the voice of a legitimate user. The aim of this work is to for deep feature extraction is detailed in Section 2.2. Further-
develop robust anti-spoofing countermeasures for either VC or more, Section 2.3 describes different classifiers (back-ends):
TTS based attacks. Linear Discriminant Analysis (LDA), Support Vector Machine
The performance of anti-spoofing systems can meaning- (SVM), and One-Class Support Vector Machine (One-Class
fully vary depending on the voice features used to feed them. SVM).
Due to this, voice features have attracted the attention of a num-
ber of researchers [8, 9, 10]. However, anti-spoofing systems
2.1. Speech features
based on neural networks usually use classical voice features,
such as FBANKs, and to the best of our knowledge, the new As demonstrated in [11], traditional log MEL filterbank features
popular CQCC features have not been employed yet to feed (FBANK) are effective for detecting spoofing attacks with sys-
these types of systems. tems based on neural networks. These features are obtained by
In the last years, the technique of deep features extraction passing the Short Time Fourier Transform (STFT) magnitude
have been explored to obtain more discriminative and effective spectrum through a Mel-filterbank and applying a log opera-

45 10.21437/IberSPEECH.2018-10
tion. However, FBANK features are usually high-correlated.
One way to decorrelate these features is to apply the Discrete
Cosine Transform (DCT) to get the classical Mel Frequency
Cepstral Coefficient (MFCC) features.
In [8], CQCC features are proposed for spoofing detection,
which are obtained using the Constant Q Transform (CQT). The
Q factor is a measure of the selectivity of each filter and is de-
fined as the ratio between the center frequency and the band-
width of the filter. In contrast to the STFT, whose Q factor
increases when moving from low to high frequencies as the
bandwidth is the same for all filters, the bandwidth of the fil-
ters employed in the CQT is not constant, and this results in
getting a higher frequency resolution for low frequencies and a
higher temporal resolution for high frequencies. In this manner,
the CQCC features try to imitate the human perception system
which is known to approximate a constant Q factor between
500Hz and 20kHz [20].
In this work, we employ the classical FBANK and MFCC Figure 1: Front-end architecture of the anti-spoofing system
features, as well as the popular CQCC features, to feed the anti- which extracts a spoofing identity vector per utterance (N rep-
spoofing system. resents the number of context windows per utterance). This sys-
tem is proposed in [6].
2.2. Front-end

The front-end architecture of the anti-spoofing system is shown 2.3. Back-end


in Fig. 1. A context window of W frames (centered at the frame
being processed) is used to obtain the input signal spectral fea- After deep feature extraction, every utterance is represented by
tures which are fed into the system. Then, the CNN provides a single spoofing identity vector. A back-end classifier is then
a deep feature vector per window, and all deep features vectors applied on these vectors to do the final detection decision. In
of the considered utterance are processed by the RNN which this section three classifiers will be tested: LDA, SVM and one-
computes an embedding vector for the whole utterance. We call class SVM.
this the spoofing identity vector. Since the front-end is trained
to perform utterance-level classification of the attacks, this em- A. Linear Discriminant Analysis
bedding vector should provide more discriminative information LDA assumes that each class density can be modeled as a mul-
for spoofing detection than the raw speech features. tivariate Gaussian
In this architecture, the CNN plays the role of a frame-level
deep feature extractor providing one feature vector for each con-
1 1 T
Σ−1
text window of W frames. In order to this, the CNN acts as a N (x|µk , Σk ) = p 1 e− 2 (x−µk ) k
(x−µk )
, (2)
classifier whose task consists of determining whether the input (2π) |Σk |
2 2

feature are either genuine or belong to one of the K spoofing


attacks (S1, S2, ..., SK) present in the training set. This CNN where Σk and µk is the covariance and mean for class k, and
uses 2 convolutional and pooling layers as feature extractors, p is the dimension of the identity vectors. Moreover, the LDA
followed by 2 fully connected layers with a softmax layer of model assumes every class shares the same covariance, that is,
K + 1 neurons as classification layer. To prevent overfitting, we Σk = Σ, ∀k. The goal of LDA is to find a transformation
used an annealed dropout training procedure [17]. In annealed which maximizes the distance between classes while minimiz-
dropout, the dropout probability of the nodes in the network ing the spreading within each class. This can be formulated
is decreased as training progresses. In this work, the annealed as a diagonalization problem where matrix Σb Σ (Σb is the
function reduces the dropout rate from an initial rate prob[0] to between-class covariance) is diagonalized, so the transforma-
zero over N steps with constant rate. The dropout probability tion can be built from the resulting eigenvectors.
prob[t] at epoch t is given as: Our LDA classifier uses K +1 classes which represent gen-
uine speech and the K known spoofing attacks considered in the
 training set. In this way, the LDA assigns a genuine speech con-
t  fidence score to each utterance, which is then used for binary
prob[t] = max 0, 1 − prob[0]. (1)
N decision (spoof or genuine) during the evaluation.
As shown in Fig. 1, the deep features obtained from the
B. Support Vector Machine
CNN are fed into an RNN, which computes the anti-spoofing
identity vector of the utterance. The main advantage of using A support vector machine (SVM) separates data points in a high
an RNN, based on gated recurrent units (GRU) [16], is its abil- dimensional space defined by a kernel function. In this manner,
ity for learning the long-term dependencies of the subsequent we first obtain a binary function that describes the probability
deep feature vectors. Finally, a fully connected layer contain- density function where the genuine data lives. This function re-
ing K + 1 neurons (one per class: genuine, S1, S2, ..., SK) is turns +1 in the small region corresponding to the genuine speech
connected to the output of the last time step, followed by a soft- data and -1 elsewhere. Thus, the core idea of SVM is to esti-
max layer. The state of the last time step represents the single mate the hyperplane with the largest separation margin between
identity spoofing vector of the whole utterance. the two classes.

46
Table 1: Structure of the ASVspoof2015 data corpus divided by
the training, development and evaluation sets [14].
# Speakers # Utterances
Subset
Male Female Genuine Spoofed
Training 10 15 3750 12,625
Development 15 20 3497 49,875
Evaluation 20 26 9404 184,000

In this work, this classifier is used to classify the spoofing


identity vectors obtained by the front-end system, where +1 in-
dicates genuine speech and -1 indicates spoofed speech.

C. One-Class Support Vector Machine


Complex classifiers may overfit the training spoofed data. To
create a spoof-independent system, we also test a derivative
model that can only be trained on genuine speech data. This is
a type of one-class SVMs [12], usually employed to find abnor- Figure 2: Comparison of average EERs (%) between known
mal data. This was first tried in spoofing detection with phase- and unknown spoofing attacks on evaluation dataset for differ-
based features in [13]. This kind of SVM is also applied here to ent features and back-ends, including FBANK, MFCC, CQCC,
classify the spoofing identity vectors, and only genuine speech LDA, SVM and SVM One-Class.
data has been used to train the one-class SVM model.

3. Experimental framework

To evaluate the performance of several features and back-ends in an anti-spoofing system based on neural networks, the ASVspoof 2015 dataset [14], a standard data corpus for research on spoofing detection, was employed. Details about the methodology followed for training and testing are also given in this section.

3.1. Speech corpus

The ASVspoof 2015 corpus [14] defines three datasets (training, development and evaluation), each one containing a mix of genuine and spoofed speech. The structure of these three datasets is shown in Table 1. Spoofing attacks were generated either by TTS or VC. A total of 10 types of spoofing attacks (S1 to S10) are defined: three of them are implemented using TTS (S3, S4 and S10), and the remaining seven (S1, S2, S5, S6, S7, S8 and S9) using different VC systems. Attacks S1 to S5 are referred to as known attacks, since the training and development sets contain data for these types of attacks, while attacks S6 to S10 are referred to as unknown attacks, because they only appear in the evaluation set. More details about this corpus can be found in [14].

3.2. Spectral Analysis

The frame window size is 25 ms with 10 ms of frame shift. Moreover, the size of the context window is W = 31 frames, and the number of filters used to get the spectral features is M = 48 filters. In contrast to [7] and [11], we use 48-dim static spectral features without delta and acceleration coefficients, as we have realized that the context window of 31 frames is already exploiting the correlations between consecutive frames. Therefore, a higher spectral resolution is achieved while the size of the spectral feature vector is smaller than in [7].

3.3. Training

The CNN and RNN networks are trained using the Adam optimizer [18]. As there are K = 5 known spoofing attacks in the data corpus, the softmax layer of both CNN and RNN contains K + 1 = 6 neurons (one per class). The two fully connected layers of the CNN have 1024 sigmoid neurons, and the layer of the RNN has 1920 GRUs, which is the length of the identity spoofing vector of the whole utterance. To prevent the problem of overfitting, the initial dropout probabilities are 50% and 40% from the first to the last fully connected layer, respectively. Also, early stopping is applied in order to stop the training process when no improvement of the cross entropy is obtained after 15 iterations. All the specified parameters of the system have been optimized using the validation set of the data corpus [14].

3.4. Performance evaluation

The equal error rate (EER) is used to evaluate the system performance. As described in the ASVspoof 2015 challenge evaluation plan [14], the EER was computed independently for each spoofing algorithm and then the average EER across all attacks was used. To compute the average EER, we used the Bosaris toolkit [15].

4. Experimental results

4.1. Comparison of features and back-ends

Table 2 shows the detailed results of the different features (FBANK, MFCC and CQCC) and classifiers (LDA, SVM and SVM One-Class) in the described CNN + RNN anti-spoofing system. Furthermore, a summary of these results is shown in Fig. 2. The best performance is obtained with the combination of FBANK features and the LDA classifier. On average, the FBANK features obtain the best performance independently of the back-end, although MFCC features perform better on the SVM One-Class considering all the attacks. The CQCC features achieve the best average performance in the known attacks with LDA and SVM back-ends, but these two combinations perform very poorly in the S10 attack.

Regarding the back-ends, the LDA outperforms the other two classifiers in the known and unknown attacks. Moreover, the binary SVM classifier performs much better than SVM One-Class using FBANK and CQCC features.
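To make the evaluation protocol of Section 3.4 concrete, the sketch below computes an approximate EER per spoofing attack from genuine/spoofed scores and averages it across attacks (a simplified stand-in for the Bosaris toolkit mentioned above, not the authors' evaluation code; the scores and attack labels are synthetic):

```python
import numpy as np

def eer(genuine_scores, spoof_scores):
    """Approximate equal error rate: value of max(FAR, FRR) at its minimum over thresholds."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best = 1.0
    for th in thresholds:
        far = np.mean(spoof_scores >= th)    # spoofed trials accepted as genuine
        frr = np.mean(genuine_scores < th)   # genuine trials rejected
        best = min(best, max(far, frr))      # approximates the FAR = FRR crossing point
    return best

# Synthetic example: one pool of genuine scores and per-attack spoof scores (S1-S10).
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 500)
attacks = {f"S{i}": rng.normal(0.0, 1.0, 500) for i in range(1, 11)}

per_attack = {name: eer(genuine, scores) for name, scores in attacks.items()}
print({k: round(100 * v, 2) for k, v in per_attack.items()})
print("average EER (%):", round(100 * np.mean(list(per_attack.values())), 2))
```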

Table 2: Comparison on evaluation dataset for each spoofing attack in terms of (%) EER
Features  Back-end  |  Known attacks: S1  S2  S3  S4  S5  Avg.  |  Unknown attacks: S6  S7  S8  S9  S10  Avg.  |  Total Avg.
LDA 0.01 0.09 0.00 0.00 0.11 0.04 0.66 0.21 0.00 0.36 7.16 1.68 0.86
FBANK SVM 0.03 0.13 0.00 0.01 0.22 0.08 0.77 0.34 0.18 0.48 10.46 2.44 1.26
SVMOne 0.36 2.07 0.17 0.12 4.37 1.42 5.44 1.34 0.34 1.53 8.23 3.38 2.40
LDA 0.06 0.08 0.00 0.00 0.06 0.04 0.11 0.12 0.00 0.05 15.43 3.14 1.59
MFCC SVM 0.05 0.19 0.01 0.01 0.23 0.10 0.22 0.21 0.05 0.15 29.58 6.04 3.07
SVMOne 0.43 1.97 0.12 0.12 2.11 0.95 3.38 2.07 0.06 1.03 11.09 3.53 2.24
LDA 0.04 0.04 0.00 0.00 0.04 0.02 0.13 0.51 0.05 0.08 11.76 2.51 1.27
CQCC SVM 0.03 0.01 0.01 0.01 0.02 0.01 0.06 0.37 0.07 0.02 21.52 4.41 2.21
SVMOne 1.72 6.14 0.49 0.47 7.34 3.23 10.13 9.67 1.39 6.50 8.54 7.25 5.24

Figure 3: Comparison on evaluation dataset for known and unknown spoofing attacks in terms of average (%) EER

According to these results, we propose an anti-spoofing system which employs FBANK features, a CNN + RNN architecture to get the spoofing identity vector of an utterance, and an LDA classifier to make the final detection decision (spoof or genuine).

4.2. Comparative performance

A comparison of our proposal with other popular techniques from the literature is presented in Fig. 3.

The first two systems, CQCC + GMM [8] and CFCC-IF + GMM [9], employ the features which perform best for spoofing detection using a Gaussian Mixture Model (GMM) as back-end. The other four systems are the most popular anti-spoofing systems based on deep learning frameworks. The FBANK + CNN + LDA system has been proposed in [11], but as its performance is not provided in this reference for the clean scenario, we have evaluated instead our proposed system removing the RNN and averaging the deep features for getting the spoofing identity vector of the utterance as in [11].

The CQCC + GMM system achieves the best average performance, although our proposed system (FBANK + CNN + RNN + LDA) achieves the best results for the known attacks. Compared to the rest of deep learning systems (Spectro + CNN + RNN [19], FBANK + DNN + LDA [7], FBANK + RNN + SVM [7] and FBANK + CNN + LDA), our proposal outperforms all of them for the known and unknown attacks. In particular, the result of our proposal for the S10 attack is quite noteworthy. Furthermore, our proposed system also achieves a lower EER in almost all attacks than the CFCC-IF system [9], performing 0.45% better on average when considering all the attacks.

Although the CQCC + GMM system outperforms all the systems in a clean condition training scenario, our previous work [6] and reference [11] demonstrate that CQCC + GMM performs worse than the systems based on deep features when a noisy scenario is considered. Moreover, reference [6] shows that, when a typical multicondition training for noise robustness is applied, our system, based on deep identity vectors, clearly outperforms CQCC + GMM even in clean conditions.

5. Conclusions

This paper has evaluated different features and classifiers in order to find the combination which offers the best performance for an anti-spoofing system based on a deep learning framework. The experimental results have shown that FBANK features and an LDA obtain the best performance for systems based on the extraction of deep features, rather than the popular CQCC features and other types of classifiers, such as binary SVM and SVM One-Class. Furthermore, the proposed system (FBANK + CNN + RNN + LDA) outperforms the rest of the deep learning systems of the literature.

6. Acknowledgements

This work has been supported by the Spanish MINECO/FEDER Project TEC2016-80141-P and the Spanish Ministry of Education through the National Program FPU (grant reference FPU16/05490). We also acknowledge the support of NVIDIA Corporation with the donation of a Titan X GPU.

7. References
[1] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li,
“Spoofing and countermeasures for speaker verification: A survey,”
in Speech Communication, vol. 66, pp. 130–153, 2015.
[2] Z. Wu et al., “Anti-spoofing for text-independent speaker verifi-
cation: An initial database, comparison of countermeasures, and
human performance,” IEEE/ACM Transactions on Audio, Speech,
and Language Processing, vol. 24, no. 4, pp. 768–783, 2016.
[3] Liu, Y., Qian, Y., Chen, N., Fu, T., Zhang, Y., Yu, K., “Deep fea-
ture for text-dependent speaker verification”, in Speech Communi-
cation, vol. 13, pp. 1–13, 2015.
[4] Grézl, F., Karafiát, M., Kontár, S., Černocký, J., “Probabilistic and
bottle-neck feature for LVCSR of meetings,” in Proc. IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2007, pp. 757–760.
[5] Wu, Z., King, S., “Improving trajectory modelling for dnn-based
speech synthesis by using stacked bottleneck features and min-
imum trajectory error training,” in IEEE/ACM Trans. on Audio,
Speech and Language Processing, vol. 24, pp. 1255–1265, 2016.
[6] Alejandro Gomez-Alanis, Antonio M. Peinado, Jose A. Gonzalez,
and Angel M. Gomez, “A Deep Identity Representation for Noise
Robust Spoofing Detection,” in Proc. InterSpeech, 2018.
[7] Y. Qian, N. Chen, and K. Yu, “Deep features for automatic spoofing
detection,” in Speech Communication, vol. 85, pp. 43–52, 2016.
[8] M. Todisco, H. Delgado, and N. Evans, “A new feature for auto-
matic speaker verification anti-spoofing: Constant Q cepstral coef-
ficients,” Proc. Odyssey, 2016, pp. 249–252.
[9] T. B. Patel and H. A. Patil, “Combining evidences from mel cep-
stral, cochlear filter cepstral and instantaneous frequency features
for detection of natural vs. spoofed speech,” Proc. Interspeech,
2015, pp. 2062–2066.
[10] Muckenhirn, H., Korshunov, P., Magimai-Doss, M., Marcel, S.,
“Long-Term Spectral Statistics for Voice Presentation Attack De-
tection,” in IEEE/ACM Trans. on Audio, Speech and Language Pro-
cessing, vol. 25, pp. 2098–2111, 2017.
[11] Y. Qian, N. Chen, H. Dinkel, and Z. Wu, “Deep Feature Engi-
neering for Noise Robust Spoofing Detection,” IEEE/ACM Trans-
actions on Audio, Speech and Language Processing, vol. 25, no.
10, pp. 1942–1955, 2017.
[12] Scholkopf, B., Williamson, R. C., Smola, A. J., et al., “Support
vector method for novelty detection,” in Proc. NIPS, 2000, pp. 582–
588.
[13] Jesús Villalba, Antonio Miguel, Alfonso Ortega, and Eduardo
Lleida, “Spoofing detection with dnn and one-class svm for the
asvspoof 2015 challenge,” in Proc. InterSpeech, 2015, pp. 2067–
2071.
[14] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M.
Sahidullah, and A. Sizov, “ASVspoof 2015: The first automatic
speaker verification spoofing and countermeasures challenge,” in
Proc. InterSpeech, 2015, pp. 2037–2041.
[15] N. Brümmer and E. deVilliers, “The BOSARIS toolkit: Theory,
algorithms and code for surviving the new DCF,” in NIST SRE11
Speaker Recognition Workshop, Atlanta, Georgia, USA, Dec. 2011,
pp. 1–23.
[16] Kyunghyun Cho, et al., “Learning Phrase Representations using
RNN Encoder-Decoder for Statistical Machine Translation,” Proc.
Empirical Methods in Natural Language Processing, 2014, pp.
1724–1734.
[17] S. Rennie, V. Goel, and S. Thomas, “Annealed dropout training of
deep networks,” in Proc. Spoken Language Technology Workshop,
2014, pp. 159–164.
[18] D. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv:1412.6980, 2014.
[19] Chunlei Zhang, Chengzhu Yu, and John H. L. Hansen, “An in-
vestigation of Deep-Learning Frameworks for Speaker Verification
Antispoofing,” IEEE Journal of Selected Topics in Signal Process-
ing, vol. 11, pp. 684–694, 2017.
[20] B. C. J. Moore, “An Introduction to the Psychology of Hearing,” BRILL, 2003.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

The observation likelihood of silence: analysis and prospects for VAD applications
Igor Odriozola, Inma Hernaez, Eva Navas, Luis Serrano, Jon Sanchez
AHOLAB, University of the Basque Country UPV/EHU
{igor,inma,eva,lserrano,ion}@aholab.ehu.eus

Abstract

This paper presents research on the behaviour of the observation likelihoods generated by the central state of a silence HMM (Hidden Markov Model) trained for Automatic Speech Recognition (ASR) using cepstral mean and variance normalization (CMVN). We have seen that the observation likelihood shows a stable behaviour under different recording conditions, and this characteristic can be used to discriminate between speech and silence frames. We present several experiments which prove that the mere use of a decision threshold produces robust results for very different recording channels and noise conditions. The results have also been compared with those obtained by two standard VAD systems, showing promising prospects. All in all, observation likelihood scores could be useful as the basis for the development of future VAD systems, with further research and analysis to refine the results.

Index Terms: VAD, observation likelihood, cepstral normalization

1. Introduction

Voice activity detection (VAD) is an important issue in Automatic Speech Recognition (ASR) or ASR-based systems. It allows the systems to reduce the computation cost and, as a consequence, the response time of the decoding process, by only passing speech frames [1]. If the access to the system is intended to be universal, the VAD has to cope with different noise levels, with no —or little— loss in accuracy. Indeed, the greatest challenge for the current ASR systems is to cope with background noise in the input speech signal [2].

A large number of speech features and combinations have been proposed for VAD [3]. Gaussian Mixture Models (GMM) and Hidden Markov Models (HMMs) have been tested in this context [4][5]. Recently, the use of classifiers has been very common: decision trees (DT) [6], Support Vector Machines (SVM) [7] and hybrid SVM/HMM architectures [8]. More recently, neural networks (NN) have appeared in the literature outperforming the previous designs [9][10][11]. However, these approaches are complex and do not work in real time.

Little research has been done using cepstral normalization for VAD proposals, although it proved to be rather discriminative already in [12]. Here, we introduce some research on the use of observation likelihoods for VAD, applying Cepstral Mean and Variance Normalization (CMVN). We analyse the behaviour of the observation likelihoods generated by the GMM in the central state of the silence HMMs trained for ASR. Results show that it is a promising basis for future prospects.

The next section is a study of different aspects of the observation likelihood scores. Section 3 describes the databases and metrics used for the experiments. Then, some VAD experiments are shown in section 4. Finally, some conclusions and future prospects are explained in section 5.

2. The observation likelihood

In speech recognition, audio segments corresponding to the same recognition unit (word, phone, triphone etc., even silence or non-speech) are gathered and processed, in order to extract acoustic parameters from them —typically Mel-frequency cepstral coefficients (MFCC)— and train a different acoustic model for each unit. A very popular acoustic model is the HMM, since it not only models the likelihood of a new observation vector, but also the sequentiality of the observations.

Usually, observation likelihoods are generated by the GMM belonging to each HMM state j. For an observation vector o_t, the observation likelihood b_j of a GMM is calculated as shown in equation 1.

b_j(o_t) = Σ_{m=1}^{M} c_{jm} · N(o_t; μ_{jm}, Σ_{jm})   (1)

where M is the number of mixture components, c_{jm} is the weight of the mth component and N(·; μ, Σ) is a multivariate Gaussian with mean vector μ and covariance matrix Σ.

In this work, the observation likelihoods have been obtained from the silence HMM trained using the Basque Speecon-like database [13], specifically the close-talk channel.

2.1. The acoustic model for silence

The HMM topology chosen for silence frames has three states, left-to-right, allowing the right-end state to connect back with the left-end state. It was trained with 13 MFCCs and 13 first and 13 second order derivatives as acoustic parameters, and 32-mixture GMMs. The frame length is 25 ms with a shift of 10 ms.

CMVN was applied to MFCCs, computing global means and variances from each recording session. For N cepstral vectors y = {y_1, y_2, ..., y_N}, their mean μ_N and variance σ²_N vectors are calculated as defined in equations 2 and 3, respectively.

μ_N(i) = (1/N) Σ_{n=1}^{N} y_n(i)   (2)

σ²_N(i) = (1/N) Σ_{n=1}^{N} (y_n(i) − μ_N(i))²   (3)

where i is the ith component of the vector.

The cepstral features are then normalized using the calculated mean and variance vectors, as given in equation 4. Thus, each normalized feature has zero mean and unit variance.

ŷ_n(i) = (y_n(i) − μ_N(i)) / σ_N(i)   (4)

10.21437/IberSPEECH.2018-11
2.2. The impact of CMVN

The use of CMVN has a significant impact on the curves that observation likelihoods form. When testing a sample signal and computing frame by frame the observation likelihoods at each state of the silence HMM, very different curves are obtained depending on whether CMVN is applied or not. Figure 1 illustrates this difference. The middle and bottom diagrams show the curves formed by the observation log-likelihoods generated by each HMM state s0, s1 and s2, without and with normalization respectively, through an utterance composed of four words. In this case, the normalization has been performed using the means and variances computed from the file.

Figure 1: Spectrogram (top) and observation log-likelihoods along time (frames) generated by the left state (s0), central state (s1) and right state (s2) of the silence HMM without CMVN (middle) and applying CMVN (bottom).

The curves in the bottom diagram (with CMVN), compared with the ones in the middle diagram (without CMVN), look more abrupt. This fact can be used to better discern between speech and non-speech.

2.3. The central state of the silence HMM

In any three-state HMM, the central state is a priori the most stable state of the model, since the left and right states have to cope with transitions between models. It makes sense that the same will happen to the silence HMM, where left and right states have to model transitions between silence and speech.

Looking back at Figure 1, we can see that, indeed, the curves generated by the central state (s1) are, in both cases (with and without cepstral normalization), much more discriminative than the curves corresponding to the states at the ends, which are more irregular.

2.4. Robustness against different SNR values

Another point of interest for a VAD is its robustness to different recording conditions. As an example, we have chosen four signals from the Spanish SpeeCon database [14] to illustrate the impact of the recording distance on the observation likelihood curves. These four signals correspond to the same utterance, but were recorded by means of four different microphones: a headset (channel C0), a lavalier (channel C1), a medium-distance cardioid microphone (0.5-1 meter, channel C2) and a far-distance omnidirectional microphone (channel C3). Each of these channels represents a different SNR, C0 being the cleanest (around 20 dB) and C3 the noisiest (0 dB).

Figure 2 shows the observation log-likelihoods generated by the central state of the silence HMM trained with the Basque Speecon-like database. The utterance is the same as the one in Figure 1 (note that the signal in Figure 1 corresponds to the C1 signal in Figure 2). The darkest curve corresponds to the C0 channel and the lightest one to the C3 channel.

Figure 2: Observation log-likelihoods along time obtained at the central state (s1) of the silence HMM when processing different channels (C0, C1, C2, C3).

The curves show that, as expected, a degradation occurs when the signals recorded at farther distances are processed, but even so the curves remain rather discriminative. For C3 signals, the most adverse effect occurs at the initial and ending phones, where, depending on the phone, likelihoods can be very similar to those of the noisy silence. This happens mostly when the initial phone is a noisy phone. However, the curves show a good behaviour for C1 and C2, with likelihood profiles very similar to those obtained for C0 signals.

3. Data preparation

To assess the stability of the observation likelihood curves generated by the central state of the silence HMM, a VAD accuracy experiment has been carried out, setting different thresholds to label frames as speech or silence.

3.1. The databases

Two databases have been chosen for the experiments: first, the Noisy TIMIT speech database [15], to analyse whether a threshold could be set for different SNR conditions. The second database is the ECESS subset of the Spanish Speecon database [16], which has been used to test the validity of that threshold.

1. Noisy TIMIT speech database: it contains approximately 322 hours of speech from the TIMIT database [17] modified with different additive noise levels. However, we have chosen only babble and white noises, as the most natural ones. Noise levels vary in 5 dB steps, ranging from 50 to 5 dB. The database contains 630 different

speakers, with 10 utterances per speaker: 6300 files for
each noise level. The total speech content in the database
is 86.57 % (not well balanced), and the label files are the
ones belonging to the classic TIMIT database. All audio
files are provided as single-channel 16 kHz FLAC, but
have been converted to 16-bit PCM.
2. ECESS subset of the Spanish Speecon database: it was
used in the ECESS evaluation campaign of voice activ-
ity and voicing detection in 2008. It includes 1020 ut-
terances recorded in different environments (office, en-
tertainment, car and public place) distributed among the
C0 , C1 , C2 and C3 subsets (total number of files: 4080).
There are 60 different speakers each of which utters
17 sentences. The total speech content in the database
is 55.77 % (well balanced), and it contains reference
speech and silence labels specifically designed to assess
different VAD algorithms. The signals in the database
were recorded at 16 kHz and 16 bit per sample.
Each file’s features have been normalized off-line, with the
means and variances calculated from the file itself. The on-line
performance has been left for future research.

3.2. Error metrics

The VAD accuracy experiment consists in evaluating the ability of the system to discriminate between speech and silence segments at different SNR levels, in terms of silence error rate (ER0) and speech error rate (ER1). These two rates are computed as the fractions of the silence frames and speech frames that are incorrectly classified (N_{0,1} and N_{1,0}, respectively) among the number of real silence frames and speech frames in the whole database (N_0^ref and N_1^ref, respectively), as shown in equation 5. In addition, the TER (total error rate) has also been computed as the average of ER0 and ER1 (equation 6).

ER0 = (N_{0,1} / N_0^ref) × 100 ;   ER1 = (N_{1,0} / N_1^ref) × 100   (5)

TER = (ER0 + ER1) / 2   (6)

A minimum duration of 15 frames both for speech and silence segments was set. This value was empirically chosen after some preliminary experiments.

4. VAD experiments

Initially, we have analysed whether a threshold can be set for VAD purposes, considering the various SNR values. Then, we have tested that threshold in a separate database, and, in addition, a validity test has been carried out comparing the results with those obtained with three standard VAD algorithms.

4.1. Analysis of the decision threshold

Different thresholds have been considered to label frames as speech or silence. Results are shown in Figure 3, both for babble noise (left) and white noise (right).

Figure 3: ER0 and ER1 (top) and TER (bottom) for different decision threshold values when testing the signals of SNR 50 to 5 dB in the babble noise subset (left) and the white noise subset (right) of the Noisy TIMIT database.

For the cleanest signals (SNR = 50 dB), the equal error rate (EER) points of the ER0 and ER1 curves are located near −200. However, as the SNR gets lower, the EER points move towards higher values. In the case of white noise, this shift reaches the −120 value for 5 dB.

Regarding the error rates, the minimum TERs are obtained at Th = −150, except for 5, 10 and 15 dB in the white noise subset, which occur at −100. Thus, we can consider the point of Th = −150 as the most valid threshold. Some ER0 and ER1 values obtained for Th = −150 are shown in Table 1.

Table 1: TER, ER0 and ER1 for Th = −150 on the signals of SNR 50, 35, 20 and 5 dB in the babble noise (left) and white noise (right) subsets of the Noisy TIMIT database.

          Babble                     White
        ER0     ER1     TER      ER0     ER1     TER
50dB    34.89   6.71    20.80    34.88   6.95    20.92
35dB    30.87   7.48    19.18    28.05   9.18    18.62
20dB    26.35   11.25   18.80    21.53   16.89   19.21
5dB     22.60   20.78   21.70    15.49   30.90   23.20

For Th = −150, the minimum ER1 is 6.71, at 50 dB. As expected, ER1 increases as the SNR decreases. However, we notice that the TER does not present the minimum at 50 dB, neither in the babble noise subset nor in the white noise subset, as might be expected.

4.2. Testing

The threshold calculated in the previous section has been applied to the files of the ECESS subset of the Spanish Speecon database. 4080 files have been tested (1020 in each Ci subset). Results are shown in Table 2.

The results obtained for the ECESS subset using the threshold calculated from the Noisy TIMIT are very good. Compared with the best result obtained for the Noisy TIMIT (see the 50 dB row in Table 1), much lower ER0 and ER1 have been obtained. The error rates, as expected, increase as the SNR decreases, although the best silence error rate is obtained for the C1 channel.
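As an illustrative sketch of this evaluation (not the authors' implementation): frame-wise thresholding of the silence log-likelihood at Th = −150, a simple 15-frame minimum-duration smoothing, and the ER0/ER1/TER metrics of equations (5)-(6). The likelihood values used here are synthetic.

```python
import numpy as np

def vad_decisions(sil_loglik, threshold=-150.0, min_dur=15):
    """Label frames as speech (1) when the silence log-likelihood falls below the
    threshold, then merge speech/silence runs shorter than min_dur frames into
    the previous segment (a simple smoothing heuristic, assumed for illustration)."""
    labels = (np.asarray(sil_loglik) < threshold).astype(int)
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if i - start < min_dur and start > 0:
                labels[start:i] = labels[start - 1]
            start = i
    return labels

def error_rates(pred, ref):
    """Eq. (5)-(6): ER0 (silence misclassified), ER1 (speech misclassified), TER."""
    pred, ref = np.asarray(pred), np.asarray(ref)
    er0 = 100.0 * np.sum((ref == 0) & (pred == 1)) / max(np.sum(ref == 0), 1)
    er1 = 100.0 * np.sum((ref == 1) & (pred == 0)) / max(np.sum(ref == 1), 1)
    return er0, er1, (er0 + er1) / 2.0

# Synthetic check: silence frames near -100, speech frames near -400.
rng = np.random.default_rng(0)
ref = np.repeat([0, 1, 0], [50, 100, 50])
loglik = np.where(ref == 1, rng.normal(-400, 30, ref.size), rng.normal(-100, 30, ref.size))
print(error_rates(vad_decisions(loglik), ref))
```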

Table 2: TER, ER0 and ER1 with Th = −150 on the signals of channels C0, C1, C2 and C3 in the Spanish Speecon database.

        ER0     ER1     TER
C0      6.21    2.74    4.48
C1      4.22    6.13    5.18
C2      7.10    6.00    6.55
C3      9.46    6.45    7.96

Additionally, a tuning has been performed for ER1 reduction. Indeed, for speech processing, it is important to reduce the ER1 as much as possible, so that the minimum number of speech frames are lost for the next stage. For that purpose, we have sought to reduce the impact of non-speech to speech boundaries, setting an additional margin of 5 and 10 frames around the speech segments. Results are shown in Table 3.

Table 3: TER, ER0 and ER1 for 5 and 10 frames long speech-segment margins, with Th = −150 for the signals of channels C0, C1, C2 and C3 in the ECESS subset of the Spanish Speecon database.

          5 frames                  10 frames
        ER0     ER1    TER       ER0     ER1    TER
C0      10.84   1.29   6.07      15.68   0.79   8.24
C1      7.94    3.47   5.71      12.42   2.30   7.36
C2      10.91   3.50   7.21      15.39   2.47   8.93
C3      13.29   3.95   8.62      17.59   2.89   10.24

The table shows that ER1 decreases and ER0 increases. TER increases as well, because ER0 increases faster than ER1 decreases. All in all, the use of a margin around speech segments allows decreasing ER1 significantly, with a not very significant resulting TER degradation.

4.3. Comparison with other systems

In order to validate the previous results, our results have been compared with the outcomes of three popular standard VAD algorithms carried out in a previous work [18]. These systems are standards defined by the ITU (International Telecommunication Union) and ETSI (European Telecommunications Standards Institute):

1. The VAD algorithm of the ITU G.729 system [19].
2. The AFE-FD (frame-dropping mechanism) algorithm implemented in ETSI AFE-DSR (Advanced Front-End for Distributed Speech Recognition) [20].
3. The AFE-NR (noise reduction system) algorithm implemented in ETSI AFE-DSR [20].

Table 4 shows the results obtained for the three VAD systems along with the proposed method (using Th = −150 and a margin of 10 frames), over the same dataset (4080 files from the ECESS subset). Regarding ER1, the AFE-FD gets better results, and also the AFE-NR for C0 and C1. However, both systems show the disadvantage of getting very high ER0 for all the channels (the lowest value is 38.10 %). This means that many silence frames will be sent to the recognizer. The ER0 in our results are between 12.42 and 17.59 %.

Table 4: Comparison of different VAD algorithm results at four SNR levels.

(a) Silence error rates (ER0)
        G.729   AFE-FD   AFE-NR   Prop.
C0      56.06   63.88    58.23    15.68
C1      70.23   54.75    55.96    12.42
C2      59.54   52.10    38.10    15.39
C3      70.49   50.10    47.65    17.59

(b) Speech error rates (ER1)
        G.729   AFE-FD   AFE-NR   Prop.
C0      3.63    0.03     0.62     0.79
C1      9.28    0.23     1.98     2.30
C2      18.19   0.48     4.83     2.47
C3      17.22   1.41     8.30     2.89

5. Conclusions

In this paper, we have assessed the usefulness of the observation likelihood generated by the central state GMM of a silence HMM trained using CMVN, as a possible basis on which to build a VAD system. We have seen that a good classification between speech and silence can be performed, just by setting a threshold in the curves that observation likelihoods form.

The silence HMM has been trained using the close-talk channel from the Basque Speecon-like database. Then, a threshold analysis has been carried out, processing the babble and white noise files of the Noisy TIMIT database. As a conclusion, we have noticed that the minimum error rates occur at the same likelihood point in 17 SNR values out of a total of 20. This point is the one we have chosen as the threshold.

This threshold has been tested with a separate database: the ECESS subset of the Spanish Speecon database. The results obtained for this database are even better than those obtained for the Noisy TIMIT, which leads us to think that the silence observation likelihood behaves similarly on different channels.

Additionally, the results of the test have been compared with three different standard VAD systems. Although the best speech error rates have not been achieved with the use of the decision threshold, we have got the best silence error rates. Our results are quite competitive; actually, the best total classification rates have been obtained.

As a final conclusion, competitive results are obtained just by setting a decision threshold on the silence observation likelihood curves. This fact has been applied in [21], where a method called Multi-Normalization Scoring (MNS) is used to exploit the discriminative potential of the observation likelihood scores. Robust on-line results are shown in that paper, where the scores obtained with MNS are classified with a Multi-Layer Perceptron (MLP). This issue and others related to the selection of the optimal threshold are currently being investigated in our laboratory.

6. Acknowledgements

This work has been partially supported by the EU (FEDER) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/FEDER, UE) and by the Basque Government under grant KK-2017/00043 (BerbaOla).

7. References

[1] M. K. Mustafa, T. Allen, and K. Appiah, Research and Development in Intelligent Systems XXXI. Springer International Publishing, 2014, ch. A Review of Voice Activity Detection Techniques for On-Device Isolated Digit Recognition on Mobile Devices, pp. 317–329.
[2] T. Virtanen, R. Singh, and B. Raj, Techniques for Noise Robustness in Automatic Speech Recognition, 1st ed. Wiley Publishing, 2012.
[3] S. G. Tanyer and H. Ozer, “Voice activity detection in nonstationary noise,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 478–482, 2000.
[4] J. Tatarinov and P. Pollák, “Hmm and ehmm based voice activity detectors and design of testing platform for vad classification,” Digital Technologies, vol. 1, pp. 1–4, 2008.
[5] H. Veisi and H. Sameti, “Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement,” vol. 6, no. 1, pp. 54–63, 2012.
[6] Ó. Varela, R. S. Segundo, and L. A. Hernández, “Combining
pulse-based features for rejecting far-field speech in a HMM-
based Voice Activity Detector,” vol. 37, no. 4, pp. 589–600, 2011.
[7] D. Enqing, L. Guizhong, Z. Yatong, and C. Yu, “Voice activity
detection based on short-time energy and noise spectrum adapta-
tion,” in ICSP 2002 – 6th International Conference on Signal Pro-
cessing Proceedings, August 26-30, Beijing, China, Proceedings.
IEEE, 2002, pp. 464–467.
[8] Y. W. Tan, W. J. Liu, W. Jiang, and H. Zheng, “Hybrid svm/hmm
architectures for statistical model-based voice activity detection,”
2014, pp. 2875–2878.
[9] T. Hughes and K. Mierle, “Recurrent neural networks for voice
activity detection,” 2013, pp. 7378–7382.
[10] S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing
convolutional neural networks for speech activity detection in
mismatched acoustic conditions,” 2014, pp. 2519–2523.
[11] Y. Obuchi, “Framewise speech-nonspeech classification by neural
networks for voice activity detection with statistical noise sup-
pression,” in ICASSP, 2016, pp. 5715–5719.
[12] M. Westphal, “The use of cepstral means in conversational speech
recognition,” in EUROSPEECH 1997 – 5th European Confer-
ence on Speech Communication and Technology, September 22-
25, Rhodes, Greece, Proceedings. ISCA, 1997, pp. 1143–1146.
[13] I. Odriozola, I. Hernaez, M. I. Torres, L. J. Rodriguez-Fuentes,
M. Penagarikano, and E. Navas, “Basque speecon-like and
Basque speechdat MDB-600: speech databases for the develop-
ment of ASR technology for Basque,” in LREC 2014, Ninth In-
ternational Conference on Language Resources and Evaluation,
May 26-31, Reykjavik, Iceland, Proceedings, 2014, pp. 2658–
2665.
[14] D. Iskra, B. Grosskopf, K. Marasek, H. van den, F. Diehl, and
A. Kiessling, “Speecon speech databases for consumer devices:
Database specification and validation,” in LREC 2002, Third In-
ternational Conference on Language Resources and Evaluation,
May 27-31, Las Palmas, Spain, Proceedings, 2002, pp. 329–333.
[15] A. Abdulaziz and V. Kepuska, “Noisy timit speech (ldc2017s04),”
3 2017. [Online]. Available: http://hdl.handle.net/11272/UFA9N
[16] B. Kotnik, P. Sendorek, S. Astrov, T. Koç, T. Çiloglu, L. D.
Fernández, E. R. Banga, H. Höge, and Z. Kacic, “Evaluation of
voice activity and voicing detection,” in INTERSPEECH 2008 –
8th Annual Conference of the International Speech Communica-
tion Association, September 22-26, Brisbane, Australia, Proceed-
ings, 2008, pp. 1642–1645.
[17] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett,
“Darpa timit acoustic-phonetic continous speech corpus cd-rom.
nist speech disc 1-1.1,” NASA STI/Recon Technical Report N,
vol. 93, 1993.
[18] I. Luengo, E. Navas, I. Odriozola, I. Saratxaga, I. Hernaez, I. Sainz, and D. Erro, “Modified LTSE-VAD algorithm for applications requiring reduced silence frame misclassification,” in LREC 2010, Seventh International Conference on Language Resources and Evaluation, May 17-23, Valletta, Malta, Proceedings, 2010, pp. 1539–1544.
[19] P. Setiawan, S. Schandl, H. Taddei, H. Wan, J. Dai, L. B. Zhang, D. Zhang, J. Zhang, and E. Shlomot, “On the ITU-T G.729.1 silence compression scheme,” in EUSIPCO 2008 – 16th European Signal Processing Conference, August 25-28, Lausanne, Switzerland, Proceedings, 2008, pp. 1–5.
[20] ETSI, “Speech processing, transmission and quality aspects (STQ); distributed speech recognition; front-end feature extraction algorithm; compression algorithms,” ETSI Standard ES 201 108, European Telecommunications Standards Institute, 2002.
[21] I. Odriozola, I. Hernaez, and E. Navas, “An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods,” Expert Systems with Applications (ESwA), vol. 110, pp. 52–61, 2018.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

On the use of Phone-based Embeddings for Language Recognition


Christian Salamea (1,2), Ricardo de Córdoba (2), Luis Fernando D’Haro (2), Rubén San Segundo (2), Javier Ferreiros (2)
(1) Interaction, Robotics and Automation Research Group, Universidad Politecnica Salesiana, Calle Vieja 12-30 y Elia Liut, Cuenca, Ecuador.
(2) Speech Technology Group, Information and Telecommunications Center, Universidad Politecnica de Madrid, Ciudad Universitaria Av. Complutense, 30, 28040, Madrid
csalamea@ups.edu.ec, cordoba@die.upm.es, lfdharo@die.upm.es, lapiz@die.upm.es,
jfl@die.upm.es

Abstract

Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as entries in a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database and we have used Cavg as the metric to compare the systems. We propose as baseline the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy to incorporate information from the neighbouring phone-grams to define the final sequences contributes to obtain up to 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to 34.1% improvement over the acoustic system alone.

Index Terms: language identification, phonotactic, neural embeddings

1. Introduction

Automatic spoken language identification (LID) is the process of identifying the actual language of a sample of speech using a known set of trained language models [1]. There are currently two main ways of achieving this goal: the first one uses acoustic features extracted from the speech signal, in which the spectral information is used to distinguish between languages, while the second method uses the phonetic sequences obtained using an automatic phonetic recognizer (ASR) as features.

In general, the best results for the LID task are achieved using acoustic-based systems. However, their fusion with phonetic/phonotactic-based systems provides a higher accuracy [2]. This paper focuses on the study of phonotactic techniques, but we will also show that it contributes to an improvement in the overall system thanks to the fusion [3] of both techniques.

Neural Embeddings (NEs) are vector representations [4] of the phonetic units and have been used in speech recognition tasks [5]. These vector representations are extracted from either the hidden layer of a Neural Network (NN) [6] or from the occurrence matrix [7] of phonetic units in an unlabeled corpus. NEs have been successfully used in speech recognition systems at the word level [5], because of their ability to model the probability of one word appearing in a context close to another one; however, their suitability in phonotactic LID systems has not yet been considered, probably due to some difficulties at the phoneme level that do not appear at the word level.

For example, NEs are normally created to reflect the relation at the semantic and syntactic level between words [8], characteristics that do not exist at a phonetic level. Our proposal is to process NEs similarly to the i-Vector framework, as continuous vectors that contain the most representative information about a language in a low dimension [9]. Our expectation is that NEs in a language will be projected into some particular direction, this projection being discriminative in comparison to the other languages.

To overcome the limitations of NEs at the phoneme level, we propose the use of phonetic units that incorporate context information (phone-grams). However, both the high-order phone-grams and the embedding vector size could make the training of the i-Vectors difficult because of the large size of the matrix needed for this implementation. We will show two alternatives for dealing with the feature vectors.

Context information has also been incorporated in those feature vectors, which we have called "Context Neural Embedding Sequences", and which we have used to generate the i-Vectors. These i-Vectors are used as features of a multiclass logistic classifier that obtains the detection cost values (Cavg) for the languages trained.

Finally, all of the systems are fused, obtaining the final global Cavg for all of the languages trained.

This paper is organized as follows. In section 2, we describe the different techniques, as well as the acoustic system used in the fusion with the phonotactic system. In section 3, we present the experiments and the final results. Finally, in section 4, the conclusions and future work are presented.

10.21437/IberSPEECH.2018-12
2. System Description

2.1. Phone-gram definition

Phonotactic systems use context information to improve the performance of LID. In this regard, we propose to use phonetic units that implicitly incorporate context information as features (phone-grams). They can be defined as the grouping of two or more phonemes in a new unit (Figure 1). In this work we have used 2-grams only, because of the scattering observed in higher orders.

Figure 1: Concept of phone-gram.

2.2. Database

The KALAKA-3 database was created for the Albayzin 2012 LRE [10]. It is designed to recognize up to 6 languages in the closed set condition (i.e. Basque, Catalan, English, Galician, Portuguese, and Spanish) using noisy and clean files with an average duration of 120 s, and includes 108 hours in total. This database contains training, development, testing, and evaluation examples distributed as shown in Table 1:

Table 1: KALAKA-3 database.

                    Train   Dev   Test   Eval
Nº Files            4656    458   459    941
Nº of clean files   3060    -     -      -
Nº of noisy files   1596    -     -      -
Length<=30s         2855    121   113    267
Length 120s         1801    337   346    674

To compare the systems, we have considered the Cavg metric, which weights the number of the false acceptances and false alarms generated by the recognizer, representing them as a detection cost function [11]. Using this metric, lower values correspond to better systems.

2.3. Phoneme Recognizers

The phoneme sequences of the utterances used to obtain the "phone-grams" have been generated from a phoneme recognizer. In our system, it is based on the system designed by the Brno University of Technology (BUT) [12], which uses monophone three-state HMMs. There are 3 sets of HMMs (for Hungarian, Russian, and Czech) with 61, 46, and 52 different phonemes respectively.

2.4. Phonetic Vector Representation

In our approach, we have called SG-Emb the vector representations extracted from the hidden layer of an NN, and Gl-Emb the vector representations extracted from the co-occurrence matrix obtained from the global corpus. In the first case, the objective is to predict the phonetic unit that is going to appear next according to the context in which the unit is included. In the second case, the objective is to normalize the counts and then smooth them, trying to obtain a vector representation with homogeneous values.

The model definition normally used to train both is focused at the word level [13], but we work at the phone level. The objective is to find the co-occurrence of phonemes and phoneme sequences that tend to appear in similar contexts for a specific language. Hence, we expect to improve the results compared with the system based on uniphone sequences. Our study focuses on phone-grams, and their use in the continuous space has been called Phone-based Embeddings (Ph-Emb).

2.5. Skip-Gram Model

In relation to the SK-Emb obtained from an NN, they are obtained with the following procedure: we consider an NN with one input layer, one hidden layer and one output layer. In the input layer we have the phone-gram with 1-of-N coding, in the hidden (state) layer we obtain the vector representation of the phone-gram, and the SK-Emb are generated by applying a modelling technique on the vector representation of the input phone-gram together with its context. These SK-Emb will be the feature vectors that we will use in our LID system. From the several models that have been proposed for that purpose, two of them are the most used: Skip-Gram [14] and C-BoW [15]. We have selected Skip-Gram as we obtained better results in initial experiments.

The Skip-Gram model is a classic NN, where the activation functions are removed and hierarchical Soft-max [16] is used instead of soft-max normalization. The training objective of the Skip-Gram model is to predict the context of the input phone-gram in the same sentence [17].

2.6. GloVe Model

On the other hand, the Gl-Emb from the co-occurrence matrix have been obtained as follows: the matrix contains the counts of a phonetic unit appearing close to a possible context. Rows are the phonetic units of the vocabulary while columns are the possible contexts of those phonetic units. The least squares model is used in the training process (GloVe model) [7].

GloVe models capture the statistics of the global corpus to learn the vector representations of words. It is normally used at a word level, but in our case, we are going to evaluate them at a phonetic level using phone-grams. GloVe models are similar to Skip-Gram models except for the context windows. GloVe uses the co-occurrence of elements, capturing the global statistics of the corpus in a matrix and analyzing the entire possible co-occurrence probability rates using test phonetic units. This way, it is possible to distinguish the relevant and non-relevant phonemes in a sentence [7].

2.7. i-Vector system based on Phone-based Embeddings applied to LID

We propose to use these Phone-based Embeddings as feature vectors for the input of a LID system based on the i-Vector framework [18]. In Figure 2, we present the global system architecture. Our system has two components. The first one, called "Front-End", is where the acoustic signal processing is carried out followed by the phoneme recognizer
[12] that produces the phonetic sequences corresponding to the utterances.

Figure 2: Global System Architecture. (Block diagram: Front-End — phoneme recognizer and phone-gram creation; Back-End — Phone-based Embedding generation, UBM and T-matrix training, i-Vector extraction and multi-class logistic classifier, for both the training and evaluation phases.)

The second component of the system is the "Back-End". Firstly, we obtain the phone-gram sequences from the phoneme sequences of each language. The sequences obtained have been used to train the Phone-based Embeddings. To model the Phone-based Embeddings we have used both alternatives described above, Skip-Gram and GloVe modeling. After that, we have replaced every phone-gram by its respective Phone-based Embedding to use it as input feature vector to the i-Vector system. All these vectors are used to train the T matrix and the UBM model needed to obtain the i-Vectors. Finally, we have used these i-Vectors as features to train a multiclass logistic classifier to define the detected language [19], [20].

As we have one different Phone-based Embedding for each language to be recognized, we considered two alternatives to manage the set of vectors for each phone-gram. We have called them: "Single Vector Embedding" and "Multiple Vector Embeddings".

2.8. Single Vector Embedding (SVE)

We organize the phone-gram sequence in a column, replacing each phone-gram by its corresponding Phone-based Embedding of a specific language, and repeat this process for all the other languages. So, the first column will contain the Phone-based Embedding sequences trained with data from language 1, the second one will contain the Phone-based Embedding sequences trained with data from language 2, and so on. Finally, we obtain a matrix that includes the Phone-based Embeddings trained with all the languages to be recognized (Figure 3).

Figure 3: Single Vector Embedding.

2.9. Multiple Vector Embedding (MVE)

As SVE could easily have a problem of excessive dimensionality, we considered a system where we use the Phone-based Embeddings trained for each language individually (Language Phone-based Embeddings) to obtain different i-Vectors for each language. Our proposal is to fuse the scores provided by the individual language-dependent systems, expecting a better performance and a lower computational cost.

2.10. Acoustic System using MFCCs

We have fused the scores of the proposed techniques with the scores obtained from an acoustic system to check if they provide complementary information. The acoustic system has been generated as follows: from each speech utterance, 12 MFCC coefficients including C0 [21] are extracted for each frame. The silence and noise segments of the acoustic signal have been removed using a Voice Activity Detector. To reduce the noise perturbation, a RASTA filter has been used together with cepstral mean and variance normalization (CMVN). We have a feature vector of dimension 56, generated from the concatenation of the SDC parameters using the 7-1-3-7 configuration. Feature vectors are used to train the total variability matrix, from which the i-vectors of dimension 400 with 512 Gaussians are extracted (optimal configuration).

3. Results

We have to define the optimal Phone-based Embedding training parameters for 2-grams: vector size, window size, number of training iterations and negative sampling factor. Negative sampling is an optimization method used to improve the robustness of the NEs applying logistic regression. It reduces the computational complexity and increases the efficiency of the vector estimation.

The window size corresponds to the number of phonetic phone-gram units considered to the left and to the right of the current phonetic unit, and it is considered as contextual information. The vector size is the vector embedding size. In all cases, the results in the tables represent the fusion of the three phonetic recognizers.

3.1. Single Vector Embedding (SVE)

As we described in Section 2.8, we have generated a sequence of phone-grams for each language. We have tested several options for the feature vector size, obtaining an optimum for size 40, the final vector being 240, considering the 6 languages to be recognized. In relation to the number of Gaussians in the i-Vector system, we have obtained an optimum for 512. The best result was 24.69% of Cavg.

3.2. MVE using the Skip-Gram model

When we considered a single Phone-based Embedding feature vector for each language as in SVE, we obtained 29.15% of Cavg using only the model for the Basque language. After fusing all the languages we obtained 19.73% of Cavg, which is a relative improvement of 20.1% over SVE. So, we decided to use MVE in all remaining experiments.

3.3. Inclusion of Context in the vector embeddings

Based on the hypothesis that the vector representations obtained from NNs have some linguistic regularities such as additive composition [13], we propose to use contextual information, including information from the neighboring phone-grams in the final Phone-based Embedding.

Our proposal is to use a weighted sum of neighboring phone-grams. We considered two options for that; the first one
uses a three phone-gram context window and the second one a five phone-gram context window, with the following weights:

A) 3 phone-grams context: Final Ph-Emb = Left Ph-Emb * 0.25 + Central Ph-Emb * 0.50 + Right Ph-Emb * 0.25
B) 5 phone-grams context: Final Ph-Emb = Second Left Ph-Emb * 0.10 + Left Ph-Emb * 0.15 + Central Ph-Emb * 0.50 + Right Ph-Emb * 0.15 + Second Right Ph-Emb * 0.10

The objective is to assign more weight to the current phone-gram while taking into account information from the neighboring units. Using option A (Cavg: 24.38) we obtained a 14.4% relative improvement over option B (Cavg: 28.49), using the model for the Basque language and the Skip-Gram technique as reference. The optimum vector size for SG-Emb is 80, with 512 Gaussians in the i-Vector system, 10 iterations and a window size of 8. All the optimization has been carried out using the development data set. After fusing all the languages, we obtained 18.70% of Cavg with MVE, which is a relative improvement of 24.3% over SVE.

3.4. GloVe model for the MVE

We have also evaluated our approach using the GloVe model (Section 2.6) instead of the Skip-Gram model for our best system with contextual information, because it incorporates information of the co-occurrence of phone-grams in all the training data set. The optimal configuration parameters are: vector size of 80, window size of 4, and 30 iterations. The optimum number of Gaussians for the i-Vector system has been 512 (the same as Skip-Gram). Fusing all the languages as before, we obtain 16.70% of Cavg, which is a 10.7% relative improvement over MVE based on the Skip-Gram model.

3.5. Summary of results

In Table 2 we present the summary of results obtained with the techniques proposed in this paper. As we can see, the final system using the GloVe model provides the best results.

Table 2: Summary of results.

System                       Cavg    Improvement %
SVE                          24.69
MVE context and Skip-Gram    18.70   24.3
MVE context and GloVe        16.70   32.4

3.6. Fusion with the acoustic model

The objective of this technique is to improve an existing LID system, which is based on acoustic information (section 2.10). So, we present the results of fusing the existing acoustic LID system with our two best systems, based on Phone-based Embeddings obtained with the Skip-Gram and GloVe models (Table 3).

Table 3: Phone-based Embeddings systems fused with an acoustic system.

System               Cavg    Improvement %
Acoustic system      7.60
Fusion with SG-Emb   5.40    28.9
Fusion with Gl-Emb   5.01    34.1

4. Conclusions

We have demonstrated that the use of Phone-based Embeddings as feature vectors provides improvements in an LID task. We have used as a baseline a first system that uses Phone-based Embeddings as feature vectors with rather poor results. However, using the new approaches proposed in this paper, results improved, and the fusion of our best configuration with our acoustic system provides significant improvements.

Our baseline system uses the "SVE" technique to obtain 24.69% of Cavg. Considering this poor result, we decided to change the approach and use an individual matrix for each language, fusing the scores from all individual systems at the back-end, and we obtained 19.73% of Cavg using the Skip-Gram modelling.

Then, we proposed the inclusion of context information in the Phone-based Embeddings, including the two or four nearest neighbours. After fusing all the language models we obtained 18.7% of Cavg using the Skip-Gram modelling. Finally, using the GloVe modelling we obtain 16.7% of Cavg, with a 10.7% relative improvement over Skip-Gram modelling and a 32.4% relative improvement compared to the baseline system.

Also, the fusion with the acoustic-based system provides a 34.1% relative improvement, which demonstrates that both systems provide complementary information for the LID task.

As future research lines, we propose to study the effects of higher order units using a larger database. We will also evaluate other types of language models for the neural embeddings. Also, we expect to use models with a high number of layers (char-RNN) and use their combination with convolutional DNNs to get better local context characteristics.

5. Acknowledgements

The work leading to these results has been supported by the AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R) projects. The authors also thank Mark Hallet for the English revision of this paper and all the other members of the Speech Technology Group for the continuous and fruitful discussion on these topics. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

6. References

[1] Y. Muthusamy, E. Barnard and A. Cole, “Reviewing automatic language identification,” in Signal Processing Magazine, IEEE, 1994, pp. 33–41.
[2] L. D’Haro, R. Cordoba, C. Salamea and J. Echeverry, “Extended phone-likelihood ratio features and acoustic-based i-vectors for language recognition,” in Proceedings in Acoustics, Speech and Signal Processing, ICASSP, 2014, pp. 5342–5346.
[3] N. Brummer and D. Van Leeuwen, “On calibration of language recognition scores,” in Speaker and Language Recognition Workshop, IEEE Odyssey 2006, pp. 1–8.
[4] J. Turian, L. Ratinov, and Y. Bengio, “Word Representations: A simple and general method for semi-supervised learning,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 384–394.
[5] S. Bengio and G. Heigold, “Word Embeddings for Speech Recognition,” in INTERSPEECH 2014 – 15th Annual Conference of the International Speech Communication Association, September 14-18, Singapore, Proceedings, 2014, pp. 1053–1057.

58
[6] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean,
“Distributed representations of words and phrases and their
compositionality,” in Advances in neural information processing
systems, 2013, pp. 3111–3119.
[7] J. Pennington, R. Socher, and C. Manning, “Glove: Global
vectors for word representations,” in Proceedings of the
conference on empirical methods in natural language
processing, 2014, pp. 1532–1543.
[8] P. Wang, B. Xu, J. Xu, G. Tian, C. Liu, and H. Hao, “Semantic
expansion using word embedding clustering and convolutional
neural network for improving short text classification,” , 2016,
pp. 806–814.
[9] L. D’Haro, R. Cordoba, M. Caraballo and J. Pardo, “Low-
resource language recognition using a fusion of phoneme
posteriorgram counts, acoustic and glottal-based i-vectors,” in
Proceedings in Acoustics, Speech and Signal Processing
ICASSP, 2013, pp. 6852–6856.
[10] L. Rodriguez-Fuentes, M. Penagarikano, A. Varona, M. Diez
and G. Bordel, “KALAKA-3: a database for the assessment of
spoken language recognition technology on YouTube audios,” in
Language Resources and Evaluation, 2016, pp. 221–243.
[11] A. Martin and C. Greenberg, “The 2009 NIST Language
Recognition Evaluation,” in Speaker and Language Recognition
Workshop, IEEE Odyssey 2010, pp. 165–171.
[12] P. Ace, P. Schwarz and V. Ace, “Phoneme recognition based on
long temporal context,” PhD. Thesis, Brno University of
Technology, Faculty of Information Technology, 2009.
[13] T. Mikolov, K. Cheng, G. Corrado and J. Dean, “Efficient
estimation of word representation in vector space,” in
Proceedings of Workshop at ICLR, pp. 1–12.
[14] D. Guthrie, B. Allison, W. Liu, L. Guthrie and Y. Wilks, “A
closer look at Skip-Gram modelling,” in Proceedings of the 5th
International Conference on Language Resources and
Evaluation, 2006, pp. 1–4.
[15] Y. Yuang, L. He, L. Peng and Z. Huang, “A new study based on
word2vec and cluster for document categorization,” in Journal
of Computational Information Systems., 2014, pp. 9301–9308
[16] F. Morin and Y. Bengio, “Hierarchical probabilistic neural
network language model,” in Aistats, 2005, pp. 246–252.
[17] S. Yujing, X. Yeming, X. Ji, P. JieLin and Y. Yonghong,
“Recurrent neural network language model with vector-space
word representations,” in the 21th International Congress on
Sound and Vibration, 2014.
[18] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel and P. Ouellet,
“Front-end factor analysis for speaker verification. in Audio,
Speech and Language Processing, 2011, pp. 788–798.
[19] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M.
Kara, D. Van Leeuwen, P. Matejka, P. Schwarz and A.
Strasheim, “Fusion of heterogeneous speaker recognition
systems in the STBU submission for the NIST speaker
recognition evaluation” in Audio, Speech and Language
Processing., 2007, pp. 2072–2084.
[20] M. BenZeghiba, J. Gauvain and L. Lamel, “Language score
calibration using adapted Gaussian back-end,” in
INTERSPEECH 2009 -- 10th Annual Conference of the
International Speech Communication Association, 2019, pp.
2191–2194.
[21] S. Davis and P. Mermelstein, “Comparison of parametric
representations for monosyllabic word recognition in
continuously spoken sentences,” in Acoustics, Speech and Signal
Processing. 1980, pp. 357–366.

59
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

End-to-End Speech Translation with the Transformer


Laura Cros Vila, Carlos Escolano, José A. R. Fonollosa, Marta R. Costa-jussà

TALP Research Center, Universitat Politècnica de Catalunya, Barcelona


lcrosvila@gmail.com,{carlos.escolano, jose.fonollosa, marta.ruiz}@upc.edu

Abstract on the experimental part including databases, training parame-


ters and results. Finally, 4 reports the final conclusions.
Speech Translation has been traditionally addressed with the
concatenation of two tasks: Speech Recognition and Machine
Translation. This approach has the main drawback that er- 2. Transformer
rors are concatenated. Recently, neural approaches to Speech The goal of this work is to build an End-to-End Speech Trans-
Recognition and Machine Translation have made possible fac- lation (ST) based on the Transformer [8]. This end-to-end
ing the task by means of an End-to-End Speech Translation ar- task will be contrasted with the concatenation of the Automatic
chitecture. Speech Recognition (ASR) and Machine Translation (MT) sys-
In this paper, we propose to use the architecture of the tems, which are also built using the Transformer architecture.
Transformer which is based solely on attention-based mech- This section briefly describes this architecture.
anisms to address the End-to-End Speech Translation system. As many neural sequence transduction models, the Trans-
As a contrastive architecture, we use the same Transformer to former has an encoder-decoder structure. The main difference
built the Speech Recognition and Machine Translation systems between it and any other models is that Transformer is entirely
to perform Speech Translation through concatenation of sys- based on attention mechanisms [9] and point-wise, fully con-
tems. nected layers for both the encoder and the decoder. This makes
Results on a Spanish-to-English standard task show that the it computationally cheaper than other architectures with simi-
end-to-end architecture is able to outperform the concatenated lar test scores. The whole architecture of the Transformer is
systems by half point BLEU. depicted in (Figure 1).
Index Terms: End-to-End Speech Translation, Transformer

1. Introduction
The fields of Machine Translation (MT) and Automatic Speech
Recognition (ASR) share many features, including conceptual
foundations, sustained interest and attention of researchers in
the field, a remarkable progress in the last two decades and the
resulting wide popular use. Both ASR and MT have a long way
to improve and, as a result, do not give perfect results. Speech
Translation (ST) applications are typically created by combin-
ing ASR and MT systems [1, 2].
This pipeline implies that each system has to be trained with
their own dataset (which are required to be large) creating a big
drawback for low resourced languages. In addition, all errors
made by the recognizer go to the MT system and then the MT
system itself adds its own errors. The errors are combined, and
the results are often very poor.
Deep learning architectures have allowed for end-to-end ap-
proaches for both machine translation [3] and speech recogni-
tion [4]. Both systems are based on an architecture of encoder-
decoder with recurrent neural networks and attention mecha- Figure 1: Model architecture of the Transformer.
nisms. This architecture has been successfully extended to end-
to-end speech translation [5].
Recently, there has been a new proposed architecture for 2.1. Input/Output
addressing machine translation [6]. Later, this architecture has
Originally, the input of the Transformer is a sequence of words
also been used for speech recognition1 [7]. In both cases, the
divided in sub-units denominated tokens. Once the text is
Transformer outperforms previous architectures based on recur-
turned into a tokenized version of the words, a matrix of real
rent neural networks. Inspired by these previous works, this
numbers collects the vectors (typically of size dmodel = 512).
paper describes how to adapt this architecture to end-to-end
If taken the raw input sequence as x = (x1 , x2 , . . . , xm ) and
speech translation. The rest of the paper is organised as follows.
the embedded representation as w = (w1 , w2 , . . . , wm ) with
Section 2 briefly describes the architecture of the Transformer
wj ∈ Rf , then each wj is a column vector of the input matrix
to make this paper self-contained. Section 3 reports the details
belonging to the space RV ×f , with V as the number of embed-
1 https://tensorflow.github.io/tensor2tensor/tutorials/asr with dings and f the number of features of each embedding.
transformer.html The decoder generates an output sequence corresponding to the

60 10.21437/IberSPEECH.2018-13
input sentence. participants called family members or close friends. The audio
files of the CALLHOME Corpus are available at 5 LDC96S35.
2.2. Positional encoding The transcripts of these audio files are available at 6 LDC96T17.
The transcript files are in plain-text, tab-delimited format (tdf)
The lack of recurrence and convolution in the model entails that
with UTF-8 character encoding. In order to adapt the transcript
no recurrence nor temporal information is available. A good
files for the Transformer, all the text was turned into capital
way to keep the order of the sentence is adding positional en-
letters as well as a reference number at the beginning of each
coding to the input embeddings at the bottoms of the encoder
sentence was added, consisting of six digits starting from
and decoder stacks.
”000000” to the last sentence and separated with a tabulator
What is used in Transformer is an element-wise vector p =
such as follows:
(p1 , p2 , . . . , pm ), with pj ∈ Rf , which is added to the original
matrix.
000000 HELLO
2.3. Encoder 000001 ALO.
000002 ALO, BUENAS NOCHES. QUIéN ES?
The encoder consists of a stack of N layers, each of them com-
000003 QUé TAL, EH, YO SOY GUILLERMO, CóMO
posed of two sub-layers: a multi-head attention mechanism and
ESTáS?
a fully-connected feed forward net, plus residual connections 2
000004 AH GUILLERMO.
(referenced in Figure 1 as ”Add”) on both stages, followed by a
layer of normalization. ...
The multi-head attention has several parallel attention lay- 003637 OH MY GOD.
ers, or heads, which concatenate attention functions with differ- 003638 MHM. Y NO LE PODı́AN HACER NADA, NO.
ent linearly projected queries, keys and values. 003639 MM.

2.4. Decoder The conversations were recorded as 2-channel mu-law sam-


The decoder resembles the encoder but it is not completely ple data with 8000 samples per second (as captured from the
equal. Although it is also a stack of N layers layers with public telephone network).
sub-layers within them, a masked multi-head attention layer is Table 1 shows the corpus statistics of the text dataset for
added (apart from the common residual connections and layer Spanish-English.
normalization). The singular fact of the decoder is that at each
step the model is auto-regressive, meaning that it uses the pre- L Set S W V
viously generated symbols as additional input when generating Train 138819 1503003 57587
the next ones. Dev1 3979 41271 6251
Spanish
Dev2 3960 40072 5793
3. Experimental work Test 3641 40141 5888
Train 138819 1441090 37817
Our implementation of the Transformer is based on Ten- 3979 40015 5053
sor2Tensor [8], or T2T for short, which is a library of deep Dev1 3979 39977 5059
learning models and datasets actively used and maintained by 3979 39799 5163
researchers and engineers within the Google Brain team and a 3960 39152 4739
community of users. As follows, we report the database used, English
Dev2 3960 39513 4734
the parameters to train the system and the results. 3960 39004 4840
3641 39617 4929
3.1. Database Test 3641 39011 4901
The database used in this experiment is the Fisher Spanish 3641 38578 4875
and Callhome Spanish Corpus. The Fisher Spanish Corpus Table 1: Corpus Statistics. Language (L), number of sentences
provides a set of speech and transcripts developed by the (S),words (W), vocabulary (V).
Linguistic Data Consortium (LDC) which consists of audio
files covering roughly 163 hours of telephone speech from 136
native Caribbean Spanish and non-Caribbean Spanish speakers. For the MT experiment, we used the parallel text from
The speech recordings consist of 819 telephone conversations LDC2014T23 available from LDC7 and in github 8 .
of 10 to 12 minutes in duration. Full orthographic transcripts
of these audio files are available in 3 LDC2010T04. The audio
3.2. Parameters
files are available in 4 LDC2010S01.
The CALLHOME Spanish Corpus consists of 120 unscripted For the experiments, we used three different architectures.
telephone conversations between native speakers of Spanish. These three architectures correspond to the ASR, MT and ST
All calls, which lasted up to 30 minutes, originated in North models.
America and were placed to international locations. Most The main hyperparameters used are detailed in the Table
2 Assuming that the function to be modeled (weights and bias of the 2. For the ASR/ST models the learning rate had a decay each
net) is closer to an identity mapping than to a zero mapping, a good way
5 https://catalog.ldc.upenn.edu/LDC96S35
to optimize the learning process is to add residual connections, which
provides the input without any transformation to the output of the layer. 6 https://catalog.ldc.upenn.edu/LDC96T17
3 https://catalog.ldc.upenn.edu/LDC2010T04 7 https://catalog.ldc.upenn.edu/ldc2014t23
4 https://catalog.ldc.upenn.edu/LDC2010S01 8 https://github.com/joshua-decoder/fisher-callhome-corpus

61
5000 steps and the learning rate warm-up steps set to 80009 . Comparing ASR+MT concatenation and End-to-End
To adapt the speech features in the ASR encoder, we used the Speech Translation, the results show that in terms of BLEU,
conv relu conv from tensor2tensor. As parameters, we used the latter is slightly better than the former gaining 0.5 points of
mel filterbank of 80 coefficients every 10ms with a window of BLEU.
25 ms. As preprocessing for the ASR inputs, we used the ten- Figure 2 shows an example that when concatenating ASR
sor2tensor options as follows conv1d(inputs, filter size=1536, and MT, the errors are also concatenated. The Spanish target
kernel size=9) + relu + conv1d(inputs, filter size=384, ker- word BAILA (DANCE in English), when recognized with the
nel size=1). The Transformer gets a vector of dimension 384 model ASR ES, is misspelled and transcribed to the word VAYA,
every 10ms. Also for the speech part, a clarification of the in- which has a very similar sound but totally different meaning. As
put and target maximum sequence length is that to have an input a consequence, the final translation output can not reproduce the
maximum sequence length of 1550 means that only examples of word DANCE, which gives a strong meaning of context to the
transcriptions whose audio has less than 1550 frames are used, sentence. In this case, the end-to-end system is able to produce
which implies that with frames of 10 ms the maximum size of a better translation
the input audio frame is approximately 15.5 seconds in length.
On the other hand, to have a target maximum sequence length
of 350 means that the train transcripts are limited to a maximum
4. Conclusions
size of 350 characters. This paper proposes to use of the Transformer as main architec-
The speech models were trained on TPUs [10] following ture for Speech Recognition, Machine Translation and Speech
the suggested parameters for the librispeech task of the ten- Translation. To the best of our knowledge, this is the first time
sor2tensor library [11]. that this promising architecture is used to reproduce an End-to-
End Speech Translation system. BLEU results show that the
3.3. Training End-to-End Speech Translation architecture provides slightly
better results than the standard ASR and MT concatenation. Ex-
When training, as there are several GPUs or TPUs, the param- amples show that these better results are achieved by avoiding
eters are applied to each one. So the effective batch size is the the concatenation of errors.
numbers of GPUs (in this case 4 or 8 in a TPU) multiplied by
In future work, it would be interesting to train a system
the batch size. In each batch the parameters are updated using
capable of doing multi-task learning [15]. This system would
the stochastic gradient descend and the Adam optimizer [12].
build several models and not only the one learning to translate
Both the ASR/ST and MT systems use a character-based to-
from Spanish speech to English text. The new multi-task model
kenization. This implies that the models look for the correlation
would learn in addition Spanish Recognition and/or Spanish-to-
of input and output sentences character by character.
English text translation.
3.4. Results
5. Acknowledgements
The evaluation of each model involving translation was done by
computing the Bilingual Evaluation Understudy score (BLEU) This work is supported in part by the Spanish Ministerio de
[13]. The BLEU score is the most used for the field of MT and Economı́a y Competitividad, the European Regional Develop-
it compares the decoded sentence with the target sentence of the ment Fund and the Agencia Estatal de Investigación, through
test set by looking into the modified n-gram precision. As for the postdoctoral senior grant Ramón y Cajal, the contract
ASR evaluation, a commonly used metric is Word Error Rate TEC2015-69266-P (MINECO/FEDER,EU) and the contract
(WER) [14], which is defined as the ratio of word errors (sub- PCIN-2017-079 (AEI/MINECO).
stitutions, deletions and insertions) to words processed. For the
evaluation of ASR systems, punctuation marks were not taken 6. References
into account.
[1] A. Waibel and C. Fugen, “Spoken language translation,” IEEE
Signal Processing Magazine, vol. 25, no. 3, pp. 70–79, May 2008.
[2] M. Dureja and S. Gautam, “Article: Speech-to-speech translation:
A review,” International Journal of Computer Applications, vol.
129, no. 13, pp. 28–30, November 2015, published by Foundation
of Computer Science (FCS), NY, USA.
[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural ma-
chine translation by jointly learning to align and trans-
late,” CoRR, vol. abs/1409.0473, 2014. [Online]. Available:
http://arxiv.org/abs/1409.0473
[4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and
spell,” CoRR, vol. abs/1508.01211, 2015. [Online]. Available:
http://arxiv.org/abs/1508.01211
[5] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu,
and Z. Chen, “Sequence-to-sequence models can di-
Figure 2: Speech Translation Example. rectly translate foreign speech,” 2017. [Online]. Available:
https://arxiv.org/abs/1703.08581
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
9 To set the learning rate warm-up steps to 8000 means that the first Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
8k steps the learning rate grows linearly and then follows an inverse in Advances in Neural Information Processing Systems, 2017, pp.
square root decay. 6000–6010.

62
Hparam Text-to-Text (GPU) ASR/ST (TPU)
Number of encoder layers 6 6
Number of decoder layers 6 4
Gradient clipping No No
Learning rate 0.2 0.15
Momentum 0.9 0.9
Audio sampling rate - 8000
Batch size 4096 16
Maximum length 256 125550
Input sequence maximum length 0 1550
Target sequence maximum length 0 350
Adam optimizer β1 = 0.9 β2 = 0.997  = 10−9 β1 = 0.9 β2 = 0.997  = 10−9
Attention layers 8 2
Initializer uniform unit scaling uniform unit scaling
Initializer gain 1.0 1.0
Training steps 250000 210000
Table 2: Training parameters.

System Results [10] N. Jouppi, “Google supercharges machine learning tasks with tpu
ASR ES (TPU) 38.02 (WER) custom chip,” Google Blog, May, vol. 18, 2016.
MT (GPU) 55.05 (BLEU) [11] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez,
ASR + MT (TPU/GPU) 19.97 (BLEU) S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar,
ASR EN (TPU) 20.47 (BLEU) R. Sepassi, N. Shazeer, and J. Uszkoreit, “Tensor2tensor for
Table 3: Results of the model evaluation. ASR ES stands for the neural machine translation,” CoRR, vol. abs/1803.07416, 2018.
speech recognition with Spanish transcriptions as target. ASR [Online]. Available: http://arxiv.org/abs/1803.07416
EN stands for the speech recognition with English transcrip- [12] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
tions as target. mization,” arXiv preprint arXiv:1412.6980, 2014.
[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method
for automatic evaluation of machine translation,” in Proceedings
of the 40th annual meeting on association for computational lin-
[7] S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-Based Sequence- guistics. Association for Computational Linguistics, 2002, pp.
to-Sequence Speech Recognition with the Transformer in Man- 311–318.
darin Chinese,” ArXiv e-prints, 2018. [14] A. C. Morris, V. Maier, and P. Green, “From wer and ril to mer and
[8] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, wil: improved evaluation measures for connected speech recog-
S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar et al., nition,” in Eighth International Conference on Spoken Language
“Tensor2tensor for neural machine translation,” arXiv preprint Processing, 2004.
arXiv:1803.07416, 2018. [15] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and
[9] R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, and N. Jaitly, “An L. Kaiser, “Multi-task sequence to sequence learning,”
analysis of attention in sequence-to-sequence models,,” in Proc. CoRR, vol. abs/1511.06114, 2015. [Online]. Available:
of Interspeech, 2017. http://arxiv.org/abs/1511.06114

63
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Audio event detection on Google's Audio Set database: Preliminary results using
different types of DNNs
Javier Darna Sequeiros and Doroteo T. Toledano

AuDIas – Audio, Data Intelligence and Speech, Universidad Autónoma de Madrid


javier.darna@estudiante.uam.es, doroteo.torre@uam.es

one of them being about general-purpose audio tagging. The


Abstract database used in this task is a subset of the Freesound database
This paper focuses on the audio event detection problem, in [5], a collaborative database of sounds under a Creative
particular on Google Audio Set, a database published in 2017 Commons license. This subset has 11.1K segments and 41
whose size and breadth are unprecedented for this problem. In different categories taken from the Google Audio Set
order to explore the possibilities of this dataset, several ontology.
classifiers based on different types of deep neural networks Along with these databases and challenges, several models
were designed, implemented and evaluated to check the based on neural networks were developed in order to classify
impact of factors such as the architecture of the network, the the data in such databases, quickly becoming state-of-art over
number of layers and the codification of the data in the previous models mostly based on hidden Markov models
performance of the models. From all the classifiers tested, the (HMM):
LSTM neural network showed the best results with a mean  The paper “Polyphonic sound event detection using
average precision of 0.26652 and a mean recall of 0.30698. multi label deep neural networks” [6] uses a neural
This result is particularly relevant since we use the network for multi label audio event classification
embeddings provided by Google as input to the DNNs, which that obtained 63.8% accuracy, outperforming the
are sequences of at most 10 feature vectors and therefore limit previous state-of-art model based on HMMs.
the sequence modelling capabilities of LSTMs.  The paper “Recurrent Neural Networks for
Polyphonic Sound Event Detection in Real Life
Index Terms: Audio Set, deep neural network, audio event
Recordings” [7] proposes a bi-directional LSTM
recognition, machine learning.
network to classify audio events from a database
with 61 classes from 10 different contexts. This
1. Introduction system reports an average F1-score of 65.5%.
In machine learning, there are several problems that try to
mimic biologic senses, such as recognizing objects in images Nevertheless, these studies focus on somewhat restricted tasks,
(image object recognition), or identifying particular sounds in and the databases used are relatively small, in terms of both
an audio track (audio event recognition). In recent years, there number of samples and number of classes, which seriously
have been great improvements in image object recognition limits the applicability of modern deep learning techniques
thanks to the availability of large databases such as the and the progress made in audio event recognition for general
Imagenet database [1] and similar ones. These have allowed cases.
the organization of competitions and have fostered the For those reasons, Google created a database named
proposal of novel network architectures such as the Alexnet, Google Audio Set consisting of segments of 10 seconds
VGG, Residual Networks, etc. In the case of audio event extracted from YouTube videos, and published it in 2017 [8].
recognition, the lack of availability of large databases has not With over 2 million samples and 527 different classes, this
allowed a similar development until very recently. There have database is significantly larger and wider than any other
been several notable efforts to foster research in this area, database ever created for this problem, thus allowing to train
among which we must mention: and test more versatile models. In particular, such a huge
 The CLEAR audio event recognition and database is very well fitted to the problem of deep learning
classification challenge [2], which compared where very complex models can be trained from huge amounts
algorithms on a database with 12 audio event classes of data. In addition to the database itself, Google trained a
including common sounds from meeting rooms and model on it in order to establish a baseline. This model,
seminars. described as a multilayer perceptron with a single hidden layer
 The urban sound taxonomy [3] dataset, containing of units (hence not a deep neural network) produced a mean
10 classes of urban sounds. average precision (mAP) of 0.314 [9] on the evaluation subset.
This value can be used as a baseline for other models, but
Since 2013, the Detection and Classification of Acoustic unfortunately, Google did not publish more details on this
Scenes and Events (DCASE) community system. As the database was published just one year before the
(http://dcase.community/) has organized several challenges moment this article was written, there has not been great
focused on different acoustic scenes and events detection and improvement from the previously mentioned baseline. Despite
classification problems. Since 2016, there are yearly that, the current state-of-art results have been achieved
challenges and workshops. The 2018 edition [4] received 223 through the use of attention models, reaching mAPs of 0.327
submission entries from 81 teams, which implies a huge with the inclusion of a trainable probability measure for the
collective effort in the area. It proposed five different tasks,

64 10.21437/IberSPEECH.2018-14
samples [10], and 0.360 with the implementation of several one type of audio), all the networks (except one, as will be
levels of attention [11]. discussed later) use binary cross-entropy as their loss function
This paper intends to be a first approach to Google Audio and the binary sigmoid as the activation function for the
Set. Our goal is to train several different architectures of output layer.
neural networks with this database and compare their
evaluation results with each other and with Google's baseline.
The rest of the paper is organized as follows: Section 2 will 3.1. LSTM.
discuss in more detail the Google Audio Set database. Section
Since the data has the form of a time series of 10 feature
3 will define the neural networks that were used to face the
vectors (embeddings), each one representing one second of
audio event detection problem, section 4 will describe the tests
audio, a Long Short-Term Memory (LSTM) recurrent neural
that were made on the neural networks created, section 5 will
network was immediately considered as an appropriate
interpret the results from those tests, and finally section 6 will
network, as they are specialized in this kind of data. The
conclude the work and propose future research lines that arise
proposed LSTM model has the following architecture:
from the results of this paper.
• The inputs are series of 10 128-dimensional feature
vectors (embeddings).
2. The Google Audio Set database • The first hidden layer is a unidirectional LSTM
The Google Audio Set database is available in two different layer with 600 units, which outputs a single vector
formats: when the whole sequence has been processed. This
• Text files describing the video id, start time, stop layer is followed by a dropout layer with a 0.3
time, and labels assigned to each segment. probability.
• Features (embeddings) extracted at 1 Hz for each • The last hidden layer is a fully connected layer with
segment using a DNN trained by Google (the 600 units.
structure of this DNN is similar to the VGG
networks used in image recognition).
3.2. CNN.
In this paper we have used only the latter format so that Convolutional Neural Networks (CNN) have proven to be
minimal preprocessing is required and results are easier to very powerful for processing images. In our case, the input
recreate. However, the official dataset from Google Audio Set sequence of 10 128-dimensional feature vectors can be
webpage was not used because it was developed to be used considered as a 10x128 image or matrix, and therefore a CNN
directly on Tensorflow (also developed by Google) and it was could be appropriate in this case as well. The 128-dimensional
very difficult and inefficient to use in other toolkit such as input vectors are themselves produced as the output of a
Keras. Instead, we finally used an “unofficial” conversion to different neural network. This neural network uses a PCA
.h5df format available from the Google+ user group "audioset- transformation to create an embedding of the data. Therefore,
users" [12]. This conversion includes, for each segment, the there is no local proximity relationship between the elements
extracted audio features as a uniform 128x10 array and the of the vector. However, there is a temporal proximity for each
presence or absence of each possible label as a boolean vector. individual feature, which translates into a local proximity
The Google Audio Set database consist of 3 subsets (in between the rows of the matrix. A convolutional neural
any of the available formats): network with the following architecture was designed to take
• Balanced training, which has a balanced distribution advantage of this temporal proximity:
of the classes but contains only a small fraction of • The input is the previously mentioned 10x128
the samples (about 22K segments). matrix.
• Unbalanced training, which contains all of the • The first hidden layer is a convolutional layer with
samples (2.0M segments) but suffers from a greatly 16 filters and a kernel with dimension 3x1. Since
unbalanced distribution of the classes. there is not proximity relationship in the input
• Evaluation, which contains about 20K segments. vectors (columns of the input matrix) the second
dimension of the kernel is always restricted to 1 in
In all the experiments of this paper the neural networks were our tests.
trained with one of the training sets and then tested with the • The second hidden layer is a maximum pooling
evaluation set in order to obtain the final results. layer with a 2x2-dimensional window followed by a
dropout layer with a 0.3 probability.
3. Proposed neural networks • The last hidden layer is a fully connected layer with
600 units.
All the models used in this paper to perform the test are neural
networks based on one of three different architectures. Despite
the different architecture, all the models share the following 3.3. MLP
properties:
• They use Adam as their optimizer. Multi Layer Perceptrons (MLPs) are amongst the most
• Every unit that does not belong to the output layer standard and versatile neural networks. In fact, MLPs can
uses ReLU as its activation function. approximate any input-output multidimensional output,
• The output layer is fully connected and has 527 including those produced by other network architectures, so
units. they can be used as a reference model. The main advantage of
other architectures over MLPs is that MLPs include a huge
Since the audio event classification is a multi-labelled problem amount of weights, which can make training more difficult
(i.e. the same segment can, and typically, contain more than and more prone to overfitting. Given their property of

65
universal function approximation, they can be used to test the Table 1: Classification of the models ordered by their
impact of other factors apart from the type of neural network performance (in increasing mAP or mR order). 1h, 2h,
they are based on. For this last reason, several models based 3h indicates the number of hidden layers. bal. unbal.
on MLPs were developed and tested: Indicates the training set (balanced or unbalanced
• Two MLPs with one hidden layer. training set). bip. bin. Indicates the codification of the
• A MLP with two hidden layers. targets (bipolar or binary) and correspondingly the
• A MLP with three hidden layers. activation function of the output layer (tanh or
sigmoid).
All these models share these properties:
• The input is a 1280-dimensional vector (the Model, training set, codification mAP mR
flattened version of the input matrix used for MLP 1 h. l., bal., bip. 0.13704 0.15848
CNNs).
• The hidden layers have 1500 units each. After each MLP 1 h. l., unbal., bip. 0.19696 0.22697
one of them, there is a dropout layer with a 0.3
MLP 3 h. l., unbal., bin. 0.20686 0.23529
probability.
MLP 2 h. l., unbal., bin. 0.21203 0.24079
One of the MLPs with one hidden layer has the following
MLP 1 h. l., unbal., bin. 0.21342 0.24166
particular properties:
• The hyperbolic tangent (tanh) is used as the MLP 1 h. l., bal., bin. 0.21893 0.24249
activation function of the output layer.
• In this case, the Mean Squared Error (MSE) is used CNN, bal., bin. 0.22830 0.25595
as the networks' loss function. MLP 2 h. l., bal., bin. 0.24422 0.27542
MLP 3 h. l., bal., bin. 0.25276 0.28706
4. Test Description LSTM, bal., bin. 0.26652 0.30698
In order to compare the performance of the different models,
we evaluated them on the evaluation subset of Google Audio The first point to note is that the ranking of the different
Set. Every model was trained with the balanced training test of neural networks is the same whether the models are ordered
Audio Set. In the case of the MLP with the bipolar sigmoid by their mAP or their mR. Because of that, the metric
activation function, the target vectors were preprocessed so considered is irrelevant when interpreting the results.
that they have a bipolar codification (the absence of a class is The models using bipolar data obtain the worst results.
represented with the value -1 instead of 0), allowing us to test One possible reason could be that hidden layers use an
the effect of the codification of the data in performance. activation function unable to take negative values. However,
All the models were trained with a minibatch size of by comparing both models using bipolar data, we can notice
128. The training had a maximum duration of 50 epochs, that the one using the unbalanced training set has a much
however, early stopping was used in order to interrupt the better performance than the one using the balanced training
training process when the mAP no longer increases for three set. This is surprising because the models using binary data
epochs, thus preventing overfitting. No early stopping was show the exact opposite behavior. In these models the use of
used on the LSTM model as its mAP grew at a notably the unbalanced training set has a negative impact on their
irregular rate and early stopping kept interrupting the training performance, the effect becoming more intense the more
process before the model could reach its stability phase. This layers the network has.
phenomenon didn’t happen with the rest of the models. The MLPs using binary data and the balanced training
As the main focus of these tests is to compare the set have a better performance the more hidden layers they
different network architectures, hyper-parameters were left at have, which is the expected behavior when there is enough
their default values (learning rate: 0.001, beta1: 0.9, beta2: training data, as it seems to be the case.
0.999, decay: 0). The CNN's results were quite limited, falling behind the
After training, the networks were tested with the MLPs with more than one hidden layer. This is probably
evaluation subset of the Google Audio Set, and the final because of the lack of local meaning of the different features
results were obtained by calculating the mean Average included in the 128-dimensional feature vectors (embeddings),
Precision (mAP) and the mean Recall (mR). which limits the kernel to a single dimension.
In addition, another test was performed on all the models Finally, the model with the best results is the LSTM
based on MLPs where they were trained with the unbalanced network, with a mAP of 0.26652 and a mR of 0.30698. This
training set (much larger in terms of samples, but much more result is interesting because, even knowing that LSTM neural
unbalanced too) instead of the balanced one. networks are particularly effective with time series, in our case
these time series are very short, with 10 elements, which could
have limited the performance of this model.
5. Results
After performing the tests described above, results presented
in Table 1 were obtained:

66
6. Conclusions and future work 8. References
After testing the performance of several deep neural networks, [1] KRIZHEVSKY, Alex; SUTSKEVER, Ilya; HINTON,
we were able to obtain a mAP of 0.26652 with a simple LSTM Geoffrey E. Imagenet classification with deep
network. Despite these results being worse than the 0.314 convolutional neural networks. In Advances in neural
mAP of the baseline established by Google, they allow us to information processing systems. 2012. p. 1097-1105.
draw some conclusions about creating models for Google [2] TEMKO, Andrey, et al. CLEAR evaluation of acoustic
Audio Set. event detection and classification systems. In
First of all, we can conclude that LSTM networks are International Evaluation Workshop on Classification of
the most appropriate architecture for this problem from all of Events, Activities and Relationships. Springer, Berlin,
those which were tested, as a relatively simple network with Heidelberg, 2006. p. 311-322.
one LSTM layer and a fully connected layer offered better [3] SALAMON, Justin; JACOBY, Christopher; BELLO,
results than a more complex network with three fully Juan Pablo. A dataset and taxonomy for urban sound
connected layers, therefore recurrent neural networks should research. In Proceedings of the 22nd ACM international
be a good starting point if a better performance is looked for, conference on Multimedia. ACM, 2014. p. 1041-1044.
for example by adding more layers to the model or [4] IEEE AASP 2018 Challenge on Detection and
implementing more complex architectures. Classification of acoustic Scenes and Events,
The use of the balanced training subset seems to http://dcase.community/challenge2018/index, (accessed:
improve the performance of the models despite being less than 05/10/2018).
1/20 of the dataset. However, the unbalanced training subset [5] Freesound database, https://freesound.org, (accessed:
was only used on MLPs due to time restrictions. Its effects on 05/10/2018).
the other architectures should be studied in the future. [6] CAKIR, Emre, et al. Polyphonic sound event detection
Transforming the target vectors to a bipolar codification using multi label deep neural networks. In Neural
decreases the performance of the models; however there Networks (IJCNN), 2015 International Joint Conference
seems to be a positive correlation between this codification on. IEEE, 2015. p. 1-7.
and the use of the unbalanced training set, which could be [7] PARASCANDOLO, Giambattista; HUTTUNEN, Heikki;
worth researching into. VIRTANEN, Tuomas. Recurrent neural networks for
polyphonic sound event detection in real life recordings.
arXiv preprint arXiv:1604.00861, 2016.
7. Acknowledgements [8] Audio Set, a large scale dataset of manually annotated
audio events, https://research.google.com/audioset/,
This work has been partially supported by project “DSSL” (accessed: 20/05/2018).
(TEC2015-68172-C2-1-P), funded by the Ministry of [9] GEMMEKE, Jort F., et al. Audio set: An ontology and
Economy and Competitivity of Spain and FEDER. human-labeled dataset for audio events. In Acoustics,
Speech and Signal Processing (ICASSP), 2017 IEEE
International Conference on. IEEE, 2017. p. 776-780.
[10] KONG, Qiuqiang, et al. Audio Set classification with
attention model: A probabilistic perspective. arXiv
preprint arXiv:1711.00927, 2017.
[11] YU, Changsong, et al. Multi-level Attention Model for
Weakly Supervised Audio Classification. arXiv preprint
arXiv:1803.02353, 2018.
[12] Audio Set Users Group,
https://groups.google.com/forum/#!forum/audioset-users,
(accessed: 20/05/2018).

67
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Emotion Detection from Speech and Text


Mikel deVelasco, Raquel Justo, Josu Antón, Mikel Carrilero, M. Inés Torres

Universidad del Pais Vasco UPV/EHU


mikel.develasco@ehu.eus, raquel.justo@ehu.eus, josuantonsanz@gmail.com,
mcarrilero001@ikasle.ehu.eus, manes.torres@ehu.eus

Abstract [8]. In early systems dealing with emotion detection in text,


knowledge-based approaches were applied making use of emo-
The main goal of this work is to carry out automatic emo- tion lexicons, such as Sentiwordnet [9]. Other methods, employ
tion detection from speech by using both acoustic and textual machine learning based approaches [10], where statistical clas-
information. For doing that a set of audios were extracted from sifiers are trained using large annotated corpora and the emotion
a TV show were different guests discuss about topics of current detection can be seen as a multi-label classification problem. In
interest. The selected audios were transcribed and annotated this work we propose to use neural networks to solve the regres-
in terms of emotional status using a crowdsourcing platform. sion problem given the 3-dimensional emotional model.
A 3-dimensional model was used to define an specific emo- The main contribution of this work is the appropriate selec-
tional status in order to pick up the nuances in what the speaker tion of a neural network architecture for emotion detection con-
is expressing instead of being restricted to a predefined set of sidering both acoustic signals and their corresponding transcrip-
discrete categories. Different sets of acoustic parameters were tion. Additionally, the proposed architecture has been adapted
considered to obtain the input vectors for a neural network. To to the 3-dimensional VAD (Valence, Arousal and Dominance)
represent each sequence of words, a models based on word em- emotional model [11].
beddings was used. Different deep learning architectures were Section 2 describes the two proposed approaches for the au-
tested providing promising results, although having a corpus of tomatic emotion recognition from used features to the common
a limited size. network architectures basics. Experiments carried out are fully
Index Terms: Emotion Detection, Speech, Text transcriptions described in Section 3. Section 3.1 aims to explain the diffi-
culties to find out a Spanish corpus and it has led us to create
1. Introduction our own corpus. Section 3.2 mentions the used baselines meth-
ods and which measure has been used for testing. Section 3.3
The emotion recognition is the process of identifying human shows the experiments carried out under the regression prob-
emotions, a task that is automatically carried out by humans lem of emotional status with acoustic features whereas Section
considering facial and verbal expressions, body language, etc. 3.4 deals with the experiments achieved at the regression prob-
However, this is a challenging task for an automatic system. In lem of emotional status with language features. Finally some
recent years, the great amount of multimedia information avail- concluding remarks are reported in Section 4.
able due to the extensive use of the Internet and social media,
along with new computational methodologies related to ma-
chine learning, have led to the scientific community to put a
2. Emotion Detection from Speech and
great effort in this area [1, 2]. Language
Emotion recognition from speech signals relies on a num- Emotion detection from speech is based on the extraction of rel-
ber of short-term features such as pitch, vocal tract features such evant features from the acoustic signal, that can be seen as hints
as formants, prosodic features such as pitch loudness, speak- of the emotional status. That is, a numerical vector that rep-
ing rate, etc. Surveys on databases, classifiers, features and resent the specific information related to emotional status and
classes to be defined in the analysis of emotional speech can embedded in an acoustic signal is needed. There are numerous
also be found in [3]. Regarding methodology, statistical anal- acoustic parameters that can be obtained using the free soft-
ysis of feature distributions has been traditionally carried out. ware Praat1 or the free python library pyAudioAnalysis2 . Con-
Classical classifiers such as the Bayesian or Super Vector Ma- sidering these tools the following set of 72 parameters could
chines (SVM) have been proposed for emotion features from be considered: Pitch, Zero Crossing Rate (ZCR), Energy, En-
speech. The model of continuous affective dimensions is also an tropy of the energy, Spectral Centroid, Spectral Spread, Spec-
emerging challenge when dealing with continuous rating emo- tral Entropy, Spectral flux, Spectral Rolloff, Chroma vector (12
tion labelled during real interaction [1, 4]. In this work, recur- coefficients), Chroma deviation, MFCC coefficients (12), LPC
rent neural networks have been proposed to integrate contextual coefficients (16), Bark features (21). However, some of these
information and then predict emotion in continuous time using parameters provide the same or very similar kind of informa-
a three-dimensional emotional model. tion. For instance, LPC, Bark and MFCC coefficients provide
Speech transcripts have also been demonstrated to be a similar information about the phonemes without considering the
powerful tool to identify emotional states [5]. Over the last vocal tract. Thus, such a big set of parameters is useless since
decade, there has been considerable work in sentiment analy- it will complicate the learning procedure requiring more train-
sis [6]. Moreover, the detection of emotions such as anger, joy, ing data. In this work different subsets of the aforementioned
sadness, fear, surprise, and disgust have also been addressed parameters were explored:
[7]. However, spoken language is informal and provides infor-
mation in an unstructured way so that developing tools to select 1 http://www.fon.hum.uva.nl/praat/

and analyse sentiments, opinions, etc. is still a challenging topic 2 https://pypi.org/project/pyAudioAnalysis/

68 10.21437/IberSPEECH.2018-15
• Set A: Pitch and Energy. is represented as three-dimensional real-valued vector. The di-
mensions of this vector correspond to Valence (corresponding
• Set B: Pitch, Energy and Spectral Centroid.
to the concept of polarity), Arousal (degree of calmness or ex-
• Set C: Pitch, Energy, Spectral Centroid, ZCR and Spec- citement), and Dominance (perceived degree of control over a
tral Spread. situation): the VAD model.
• Set D: Pitch, Energy, Spectral Centroid, ZCR, Spectral In order to solve the regression problem of emotional status
Spread and 12 MFCC coefficients. detection, we propose to use deep learning. When consider-
ing emotion detection from speech Long-Short Term Memory
• Set E: Pitch, Energy, Spectral Centroid, ZCR, Spectral (LSTM) neural networks were tested. The underlying idea is
Spread and 16 LPC coefficients. to be able to learn the relationship among present and past in-
• Set F: Pitch, Energy, Spectral Centroid, ZCR, Spectral formation although existing a big distance among them. That
Spread and 21 Bark features. is, they have memory and they can manage with temporal se-
quences of data like the sequence of vectors extracted from
The first set was selected according to the studies performed an acoustic signal. When regarding emotional status detection
in [12] where the arousal state of the speaker affects the overall from text a classical feedforward network was considered be-
energy and pitch. In addition to time-dependent acoustic fea- cause of simplicity. Such networks have proven to be efficient
tures such as pitch and energy, spectral features were selected for problems in similar tasks, like sentiment analysis [21].
for Sets B and C as a short-time representation for speech signal
[13]. For Sets D, E and F different Cepstral-based features were 3. Experiments
added, proven that they are good for detecting stress in speech
signal [14]. We have carried out two series of experiments for the evalua-
When regarding emotion detection from language the same tion of the regression processes. In the first one we present the
procedure has to be carried out. First of all a vectorial represen- most interesting results related to the emotion detection from
tation of the transcribed text is needed. In this case, we hope speech, and in the second we show the most interesting results
to capture some meaning of the utterance that might help in the on emotion detection from text.
detection of specific emotional status. An appropriate repre-
sentation should consider some semantic information like the 3.1. Corpus
word embeddings word2vec[15], doc2vec [16] or GLOVE [17]. As far as we know there is no Spanish three-dimensional corpus
Word2vec embeddings, the most simple model, are shallow, within the literature, so for the experiments, we have created a
two-layer neural networks that are trained to reconstruct lin- small corpus using the VAD model. The corpus consists of 120
guistic contexts of words. Word2vec takes as its input a large fragments between 3 and 5 seconds taken from the Spanish TV
corpus of text and produces a vector space, typically of several program “La Sexta Noche”. This TV program consist of politi-
hundred dimensions, with each unique word in the corpus be- cal debate, news and events, and discussions commonly appear.
ing assigned a corresponding vector in the space. Word vectors Each fragment has been transcribed manually and tagged using
are positioned in the vector space such that words that share crowdsourcing (the practice of obtaining needed services, ideas,
common contexts in the corpus are located in close proximity or content by soliciting contributions from a large group of peo-
to one another in the space. However, this technique represents ple) techniques. In this case each fragment has been labeled by
each word of the vocabulary by a distinct vector, without pa- 5 different annotators, following the next questionnaire:
rameter sharing. In particular, they ignore the internal structure
of words, which is an important limitation for morphologically 1. In order to address the Valence: “How do you perceive
rich languages. For example, in Spanish, most verbs have more the speaker?”
than forty different inflected forms and this leads to a vocab-
ulary where many word forms occur rarely (or not at all) in • Excited
the training corpus, making it difficult to learn good word rep- • Slightly Excited
resentations. Thus, [15] proposes to learn representations for • Neutral
character n-grams, and to represent words as the sum of the n-
2. In order to address the Arousal: “His mood is . . . ”
gram vectors. The model (known as FastText) can be seen as an
extension of the continuous skip-gram model [18] which takes • Positive (nice / constructive)
into account subword information. • Slightly Positive
Additionally, a way of representing the emotional status is
• Slightly Negative
needed in order to establish a machine learning problem. A
• Negative (unpleasant / non colaborative)
categorical emotion description (e.g. six basic emotions) is
an easy way to procedure but it provides a quite constrained 3. In order to address the Dominance: “How do you per-
model. Affective computing researchers have started exploring ceive the speaker about the situation in which he or she
the dimensional representation of emotion [19] as an alterna- is in?”
tive. Dimensional emotion recognition, aims to improve the
understanding of human affect by modelling affect as a small • More dominant / controlling the situation / . . .
number of continuously valued, continuous time signals. It has • He or she does not dominate the situation neither
the benefit of being able to: (i) encode small changes in affect is he or she cowed.
over time, and (ii) distinguish between many more subtly differ- • More coward / defensive / . . .
ent displays of affect, while remaining within the reach of cur-
rent signal processing and machine learning capabilities [20]. In our work, we represent the problem of dimensional emotion recognition as a regression one, where each emotional status

Once the tags were generated by crowdsourcing, the answers collected from all the annotators were transferred to the three-dimensional model, taking the average of each answer for all fragments, where the first answer of each question was assigned the value 0, the last answer was assigned the value 1, and the rest of the answers a midpoint. Then the corpus was split into two sets: 70% of the fragments were used for training purposes and the remaining 30% for testing.

3.2. Baseline models and evaluation metrics

Both emotion detection problems, from speech and from text, were first tested with Linear Regression (LR) [22] and Support Vector Regression (SVR) [23] (with three different kernels: linear, poly and rbf), in order to compare them with neural networks.
Regarding the input of these baseline models, two different approaches were analysed to handle the time-sequence problem: on the one hand, fitting the models with the full information, treating each feature at each time step independently (full models); on the other hand, computing the mean of each feature over the time steps (mean models).
Regarding the evaluation metric, the Mean Square Error (MSE) was used, because it provides a good interpretation of how far apart the prediction and the true label are. In this problem, the MSE can be described as the mean of the distances between the true-label points and the predicted points on the three-dimensional model.

3.3. Experiments with acoustic features

In order to obtain the acoustic parameters, each audio was divided into individual frames using a context window of 25 milliseconds and a step of 10 milliseconds (as shown in Figure 1), obtaining 300 frames per audio. A vector made up of the selected acoustic features was associated with each frame. Different experiments were carried out using the different feature sets (A, B, C, D, E, F) described in Section 2. Additionally, for each set, experiments were also performed including the first derivatives, and the first and second derivatives.

Figure 1: Schematic diagram of speech production.

The network proposed to address the regression problem of emotional status with acoustic features is a Recurrent Neural Network (RNN). The network is composed of an LSTM layer (the architecture proposed by [24]) of 10 memory cell blocks, which obtains a representation of the audio along time. The subsequent layers are two Dense layers which aim to infer the VAD model from the representation given by the LSTM. The first Dense layer consists of 15 units with a ReLU activation function, while the second and last consists of 3 units with a sigmoid activation function.
The output layer contains a sigmoidal activation function to take advantage of its bounded output, between 0 and 1. The hidden layer, on the other hand, contains the ReLU activation function because it provides good results and avoids the vanishing gradient problem [25]. The layers of the proposed network consist of a small number of units or cells, so as not to build a large network architecture, since we have a limited-sized training corpus.

Table 1: Best result obtained with baseline models and Recurrent Neural Network (RNN) in the regression problem of emotional status with acoustic features. Each pair of set and model has been tested with the acoustic features themselves, with first derivatives, and with first and second derivatives, but only the best performance is shown. The MSE error has been used in order to compare.

Set     LR       SVR (linear)   SVR (poly)   SVR (rbf)   RNN
Set A   0.1682   0.1660         0.1661       0.1691      0.1670
Set B   0.1686   0.1653         0.1685       0.1710      0.1565
Set C   0.1690   0.1679         0.1718       0.1709      0.1576
Set D   0.1742   0.1894         0.1703       0.1710      0.1665
Set E   0.1699   0.2007         0.1842       0.1711      0.1664
Set F   0.1733   0.2256         0.1898       0.1710      0.1413

As shown in Table 1, the proposed network slightly improves the results of the baseline models in almost all the sets in the corpus. It can also be concluded that, by selecting a smaller set of parameters, better results are obtained with the baseline models. However, set A seems to have insufficient information, and Bark features are of great help in the case of the networks, providing the best results.

3.4. Experiments with language features

Regarding the word representation, FastText embeddings from SBWC3 have been used in the experiments. These embeddings are a skipgram model of 300 dimensions and 855,380 different word vectors, trained on the Spanish Billion Word Corpus4 with more than 1.4 billion words.

3 https://github.com/uchile-nlp/spanish-word-embeddings/blob/master/README.md
4 http://crscardellino.me/SBWCE/

The regression problem of emotional status with language features has been addressed with a small Deep Neural Network (DNN). This network consists of three similar layers: the first two layers are composed of 5 units with a sigmoidal activation function, each followed by a Dropout layer with a keep-probability of 0.5 in order to prevent overfitting [26], while the last layer, the output layer, is a Dense layer of 3 units with a sigmoidal activation function (the same as in the network proposed for the regression problem of emotional status with acoustic features).

Table 2: Best result obtained with baseline models and Deep Neural Network (DNN) in the regression problem of emotional status with language features. The MSE error has been used.

Input   LR       SVR (linear)   SVR (poly)   SVR (rbf)   DNN
Mean    0.1906   0.1350         0.1165       0.1197      0.1203
Full    0.1356   0.1292         0.1199       0.1229      0.1196

As shown in Table 2, the proposed network achieves similar results when compared to the baseline models. It is an interesting result given the small size of the training set and the great impact it has when building neural networks. The obtained results suggest that, by increasing the annotated training corpus, neural networks might improve on the baseline models.
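For illustration only, the following Python sketch (ours, not the authors' code) assembles the two models described in Sections 3.3 and 3.4, assuming a tf.keras implementation; the acoustic feature dimension and the optimizer are placeholders, since they are not fixed by the text.

# Hedged sketch of the two regression models described above (tf.keras assumed).
# n_acoustic_feats is a placeholder for the size of the chosen feature set (A-F),
# possibly extended with derivatives; the optimizer is not specified in the paper.
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_acoustic_feats, n_embedding = 300, 32, 300  # 32 is an arbitrary example

# Acoustic model: LSTM(10) -> Dense(15, ReLU) -> Dense(3, sigmoid), trained with MSE.
acoustic_rnn = tf.keras.Sequential([
    layers.LSTM(10, input_shape=(n_frames, n_acoustic_feats)),
    layers.Dense(15, activation="relu"),
    layers.Dense(3, activation="sigmoid"),   # bounded VAD output in [0, 1]
])
acoustic_rnn.compile(optimizer="adam", loss="mse")

# Language model: two Dense(5, sigmoid) + Dropout(0.5) blocks -> Dense(3, sigmoid).
text_dnn = tf.keras.Sequential([
    layers.Dense(5, activation="sigmoid", input_shape=(n_embedding,)),
    layers.Dropout(0.5),
    layers.Dense(5, activation="sigmoid"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="sigmoid"),
])
text_dnn.compile(optimizer="adam", loss="mse")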
4. Conclusions

The main goal of this work was to develop an automatic emotion detection system from speech and language. The system acted over acoustic fragments extracted from a TV show and their corresponding transcriptions. Each fragment was annotated by means of a crowdsourcing platform using a 3-dimensional VAD model. Different neural network architectures were tested, and the obtained results show that RNNs can outperform baseline systems when considering emotion detection from speech. Moreover, using a simple feedforward neural network with a very small training corpus (84 sentences), results similar to those obtained with the baseline models can be achieved.
As further work we propose to obtain a bigger annotated corpus by using crowdsourcing tools to better train the proposed neural networks. Additionally, the two knowledge sources (acoustic and text) might be merged to provide a more accurate emotion detection system.

5. Acknowledgements

This work has been partially funded by the Spanish Government (TIN2014-54288-C4-4-R and TIN2017-85854-C4-3-R), and by the European Commission H2020 SC1-PM15 program under RIA 7 grant 69872.

6. References

[1] J. Irastorz and M. I. Torres, "Analyzing the expression of annoyance during phone calls to complaint services," in 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), 2016, pp. 103–106.
[2] S. E. Eskimez, K. Imade, N. Yang, M. Sturge-Apple, Z. Duan, and W. B. Heinzelman, "Emotion classification: How does an automated system compare to naive human coders?" in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pp. 2274–2278.
[3] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: resources, features, and methods," Speech Communication, pp. 1162–1181, 2006.
[4] A. Mencattini, E. Martinelli, F. Ringeval, B. W. Schuller, and C. D. Natale, "Continuous estimation of emotions in speech by dynamic cooperative speaker models," IEEE Trans. Affective Computing, vol. 8, no. 3, pp. 314–327, 2017. [Online]. Available: https://doi.org/10.1109/TAFFC.2016.2531664
[5] C. Clavel, G. Adda, F. Cailliau, M. Garnier-Rizet, A. Cavet, G. Chapuis, S. Courcinous, C. Danesi, A.-L. Daquo, M. Deldossi et al., "Spontaneous speech and opinion detection: mining call-centre transcripts," Language Resources and Evaluation, vol. 47, no. 4, pp. 1089–1125, 2013.
[6] S. Mohammad, C. Dunne, and B. Dorr, "Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 2009, pp. 599–608.
[7] J. R. Bellegarda, "Emotion analysis using latent affective folding and embedding," in Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. Association for Computational Linguistics, 2010, pp. 1–9.
[8] R. Justo, T. Corcoran, S. M. Lukin, M. Walker, and M. I. Torres, "Extracting relevant knowledge for the detection of sarcasm and nastiness in the social web," Knowledge-Based Systems, vol. 69, pp. 124–133, 2014.
[9] S. Baccianella, A. Esuli, and F. Sebastiani, "SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining," in LREC, vol. 10, 2010, pp. 2200–2204.
[10] S. Volkova and Y. Bachrach, "Inferring perceived demographics from user emotional tone and user-environment emotional contrast," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 1567–1578.
[11] R. A. Calvo and S. Mac Kim, "Emotions in text: dimensional and categorical models," Computational Intelligence, vol. 29, no. 3, pp. 527–543, 2013.
[12] C. E. Williams and K. N. Stevens, "Vocal correlates of emotional states," Speech Evaluation in Psychiatry, pp. 221–240, 1981.
[13] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, no. 4, pp. 603–623, 2003.
[14] S. E. Bou-Ghazale and J. H. Hansen, "A comparative study of traditional and newly proposed features for recognition of speech under stress," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 429–442, 2000.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119.
[16] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 32. JMLR.org, 2014, pp. 1188–1196.
[17] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP. ACL, 2014, pp. 1532–1543.
[18] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[19] H. Gunes and M. Pantic, "Automatic, dimensional and continuous emotion recognition," Int. J. Synth. Emot., vol. 1, no. 1, pp. 68–99, Jan. 2010.
[20] M. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic, "AVEC 2014: 3D dimensional affect and depression recognition challenge," in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, ser. AVEC '14. New York, NY, USA: ACM, 2014, pp. 3–10.
[21] R. Moraes, J. F. Valiati, and W. P. G. Neto, "Document-level sentiment classification: An empirical comparison between SVM and ANN," Expert Systems with Applications, vol. 40, no. 2, pp. 621–633, 2013.
[22] G. A. Seber and A. J. Lee, Linear Regression Analysis. John Wiley & Sons, 2012, vol. 329.
[23] C.-C. Chang, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[24] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[25] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Experimental Framework Design for Sign Language Automatic Recognition


Darío Tilves Santiago,1 Ian Benderitter,2 Carmen García Mateo1
1
AtlanTTic Research Center – University of Vigo
2
Polytech Nantes
dtilves@gts.uvigo.es, ian.benderitter@etu.univ-nantes.fr, carmen.garcia@uvigo.es

Abstract

Automatic sign language recognition (ASLR) is quite a complex task, not only because of the intrinsic difficulty of automatic video information retrieval, but also because almost every sign language (SL) can be considered an under-resourced language when it comes to language technology. Spanish sign language (SSL) is one of those under-resourced languages. Developing technology for SSL implies a number of technical challenges that must be tackled in a structured and sequential manner.
In this paper, the problem of how to design an experimental framework for machine-learning-based ASLR is addressed. In our review of existing datasets, our main conclusion is that there is a need for high-quality data. We therefore propose some guidelines on how to conduct the acquisition and annotation of an SSL dataset. These guidelines were developed after conducting some preliminary ASLR experiments with small and limited subsets of existing datasets.
Index Terms: hearing impaired, dataset recommendations, automatic recognition, convolutional neural networks, ASL

1. Introduction

Sign language (SL) is one of the preferred ways of communication for hearing-impaired people and their families; indeed, it is often the only communication mechanism.
Automatic language recognition is highly developed for written and oral languages, but not for SLs [1]. In recent years, with the emerging interest in both neural networks (NNs) and graphical processing units (GPUs) and ongoing improvements in their efficiency, there have been significant advances in oral and written speech recognition. However, this is not the case for automatic SL recognition (ASLR). This is due to differences in the recognition problems, and also to the lack of available images and videos tailored to the development of ASLR systems using machine-learning techniques.
Therefore, an important drawback to be solved is the lack, both in quantity and quality terms, of SL data for processing by deep learning algorithms. It is possible to find SL datasets, but their purpose is to be used as learning material for SL courses, rather than for automatic image recognition. The main drawback is that there are not enough repetitions of each gesture performed by one signer, and more signers are needed in order to ensure variability.
Many researchers have to find creative solutions to this problem. Some apply data augmentation to increase the volume of their data. Data augmentation [2] transforms the data, in the case of images, by enlarging them, trimming them, changing the angle, altering the background and changing the illumination. Other possibilities are to artificially create hand gestures [3] or to record hand gestures with a device like Leap Motion, which, instead of acquiring images, captures the three-dimensional movement of the hands and each finger [4], [5].
The study of SL is also constrained by its complexity. While there has been some improvement in isolated gesture recognition despite the dataset situation, there has been no advance whatsoever in continuous gesture recognition. The reason is twofold: datasets are available for isolated but not for continuous gestures, and the technology is not advanced enough to identify the latter. Nevertheless, there have been some discussions about how to proceed in continuous gesture recognition, as described in [6], [7], [8] and [9].
In this paper, several datasets are described in terms of their characteristics in Section 2. Section 3 reports details of an initial study of the American SL (ASL) alphabet using transfer learning and provides a set of recommendations on how to create a robust dataset. Finally, some conclusions are drawn in Section 4.

2. Overview of existing datasets

Table 1 lists hand sign datasets (#1-#19) found in the literature, along with relevant information for their use in ASLR. The purposes of these datasets are different: some were designed for automatic recognition problems and others were created for pedagogical reasons. Shown are the dataset name and lexicon, along with the SL, if any. Also provided is information on size and on variety in terms of number of signers, number of repetitions per signer and type of data, with indications as to whether the dataset is labelled or not. The following will explain how Table 1 is organized.
Items #1-4 are datasets which were used in our research as training and testing material for preliminary experiments. The second column indicates the language of each dataset: items #1-11 are ASL, #12 and #13 are German SL (GSL), #14 and #15 are Spanish SL (SSL), #16 is Argentinian SL (ArSL), and #17-19 are hand gestures which do not belong to any SL. The datasets are sorted by their lexicon: from alphabets and numbers to words and sentences. All the datasets are publicly available, except for Grades Online (#14), which requires contact with the creators [10]. All the datasets are monolingual, except for SpreadTheSign (#15), which collects suggestions for signs from different SLs around the world. Some of the datasets, due to their size and/or insufficient labelling, are not completely characterized in Table 1 (indicated NA for 'not available').

72 10.21437/IberSPEECH.2018-16
Table 1: Summary of datasets found in the literature.
# Name Language and Lexicon Size Data Type # Repetitions # Signers Labelled
1 Superpixel [11] ASL: 24 alphabet signs 131688 images RGB 500 5 Yes
2 Fingerspelling [12] ASL: 24 alphabet signs and 31000 images Depth 200 5 Yes
digits 1-9
3 Massey University [13] ASL: 24 alphabet signs and 2524 images RGB 5 5 Yes
numbers 0-9
4 Padova Senz3D [3] ASL: letters B, D, I, S, and digits 2640 images RGB + Depth 30 + 30 4 Yes
2, 3, 4, 5, 9, 10
5 Padova Kinect [14] ASL: letters A, D, I, L, W, Y, and 2800 images RGB + Depth 10 + 10 14 Yes
digits 2, 5, 7
6 HKU Kinect Gesture [15] ASL: letters A, L, Y and digits 1-5 3000 images RGB 60 5 Yes
7 NTU Microsoft Kinect [16] ASL: letters A, L, Y and digits 1-5 2000 images RGB + Depth 10 + 10 10 Yes
8 Cvpr15 [17] ASL: 7 words and digits 1-9 68000 images Depth 500 8 Yes
9 RWTH-50 [18] ASL: 83 words 8844 videos Grey scale 2 3 Yes
10 ASLLVD [19] ASL: words 992 videos RGB 2 5 Yes
11 RWTH-104 [20] ASL: 201 sentences 201 videos Grey scale 1 3 Yes
12 German Spelling [21] GSL:30 alphabet signs and digits 3080 videos Grey scale 2 20 Yes
1-5
13 RWTH PHOENIX [22] GSL: sentences 592383 images RGB 1 NA No
14 Grades Online [10] SSL: 750 words 750 videos RGB 1 2 Yes
15 SpreadTheSign [23] SSL: words and sentences +2000 videos RGB 1 NA Yes
16 LSA64 [24] ArSL: 64 words 3200 videos RGB 5 10 Yes
17 SKIG [25] Hand gestures: 10 2160 videos RGB + Depth 18 + 18 6 Yes
18 Microsoft Gesture-RC [26] Hand gestures: 12 594 videos RGB + Depth NA NA No
19 ChaLearn 2016 [27] Hand gestures 140945 videos RGB + Depth 1+1 NA No

3. ASLR: preliminary experiments

Learning a language is a step-by-step process: we first learn the alphabet and numbers, then basic words (objects, adjectives, etc.), then how to relate words (subjects, verbs, adverbs, etc.) and how to contextualize them, and finally correct grammar and syntax, eventually managing to make coherent conversation. The same learning process applies to automatic speech recognition systems, in stepping up the versatility of the interaction with machines. And that same process ought to be applied to SL recognition systems.
We trained different NNs using some of the listed datasets in order to compare their performance. The lexicon used to train the NNs was composed of 33 isolated ASL signs: the 24 alphabet gestures and the digits from 1-9. Isolated gestures were selected because they are easier to identify, as commented above. We used only one hand image as input feature to identify the isolated sign.
This experiment helped devise a proposed set of guidelines on creating a robust dataset. Also demonstrated was the usefulness of combining both depth and RGB images in datasets, by training with both types of pictures.

3.1. NN architecture

An NN was first selected for the experiment. We used a Convolutional Neural Network (CNN), a deep learning approach to object identification. A CNN is based on relating a set of features to an object definition. It is designed as a set of layers containing neurons that extract features, from specific to general, as the data goes through the layers. In training the CNN, this process is repeated, with the importance given to each feature changing until the CNN is capable of recognizing all the different objects provided as inputs.
We used transfer learning to demonstrate the efficiency of using trained CNNs to recognize hand signs. Transfer learning takes advantage of NNs trained for specific applications, retraining them for another application, meaning that the features extracted for the first problem are used for the new one. This is very useful to reduce the time consumed and the images needed in training, because there is no need to learn all the features from scratch.
We selected the VGG16 CNN pre-trained on the ImageNet database [28]. Its last layer was modified in order to serve as a classifier for the 33-sign ASLR experiment (a minimal sketch of this setup is given below). The input features consist of a 224x224-pixel image of one hand, so, depending on the dataset, hand segmentation and resizing have to be performed.

3.2. Data selection

In order to experiment with recording and segmentation techniques, an in-house dataset was acquired. This dataset is denoted UVIGO in further explanations and results. It consists of recordings of the 33 ASL signs of the lexicon described in Section 3. A Kinect2 sensor was used, acquiring both RGB and depth streams of the upper body. It was recorded by two signers over four days to ensure variability in recording environments (clothing, lighting conditions, etc.). 100 repetitions of each sign were used for training and 10 repetitions for testing.
Of the available datasets described in Section 2, two were selected for NN training and testing: Superpixel (#1), composed of alphabetical RGB images of one hand, and Fingerspelling (#2), containing both alphabetical and numerical depth pictures of one hand. 100 randomly-selected repetitions of each sign were used for training and 10 for testing.
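The following is a rough, non-authoritative sketch of the transfer-learning setup described in Section 3.1, assuming tf.keras and its bundled ImageNet weights for VGG16; the replaced classification head, the optimizer and the training call are our assumptions, since the paper only states that the last layer was modified and that five epochs were used with a decaying learning rate.

# Hedged sketch of the Section 3.1 classifier: VGG16 pre-trained on ImageNet,
# with its top replaced by a 33-way softmax head.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

NUM_SIGNS = 33  # 24 alphabet gestures + digits 1-9

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the ImageNet features; only the new head is trained

model = tf.keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(NUM_SIGNS, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # 5 epochs, as in Section 3.4;
# the per-epoch learning-rate division described there is omitted in this sketch.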
Two further datasets were selected only as test datasets: the RGB images of the Massey University dataset (#3), and the depth images of the Padova Senz3D dataset (#4).

3.3. Data preprocessing

Image preprocessing in machine learning typically consists of cropping, resizing and feature extraction operations. Next we describe the preprocessing tasks needed for each dataset.
As said before, the CNN used requires 224x224-pixel images as input features, so resizing had to be performed for the Superpixel, Fingerspelling and Massey University datasets. Cropping (hand segmentation) is not needed for these datasets.
Regarding the UVIGO dataset, hand segmentation is performed using the information provided by the Kinect2 sensors. Kinect2 gives access to RGB and depth image streams, along with 25 body-joint coordinates. In the RGB stream, our segmentation algorithm uses 4 body joints to easily locate a hand and then crops a square around the detected hand. For the depth stream, we again use 4 joints to locate the hand. This hand is segmented in the xy plane as in the RGB images. The algorithm then applies another cropping in the z axis, using the depth values of the hand to set a threshold (Figure 1). This eliminates the background from the depth images.
Finally, we also performed a segmentation of the Padova dataset. The depth-stream cropping described before was applied, but in this case, instead of using the joints provided by Kinect2, we directly used the depth values to distinguish the hand from its surroundings.

Figure 1: Segmentation representation.

3.4. Experiments and results

The results presented in this section are accuracy scores calculated from CNN predictions. All the CNNs were trained in the same way, fixing the number of epochs to five, and with a low initial learning rate, which is divided by four every epoch. With such a reduced number of epochs, and taking advantage of the higher learning slope that characterizes transfer learning, the goal was to reduce overfitting.
In a first experiment, each training dataset was separately used to train a CNN. Table 2 shows the results in terms of accuracy over the testing data.
The highest scores in Table 2 are from testing each CNN with its own testing material (main diagonal). However, despite the UVIGO CNN having the lowest of these three scores, it performed better when tested with the #1 and #2 test sets (second row in Table 2). Conversely, the Fingerspelling CNN had the highest accuracy (99.61%) but the lowest accuracy values over the other test sets (fourth row in Table 2).
The UVIGO CNN (which was trained using both RGB and depth images), tested over the other datasets, gave better results, so it seems to generalize better. Therefore, training with RGB pictures or depth pictures separately did not perform well in recognizing depth and RGB pictures, respectively.
As seen in Table 1, #1 and #2 both have five signers, unlike UVIGO, which has two signers. More signers mean more variation, and therefore more generalization. However, these results seem to indicate that using RGB and depth images is as important as the number of signers.

Table 2: Test comparison for the trained CNNs in terms of accuracy. The labels of the first column refer to the training material used to train the CNN.

Training data / Testing data   UVIGO    Superpixel   Fingerspelling
UVIGO                          89.76%   27.97%       20.96%
Superpixel                     17.39%   92.48%       7.81%
Fingerspelling                 10.00%   6.38%        99.61%

In a second experiment, the three training datasets were jointly used to train three CNNs.
Table 3 summarizes the results for the training of these three CNNs: the "All RGB" label means a CNN trained with only RGB images from the UVIGO and Superpixel datasets, the "All depth" label means a CNN trained with only depth images from the UVIGO and Fingerspelling datasets, and the "All" label means a CNN trained with both types of images. They are also compared to the Massey (#3) and Padova (#4) datasets.
It can be seen that the CNNs based on both RGB and depth images outperformed the CNNs trained with only one kind of image. All (Table 3) exceeded the 50% precision mark on external datasets. All depth also did so, but only for the Padova dataset, because it consists of depth images.

Table 3: Comparison between different types of images. The labels of the first column refer to the training material used to train the CNN.

Training data / Testing data   All RGB   All depth   Massey    Padova
All RGB                        91.62%    6.56%       25.53%    7.31%
All depth                      8.53%     95.33%      16.73%    59.70%
All                            95.21%    96.77%      52.51%    69.63%

It can be concluded that mixing depth and RGB images is an effective way to create a robust and flexible dataset to train systems for SL recognition. However, if only one type of image must be chosen, it is preferable to work with depth datasets, because they have better overall performance than RGB datasets. This may be due to the absence of background noise, which was filtered out during preprocessing.

3.4.1. Model application

The best trained network was incorporated into a hand sign recognition application, resulting in segmentation speeds of 0.001 s for RGB images and 0.007 s for depth images (both hands at the same time), and a prediction time of 0.054 s. The whole process was therefore performed in 0.116 s.
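The depth-based cropping described in Section 3.3 could look roughly like the following NumPy sketch (our illustration, not the authors' code); the crop size and depth margin are invented placeholders, and the hand-joint pixel coordinate is assumed to be already extracted from the Kinect2 streams.

# Hedged sketch of the Section 3.3 segmentation: crop a square around a hand joint
# in the xy plane, then threshold in the z axis using the hand's depth value.
import numpy as np

def segment_hand(depth_map, hand_xy, box=112, margin_mm=120):
    """depth_map: HxW depth image in mm; hand_xy: (x, y) pixel of the hand joint."""
    x, y = hand_xy
    h, w = depth_map.shape
    x0, x1 = max(0, x - box), min(w, x + box)
    y0, y1 = max(0, y - box), min(h, y + box)
    crop = depth_map[y0:y1, x0:x1].copy()            # xy-plane cropping
    hand_depth = depth_map[y, x]                     # depth value at the joint
    crop[np.abs(crop - hand_depth) > margin_mm] = 0  # z-axis threshold removes background
    return crop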
3.5. Dataset acquisition guidelines

As mentioned, training a sign recognition NN requires significant amounts of data, with sufficient diversity in sign execution and in recording environments to ensure robustness. Considering the results, dataset acquisition guidelines are described next to meet these requirements.
It is therefore recommended to record data from at least 10 different signers and to ensure a balance between male and female and between native and non-native signers. It is also recommended to spread recording sessions over several days to ensure a variety of conditions (lighting, clothes, etc.) and settings.
It is highly recommended to record RGB and depth at the same time, preferably with a resolution of at least 640x480 pixels. Depth images are useful because a more precise segmentation is possible. Another advantage is that different lighting conditions, clothes and settings do not have any influence, because they are not perceived by the sensors. Nonetheless, recording sex, height and anatomical differences is important.
To train the system for each sign in the same way, it is highly recommended to formalize the number of images recorded per sign, signer and recording setting. 100 images per isolated gesture and 20 videos per continuous gesture for each signer is proposed as a good compromise. For a sufficient number of signers, this represents a sufficiently large and sufficiently heterogeneous set of empirical data to avoid training the network with images too similar to each other.
Formalization also requires efficient and convenient organization and annotation. It is therefore recommended to do the following: organize the dataset in a folder tree, with a single folder per signer, each containing a sub-folder per sign and sub-sub-folders for each repetition of that sign; identify signer folders using an identification code, and name sign folders after the sign name; identify each image with the signer identification code, the recording date (or code), the repetition number and the sign name; and finally, if feasible, assign each recorded sign a code and use that instead of its name.
It is also recommended to describe the files in terms of three sections: one with the lexicon and the associated codes; one with the signer's name and respective code, with information on date of birth, why they use SL and their dominant hand; and one with details on the recording session, equipment used, recorder name, recording date, file name and recording conditions. It is recommended to use a spreadsheet (or CSV file) because it is very easy to filter out and retrieve key statistics from the data.

4. Conclusions and further work

Although a few studies exist on ASLR, progress is still limited by the lack of datasets. The fact that the few available lack reliability and convenience makes the creation of new datasets mandatory in order to progress in this field.
We have described an experimental framework designed to study the reliability of existing datasets and the combination of RGB and depth data, and have experimentally trained CNNs. The system achieved over 50% precision for challenging datasets from both natural and artificial recording environments. This method can be used to create a complete and straightforward dataset suitable for research. Future research will focus on improving system training to match or surpass accuracy rates for both depth and RGB images separately, with all datasets included, and on extending recognition to Spanish SL. An expansion of the acquired dataset is also planned, as well as the creation of two computer applications, one for recording and another for real-time gesture recognition.

5. Acknowledgements

This work has received financial support from the Xunta de Galicia (Agrupación Estratéxica Consolidada de Galicia accreditation 2016-2019, Galician Research Network TecAnDaLi ED431D 2016/011 and Grupos de Referencia Competitiva GRC2014/024) and the European Union (FEDER/ERDF).

6. References

[1] J. Isaacs and S. Foo, "Hand pose estimation for American sign language recognition," IEEE Thirty-Sixth Southeastern Symposium on System Theory, 2004, pp. 132-136.
[2] D. Guo, W. Zhou, M. Wang and H. Li, "Sign language recognition based on adaptive HMMS with data augmentation," IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, 2016, pp. 2876-2880.
[3] A. Memo, L. Minto and P. Zanuttigh, "Exploiting Silhouette Descriptors and Synthetic Data for Hand Gesture Recognition", Eurographics Italian Chapter Conference, Eurographics Association, pp. 15-23, 2015.
[4] Z. Parcheta and C.-D. Martínez-Hinarejos, "Sign Language Gesture Recognition Using HMM", L. A. Alexandre et al. (Eds.): IbPRIA 2017, LNCS 10255, pp. 419-426, 2017.
[5] C.-D. Martínez-Hinarejos and Z. Parcheta, "Spanish Sign Language Recognition with Different Topology Hidden Markov Models", Proceedings of Interspeech 2017, Stockholm, Sweden, pp. 3349-3353, 2017.
[6] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko and T. Darrell, "Long-Term Recurrent Convolutional Networks for Visual Recognition and Description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 677-691, April 2017.
[7] E. Tsironi, P. Barros, C. Weber and S. Wermter, "An Analysis of Convolutional Long Short-Term Memory Recurrent Neural Networks for Gesture Recognition", Neurocomputing, vol. 268, pp. 76-86, 2016.
[8] X. Shi, Z. Chen, H. Wang and D. Y. Yeung, "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting", Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), vol. 1, pp. 802-810, 2015.
[9] O. Koller, H. Ney and R. Bowden, "Automatic Alignment of HamNoSys Subunits for Continuous Sign Language Recognition", LREC Workshop on the Representation and Processing of Sign Languages: Corpus Mining, Portorož, Slovenia, pp. 121-128, 2016.
[10] "Grades," [Online]. Available: http://grades.uvigo.es/. [Accessed 12 07 2018].
[11] C. Wang, Z. Liu and S. C. Chan, "Superpixel-Based Hand Gesture Recognition With Kinect Depth Camera", IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 29-39, Jan. 2015.
[12] B. Kang, S. Tripathi and T. Q. Nguyen, "Real-time Sign Language Fingerspelling Recognition using Convolutional Neural Networks from Depth Map", 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, 2015, pp. 136-140.
[13] A. L. C. Barczak, N. H. Reyes, M. Abastillas, A. Piccio and T. Susnjak, "A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures", Research Letters in the Information and Mathematical Sciences, vol. 15, pp. 12-20, 2011.
[14] G. Marin, F. Dominio, P. Zanuttigh, “Hand Gesture Recognition
with Leap Motion and Kinect Devices”, IEEE International
Conference on Image Processing (ICIP), Paris, 2014, pp. 1565-
1569.
[15] Z. Ren, J. Yuan, J. Meng and Z. Zhang, “Robust Part-Based Hand
Gesture Recognition Using Kinect Sensor”, IEEE Transactions
on Multimedia, vol. 15, no. 5, pp. 1110-1120, Aug. 2013.
[16] S. C. Chan, C. Wang and Z. Liu, “Hand gesture recognition based
on canonical formed superpixel earth mover's distance”, 2016
IEEE International Conference on Multimedia and Expo (ICME),
Seattle, WA, 2016, pp. 1-6.
[17] X. Sun, Y. Wei, S. Liang, X. Tang and J. Sun, “Cascaded Hand
Pose Regression”, IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Boston, MA, 2015, pp. 824-832.
[18] M. Zahedi, D. Keysers, T. Deselaers, and H. Ney, “Combination
of Tangent Distance and an Image Distortion Model for
Appearance-Based Sign Language Recognition”, Deutsche
Arbeitsgemeinschaft für Mustererkennung Symposium (DAGM),
Lecture Notes in Computer Science, Vienna, Austria, pp. 401-408,
Aug. 2005.
[19] C. Neidle, A. Thangali and S. Sclaroff, “Challenges in
Development of the American Sign Language Lexicon Video
Dataset (ASLLVD)”, 5th Workshop on the Representation and
Processing of Sign Languages: Interactions between Corpus and
Lexicon, LREC 2012, Istanbul, Turkey, May 27, 2012.
[20] P. Dreuw, D. Rybach, T. Deselaers, M. Zahedi, and H. Ney,
“Speech Recognition Techniques for a Sign Language
Recognition System”, INTERSPEECH 2007, Antwerp, Belgium,
pp. 2513-2516, Aug. 2007.
[21] P. Dreuw, T. Deselaers, D. Keysers, and H. Ney, “Modeling
Image Variability in Appearance-Based Gesture Recognition”,
ECCV Workshop on Statistical Methods in Multi-Image and
Video Processing (ECCV-SMVP), Graz, Austria, pp. 7-18, May
2006.
[22] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater,
and H. Ney, “RWTH-PHOENIX-Weather: A Large Vocabulary
Sign Language Recognition and Translation Corpus”, Language
Resources and Evaluation (LREC), Istanbul, Turkey, pp. 3785-
3789, 2012.
[23] European Sign Language Center, “Spread The Sign,” 2006.
[Online]. Available: www.spreadthesign.com. [Accessed 11 07
2018].
[24] F. Ronchetti, F. Quiroga, C. Estrebiu, L. Lanzarini and A. Rosete,
“LSA64: An Argentinian Sign Language Dataset”, XXII
Congreso Argentino de Ciencias de la Computación (CACIC),
2015.
[25] I. Shao and L. Liu, “Learning Discriminative Representations
from RGB-D Video Data”, Proceedings of the Twenty-Third
International Joint Conference on Artificial Intelligence
(IJCAI’13), pp. 1493-1500, May 2013.
[26] Microsoft Research Cambridge-12 Kinect, “Kinect Gesture Data
Set”, [Online]. Available: https://www.microsoft.com/en-
us/download/details.aspx?id=52283. [Accessed 12 07 2018].
[27] S. Escalera, X. Baro, H. J. Escalante and I. Guyon, “ChaLearn
Looking at People: A Review of Events and Resources”, 2017
International Joint Conference on Neural Networks (IJCNN),
Anchorage, AK, 2017, pp. 1594-1601, 2017.
[28] K. Simonyan and A. Zisserman, “Very Deep Convolutional
Networks for Large-Scale Image Recognition”, arXiv preprint
arXiv:1409.1556, 2014.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Baseline Acoustic Models for Brazilian Portuguese Using Kaldi Tools


Cassio T. Batista, Ana Larissa da S. Dias, Nelson C. Sampaio Neto

Federal University of Pará, Institute of Exact and Natural Sciences


Augusto Corrêa 1, Belém 660750-110, Brazil
{cassiotb,nelsonneto}@ufpa.br, ana.dias@itec.ufpa.br

Abstract

Kaldi has become a very popular toolkit for automatic speech recognition, showing considerable improvements through the combination of hidden Markov models (HMM) and deep neural networks (DNN). However, in spite of its great performance for some languages (e.g. English, Italian, Serbian, etc.), the resources for Brazilian Portuguese (BP) are still quite limited. This work describes what appears to be the first attempt to create Kaldi-based scripts and baseline acoustic models for BP using Kaldi tools. Experiments were carried out for dictation tasks, and a comparison to the CMU Sphinx toolkit in terms of word error rate (WER) was performed. Results seem promising, since Kaldi achieved the absolute lowest WER of 4.75% with HMM-DNN and outperformed CMU Sphinx even when using Gaussian mixture models only.
Index Terms: automatic speech recognition, Brazilian Portuguese, Kaldi

1. Introduction

The attempts to simplify the communication between humans and machines are not a novelty. Ever since the emergence of consumer electronic computers, researchers have been putting a lot of effort into designing more convenient interfaces for controlling and interacting with electronic devices. However, despite the consolidation of keyboards and mice as the main input methods for personal computers, as well as touch screens for mobile devices, alternative control interfaces such as speech, body gestures, or even thoughts never ceased to be investigated.
For years, the combination of hidden Markov models (HMM) and Gaussian mixture models (GMM) has been the state-of-the-art technique for acoustic modeling in the automatic speech recognition (ASR) field [1, 2]. Concerning the Brazilian Portuguese language in particular, robust speech recognition systems based on traditional HMM-GMMs have already been proposed [3, 4] using both HTK [5] and CMU Sphinx [6] tools. However, the deep learning approaches that emerged last decade seem to have outperformed HMM-GMM models when replacing the Gaussian mixtures by deep neural networks (DNN) combined with HMMs.
Most works that apply deep learning for speech recognition make use of Kaldi [7], an open-source software package that implements the hybrid HMM-DNN combination. To the best of our knowledge, no previous work has developed a Kaldi recipe for Brazilian Portuguese (BP) yet. Therefore, towards building freely available resources [8] for a large vocabulary continuous speech recognition (LVCSR) system in BP using the Kaldi toolkit, this work presents results for HMM-GMM triphone-based acoustic models in terms of word error rate (WER). A preliminary result for DNN-based models is also presented, but the main development of HMM-DNN hybrid models is still ongoing, given the huge amount of time taken to train them.

2. Related Work

A literature review was conducted on IEEE Xplore and the ACM Digital Library. However, most works from ACM simply mention speech recognition as an application of DNNs. Therefore, only papers from IEEE Xplore were considered. After filtering by title and abstract, the most relevant ones were selected and will be shortly detailed below.
Sahu and Ganesh [9] performed a survey on the HTK, CMU Sphinx and Kaldi toolkits for different languages regarding their performance in terms of WER. They found that Kaldi achieved the best WER value of 2.7% using the Wall Street Journal (WSJ) English corpus. In another work, Becerra et al. [10] presented a comparative case study for Spanish between the conventional HMM-GMM architecture and the recent HMM-DNN model using Kaldi. The audio corpus used includes 1,836 sentences from 87 speakers sampled at 16 kHz, which are a mixture of human voices and text-to-speech utterances. A 20.71% improvement was achieved by the HMM-DNN architecture over the HMM-GMM models: 3.33% against 4.20% of WER, respectively.
Popović et al. [11] used Kaldi to develop an HMM-based ASR system for the Serbian language. The audio corpora used in the experiment contain 95 hours of speech sampled at 8 kHz. They obtained a word recognition accuracy of approximately 98%. For Italian, on the other hand, Cosi [12] adapted Kaldi's TIMIT recipe for the FBK ChildIt corpus, which contains approximately 10 hours of speech of children sampled at 16 kHz. The results only show that DNN configurations outperform the non-DNN ones. Karan et al. [13] also used Kaldi to develop a speech recognition system, now for the Hindi Odia language. The audio corpus consisted of 2,647 utterances collected from 104 speakers at 8 kHz using mobile phones. The experiment used the conventional HMM-GMM architecture only, and reports the best result of 1.74% WER in the triphone model.
Ali et al. [14] presented a complete Kaldi recipe for building Arabic speech recognition systems. The corpus used was the GALE Arabic Broadcast News data set, which consisted of 100,000 speech segments of nine different TV channels, a total of 203 hours of speech data recorded at 16 kHz. In the experiment, the DNN-based system achieved the best results with an overall WER of 26.95%, which is nearly a 10% relative improvement over the HMM-GMM model. Kipyatkova and Karpov [15], on the other hand, built an HMM-DNN acoustic model using Kaldi for the Russian language. For training and testing, they used their own speech recorded at 44.1 kHz with 16 bits per sample. The data set was composed of 55 speakers and 16,850 utterances. Two different kinds of neuron activation functions were implemented in the neural network: tanh and p-norm. The results showed that the p-norm function obtained the best WER value of 20.30%.
Another search was performed on the previous IberSPEECH proceedings of 2014 and 2016, where two works stood out.

77 10.21437/IberSPEECH.2018-17
Figure 1: Training stages of a hybrid HMM-DNN, triphone-based acoustic model on Kaldi.

In the first one, Guiroy et al. [16] implemented an HMM-DNN ASR system in Kaldi and also conducted a comparative study between HMM-based models in both Kaldi and HTK. The Castilian Spanish SpeechDat(II) FDB-4000 audio corpus was used, which contains 43 hours of recordings from 4,000 speakers. The results indicated a 34.02% decrease in WER when comparing the most accurate DNN-based and HMM-based models from Kaldi. A decrease of 53.79% for the HMM-based model in Kaldi could also be observed over their most accurate model from HTK.
In the second work, Zorrilla et al. [17] carried out several experiments using Kaldi in order to evaluate different deep-learning approaches for acoustic modeling on well-known Spanish data sets, namely Albayzin, Dihana, CORLEC-EHU and TC-STAR. In addition, the El País text corpus was used for language modeling. The authors found through experiments that all HMM-DNN hybrid acoustic models outperformed the HMM-GMM ones and work well even with non-task-specific language models.
During the research, we also found two works that tackle the ASR problem for BP using deep neural networks. Quintanilha et al. [18] presented an open-source, character-based, end-to-end bidirectional long short-term memory (BLSTM) neural network for LVCSR. Several experiments were conducted over a data set of approximately 14 hours of recorded audio, and the best performance, evaluated in terms of label error rate, was 31.53% without the use of any language model. Bonilla et al. [19], on the other hand, proposed an end-to-end deep-learning system for recognizing digits, which is compared to a simple multilayer perceptron (MLP) network. It is not clear, however, whether the system classifies characters, words or phonemes. The best result is reported as a 97.5% accuracy rate, against the 82.8% achieved by the MLP.
According to the literature review, it appears that no previous work has developed ASR resources with Kaldi for Brazilian Portuguese yet. Therefore, we believe this is the first attempt to build acoustic models for BP using the toolkit's deep learning approaches.

3. Tools and Resources for BP using Kaldi

In order to build a speech recognition system, one must be provided with a language model (LM), a phonetic dictionary and an acoustic model (AM). The resources and tools used to build each one of the three aforementioned components with Kaldi will be detailed below. It is worth mentioning that the LM and the dictionary are the very same used in CMU Sphinx as well. The steps to train the AMs in particular are similar for both toolkits, but some differences will be pointed out along the text. For further information about acoustic model training for BP using CMU Sphinx tools, the reader is referred to [4].

3.1. Audio Corpora

Speech recognition is a data-driven technology, which means it requires a relatively large amount of labeled data (transcribed audio) to work properly. The corpora used to train the acoustic models with Kaldi are composed of seven data sets, as summarized in Table 1. The data sets contain audio files in an uncompressed, linear, signed PCM (namely, WAVE) format, and are sampled at 16 kHz with 16 bits per sample. It is important to note that the actual number of speakers in West Point was rather reduced due to the abundance of foreign words amidst the corpus. Besides, the Constitution and Consumer Protection Code corpora share the same speaker.

3.2. Phonetic Dictionary and Language Model

The phonetic dictionary maps every grapheme in the lexicon (orthographic representation) to one or more phonetic transcriptions. The software described in [24] was used to include the pronunciation mapping of each of the 14,518 words into the dictionary. The trigram language model used in this work is described in [3]. It was trained with the SRILM [25] toolkit on 1.6 million phrases from the CETENFolha [26] corpus, yielding a perplexity value of 170. The LM is available in ARPA format, but in order to be used in the Kaldi environment, it was converted to the FST format using the provided arpa2fst script.
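As an illustration of the dictionary format only, the following Python sketch writes word-to-phones entries in the plain-text, one-word-per-line layout that Kaldi's data preparation consumes; the two example words and their phone symbols are invented placeholders and do not come from the actual 14,518-word dictionary or from the software in [24].

# Hedged sketch: dumping a phonetic dictionary as "word phone1 phone2 ..." lines.
# The entries below are made-up examples, not real output of the BP grapheme-to-phone tool.
lexicon = {
    "casa": ["k", "a", "z", "a"],
    "fala": ["f", "a", "l", "a"],
}

with open("lexicon.txt", "w", encoding="utf-8") as f:
    for word in sorted(lexicon):
        f.write(word + " " + " ".join(lexicon[word]) + "\n")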
Table 1: Audio corpora used to train acoustic models.

Data set                   Ref.   Hours      Words    Speakers
LapsStory                  [3]    5h:18m     8,257    5
LapsBenchmark              [3]    0h:54m     2,731    35
Constitution               [20]   8h:58m     5,330    1
Consumer Protection Code   [20]   1h:25m     2,003    1
Spoltech LDC               [21]   4h:19m     1,145    475
West Point LDC             [22]   5h:22m     484      70
CETUC                      [23]   144h:39m   3,528    101
Total                             170h:51m   14,518   687

Table 2: Kaldi DNN tools and parameters used for training.

Tool or Parameter     Value
DNN codebase          nnet2 ("Dan's DNN")
Script                train_pnorm_fast.sh
Hidden layers         2
Activation function   p_norm
pnorm_output_dim      3,000
pnorm_input_dim       300
num_epochs            8
num_epochs_extra      5
Minibatch size        512
Learning rate         0.02 down to 0.004

3.3. Acoustic Model

The scripts used on the Brazilian Portuguese audio corpora were based on Kaldi's recipe available for the WSJ corpus [27]. For the sake of comparison, HMM-GMM acoustic models were trained with both Kaldi and CMU Sphinx as well. On Kaldi, the deep learning approach actually uses the HMM-GMM training as a pre-processing stage. Figure 1 shows the steps followed to train HMM-DNN acoustic models on Kaldi based on HMM-GMM triphones. The audio signals are windowed at every 25 ms with 10 ms of overlap in the front-end, being encoded as a 39-dimension vector: 12 Mel frequency cepstral coefficients (MFCCs) [28] using C0 as the energy component, plus 13 delta (∆, first derivative) and 13 acceleration (∆∆, second derivative) coefficients, are extracted from each window.
The AMs are iteratively refined. The flat-start approach models 39 phonemes (38 monophones plus one silence model) as context-independent HMMs. The standard 3-state left-to-right HMM topology with self-loops was used. At the flat-start, a single Gaussian mixture models each individual HMM with the global mean and variance of the entire training data. The transition matrices are also initialized with equal probabilities. The parameters used for extracting MFCCs and training the monophones are the same for both CMU Sphinx and Kaldi.
Nevertheless, Kaldi uses the Viterbi training algorithm [29] to re-estimate the models at each training step, rather than the Baum-Welch algorithm [30] used by CMU Sphinx. Furthermore, Viterbi alignment is applied after each training step in order to allow the training algorithms to improve the model parameters, a feature that is not present in CMU Sphinx by default. Subsequently, the context-dependent HMMs are trained for each triphone, first with the delta and then with the acceleration coefficients. Each triphone is represented by a leaf on a decision tree, which is automatically created by both toolkits using statistical methods. Eventually, leaves with similar phonetic characteristics are tied/clustered together.
The last two steps for training an HMM-GMM acoustic model with Kaldi are the linear discriminant analysis (LDA) [31] combined with the maximum likelihood linear transform (MLLT) [32], followed by speaker adaptive training (SAT) [33]. Both are included in most tutorials for AM training with Kaldi. The latter, however, was not taken into account during our simulations in order to save time, so only LDA+MLLT was adopted. Moreover, these two steps are not enabled by default in CMU Sphinx and, since they were not used in [4] either, we decided not to include them in order to try to reproduce the results and to save time as well.
The LDA technique takes the feature vectors and splices them across several frames, building HMM states with a reduced feature space. Then, a unique transformation for each speaker is obtained by a diagonalizing MLLT transform. On top of the LDA+MLLT features, the fMLLR alignment algorithm, which is a speaker normalization that uses feature-space maximum likelihood linear regression (fMLLR), is applied [34].
Finally, the HMM-DNN acoustic model is obtained by using the neural network to model the state likelihood distributions as well as to input those likelihoods into the decision tree leaf nodes [16]. In short terms, the network input consists of groups of feature vectors, and the output is given by the aligned state of the HMM-GMM system for the respective features of the input. The number of HMM states in the system also defines the DNN's output dimension [15].
Table 2 shows the most important Kaldi tools and parameters used to train the deep neural network. Kaldi provides two distinct implementations for DNN training: nnet1 [35], which is primarily maintained by Karel Veselý; and nnet2 [36], by Daniel Povey. The latter was chosen because it supports CPU training, while nnet1 enables GPU training only, a resource that was not available for us. Regarding the activation functions of the DNN, we chose the p_norm nonlinearity because it presents a superior performance over tanh in the literature review [15, 16, 17]. The remaining parameters of the DNN were set based on Kaldi's documentation as well as on the related works. Since there is actually no parameter to define the number of neurons in the hidden layers for p-norm networks, the pnorm_output_dim and pnorm_input_dim parameters must be set instead, the latter being an integer multiple of the former, usually with a ratio of 5 or 10 [37]. The number of epochs is given by the sum of the num_epochs and num_epochs_extra parameters. The first one was supposed to be 15, but it is recommended to reduce it when the computational environment is not very high powered [37], so we chose 13 (8+5) to be the total number of epochs. The learning rate was set to vary from 0.02 down to 0.004 during the default number of epochs, and to stay constant at 0.004 for the next extra epochs [15].

4. Experimental Tests and Results

Tests were executed on an HP EliteDesk 800 G1 desktop computer equipped with an Intel® Core™ i5-4570 3.20 GHz CPU, 8 GB of RAM and 1 TB of hard disk storage. During the experiments, the LapsBenchmark corpus was held exclusively for testing, and the six other corpora were used for training.
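For illustration, a minimal Python sketch of the 39-dimensional front-end described in Section 3.3 (ours, not the paper's scripts), assuming librosa and the usual 25 ms window / 10 ms step interpretation; the file name is a placeholder.

# Hedged sketch of the front-end described above: 13 MFCCs (including C0 as energy),
# plus delta and delta-delta coefficients, computed over 25 ms windows every 10 ms.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)          # corpora are sampled at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),        # 25 ms analysis window
                            hop_length=int(0.010 * sr))   # 10 ms step (assumed)
delta = librosa.feature.delta(mfcc)                       # first derivative
delta2 = librosa.feature.delta(mfcc, order=2)             # second derivative
features = np.vstack([mfcc, delta, delta2])               # 39 coefficients per frame
print(features.shape)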
Table 3: WER (%) achieved by CMU Sphinx and Kaldi toolkits.

Number of tied-states or senones


Toolkit # Gauss. 500 1,000 2,000 4,000 6,000 8,000
2 21.70 18.60 17.30 16.20 15.40 15.10
CMU Sphinx 4 17.50 15.50 14.60 13.10 13.10 12.70
(standard) 8 15.50 13.40 12.80 11.90 11.80 12.20
16 14.20 12.80 11.90 11.10 11.20 11.30
2 26.54 21.31 18.25 15.69 14.32 14.04
Kaldi 4 21.61 18.07 15.81 13.46 12.47 11.91
(tri-∆) 8 19.91 15.80 13.67 11.66 10.85 10.31
16 17.25 14.10 12.09 10.64 9.68 9.31
2 25.47 20.63 17.28 15.18 13.64 13.42
Kaldi 4 21.13 17.79 14.65 13.10 11.93 11.39
(tri-∆+∆∆) 8 18.72 15.26 13.02 11.15 11.03 10.20
16 17.17 13.61 12.06 10.47 9.31 9.23
2 19.91 15.36 12.12 9.99 9.60 9.13
Kaldi 4 16.84 13.43 11.15 9.12 8.65 8.04
(tri-LDA-MLLT) 8 14.42 11.93 9.63 8.15 7.58 6.99
16 12.96 10.75 8.85 7.60 6.79 6.50

Unfortunately, no clusters or graphic cards could be used for training the models. Therefore, due to the computational burden and the lack of hardware resources, it was not possible to develop DNN-based AMs for all combinations of HMM-GMM acoustic models with Kaldi.
Table 3 shows the results obtained with both CMU Sphinx and Kaldi. For Kaldi, the WER was evaluated across all triphone training steps in order to perform a more complete comparison to the CMU Sphinx results, since neither the LDA+MLLT stage nor the fMLLR alignment were included for the latter toolkit. For Sphinx, as expected, the WER decreases as we increase both the number of Gaussians and the number of tied-states of the model. However, the values seem to converge after 4,000 senones and 8 Gaussians. The lowest WER value achieved was approximately 11.1% with 4,000 senones and 16 Gaussian densities.
For Kaldi, however, we found that the convergence observed in the CMU Sphinx results does not occur. As we increase the number of senones and the number of Gaussians, the WER values drop linearly. Besides, it can be seen that the lowest WER values for the first two triphone training steps (tri-∆ and tri-∆∆) are already lower than the best one achieved by CMU Sphinx: 9.31% and 9.23%, respectively. The global lowest WER value obtained with Kaldi was 6.5%, with 8,000 tied-states and 16 Gaussians at the tri-LDA-MLLT step, which is equivalent to 128,000 leaves on the decision tree, according to Kaldi's parameter settings (basically the result of the product 8,000 × 16).
As proof of concept, we trained a DNN-based acoustic model on the best HMM-GMM model produced with Kaldi. The WER value dropped from 6.5% to 4.75%, an improvement of 26.92%. When compared to the lowest WER value obtained with CMU Sphinx, the improvement increases to 57.21%, which is a huge difference for dictation tasks.

5. Conclusions and Future Works

This paper addressed the first attempt to develop a speech recognition system for large vocabulary (LVCSR) in Brazilian Portuguese using the Kaldi toolkit. Triphone-based, HMM-GMM acoustic models with different numbers of Gaussians and tied-states were trained with Kaldi and CMU Sphinx tools in order to establish a comparison in terms of word error rate (WER). The evaluation results showed that the systems perform better as we increase the number of Gaussian densities per mixture and the number of tied-states. For CMU Sphinx, the results obtained are in accordance with [4], in spite of the current WER achieved being lower, possibly due to the larger corpora used for training the models.
Results also showed that Kaldi definitely outperformed CMU Sphinx even without the use of its deep learning tools. An explanation might be the use of the Viterbi algorithm for training (rather than Baum-Welch), as well as the use of Viterbi alignments in between each training stage, which is said to improve or refine the parameters of the model [38]. With the use of DNNs, Kaldi presents an improvement of 57.21% over the best HMM-GMM-based acoustic model built with CMU Sphinx.
As future work, we plan to finish training the HMM-DNN triphone-based AMs with Kaldi and consequently make them publicly available (together with the recipe) [8] to the community. We also expect to test with 32 and 64 densities per mixture, now also evaluating the decoding time in terms of the real-time factor (xRT) as the WER possibly decreases. Furthermore, HTK's latest release also has an implementation of deep learning algorithms, which may join the next comparisons.

6. Acknowledgements

The authors would like to thank Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Federal University of Pará (UFPA) under Edital n◦ 06/2017 – PIBIC/PROPESP for granting scholarships.
7. References
[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb 1989.
[2] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, 1st ed. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2001.
[3] N. Neto, C. Patrick, A. Klautau, and I. Trancoso, "Free tools and resources for Brazilian Portuguese speech recognition," Journal of the Brazilian Computer Society, vol. 17, no. 1, pp. 53–68, Mar 2011. [Online]. Available: https://doi.org/10.1007/s13173-010-0023-1
[4] R. Oliveira, P. Batista, N. Neto, and A. Klautau, "Baseline acoustic models for Brazilian Portuguese using CMU Sphinx tools," in Computational Processing of the Portuguese Language. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 375–380.
[5] S. Young, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book. Cambridge University Engineering Department, version 3.4, 2006.
[6] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, "Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices," in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 1, May 2006, pp. I–I.
[7] D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann, Y. Qian, P. Schwarz, and G. Stemmer, "The Kaldi speech recognition toolkit," in IEEE 2011 workshop, 2011.
[8] GitLab. (2018) Tutorial para treino de modelo acústico com kaldi. [Online]. Available: https://gitlab.com/fb-asr/fb-am-tutorial/kaldi-am-train
[9] P. K. Sahu and D. S. Ganesh, "A study on automatic speech recognition toolkits," in 2015 International Conference on Microwave, Optical and Communication Engineering (ICMOCE), Dec 2015, pp. 365–368.
[10] A. Becerra, J. I. de la Rosa, and E. González, "A case study of speech recognition in Spanish: From conventional to deep approach," in 2016 IEEE ANDESCON, Oct 2016, pp. 1–4.
[11] B. Popović, S. Ostrogonac, E. Pakoci, N. Jakovljević, and V. Delić, "Deep neural network based continuous speech recognition for Serbian using the Kaldi toolkit," in Speech and Computer. Cham: Springer International Publishing, 2015, pp. 186–192.
[12] P. Cosi, "A Kaldi-DNN-based ASR system for Italian," in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–5.
[13] B. Karan, J. Sahoo, and P. K. Sahu, "Automatic speech recognition based Odia system," in 2015 International Conference on Microwave, Optical and Communication Engineering (ICMOCE), Dec 2015, pp. 353–356.
[14] A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel, and J. Glass, "A complete Kaldi recipe for building Arabic speech recognition systems," in 2014 IEEE Spoken Language Technology Workshop (SLT), Dec 2014, pp. 525–529.
[15] I. Kipyatkova and A. Karpov, "DNN-based acoustic modeling for Russian speech recognition using Kaldi," in Speech and Computer. Cham: Springer International Publishing, 2016, pp. 246–253.
[16] S. Guiroy, R. de Cordoba, and A. Villegas, "Application of the Kaldi toolkit for continuous speech recognition using hidden Markov models and deep neural networks," in Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016, ser. LNCS, vol. 10077. Springer, November 2016.
[17] A. L. Zorrilla, N. Dugan, M. I. Torres, C. Glackin, G. Chollet, and N. Cannings, "Some ASR experiments using deep neural networks on Spanish databases," in Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016, ser. LNCS, vol. 10077. Springer, November 2016.
[18] I. M. Quintanilha, L. W. P. Biscainho, and S. L. Netto, "Towards an end-to-end speech recognizer for Portuguese using deep neural networks," in XXXV Simpósio Brasileiro de Telecomunicações e Processamento de Sinais, September 2017, pp. 709–714. [Online]. Available: http://www.sbrt.org.br/sbrt2017/anais/1570360756.pdf
[19] D. A. Bonilla, N. Nedjah, and L. de Macedo Mourelle, "Reconhecimento automático de fala em português usando redes neurais artificiais profundas," in Anais do 12 Congresso Brasileiro de Inteligência Computacional, C. J. A. Bastos Filho, A. R. Pozo, and H. S. Lopes, Eds. Curitiba, PR: ABRICOM, 2015, pp. 1–6.
[20] PCD Legal. (2018) PCD Legal: Acessível para todos. [Online]. Available: http://www.pcdlegal.com.br/
[21] LDC. (2018) CSLU: Spoltech Brazilian Portuguese version 1.0. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2006S16
[22] LDC. (2018) West Point Brazilian Portuguese speech. [Online]. Available: https://catalog.ldc.upenn.edu/LDC2008S04
[23] PUC-Rio. (2018) Centro de estudos em telecomunicações (CETUC). [Online]. Available: http://www.cetuc.puc-rio.br/
[24] A. Siravenha, N. Neto, V. Macedo, and A. Klautau, "Uso de regras fonológicas com determinação de vogal tônica para conversão grafema-fone em Português Brasileiro," 7th International Information and Telecommunication Technologies Symposium, 2008.
[25] A. Stolcke, "SRILM - an extensible language modeling toolkit," International Conference on Spoken Language Processing, 2002. [Online]. Available: http://www.speech.sri.com/projects/srilm/
[26] Linguateca. (2018) Corpus de extractos de textos electrónicos NILC/Folha de S. Paulo (CETENFolha). [Online]. Available: https://www.linguateca.pt/cetenfolha/
[27] GitHub. (2018) Kaldi speech recognition toolkit. [Online]. Available: https://github.com/kaldi-asr/kaldi
[28] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug 1980.
[29] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, April 1967.
[30] L. R. Welch, "Hidden Markov models and the Baum-Welch algorithm," in IEEE Information Theory Society Newsletter, vol. 53, 2003, pp. 10–12.
[31] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley Interscience, 2000.
[32] R. A. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, vol. 2, May 1998, pp. 661–664.
[33] S. Matsoukas, R. Schwartz, H. Jin, and L. Nguyen, "Practical implementations of speaker-adaptive training," in DARPA Speech Recognition Workshop, 1997.
[34] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75–98, April 1998. [Online]. Available: https://doi.org/10.1006/csla.1998.0043
[35] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in INTERSPEECH 2013, 2013, pp. 2345–2349.
[36] D. Povey, X. Zhang, and S. Khudanpur, "Parallel training of DNNs with natural gradient and parameter averaging," http://arxiv.org/pdf/1410.7455v8, Tech. Rep., 2014.
[37] Kaldi. (2018) Dan's DNN implementation. [Online]. Available: http://kaldi-asr.org/doc/dnn2.html
[38] E. Chodroff. (2018) Kaldi tutorial: Training overview. [Online]. Available: https://www.eleanorchodroff.com/tutorial/kaldi/training-overview.html


Converted Mel-Cepstral Coefficients for Gender Variability Reduction in Query-by-Example Spoken Document Retrieval
Paula Lopez-Otero1, Laura Docio-Fernandez2
1 Universidade da Coruña - CITIC, Information Retrieval Lab
2 Universidade de Vigo - atlanTTic Research Center, Multimedia Technology Group
paula.lopez.otero@udc.gal, ldocio@gts.uvigo.es

Abstract
Query-by-example spoken document retrieval (QbESDR) is a task that consists in retrieving those documents where a given spoken query appears. Spoken documents and queries exhibit a huge variability in terms of speaker, gender, accent or recording channel, among others. According to previous work, reducing this variability when following zero-resource QbESDR approaches, where acoustic features are used to represent the documents and queries, leads to improved performance. This work aims at reducing gender variability using voice conversion (VC) techniques. Specifically, a target gender is selected, and those documents and queries spoken by speakers of the opposite gender are converted in order to make them sound like the target gender. VC includes a resynthesis stage that can cause distortions in the resulting speech so, in order to avoid this, the use of the converted Mel-cepstral coefficients obtained from the VC system is proposed for QbESDR instead of extracting acoustic features from the converted utterances. Experiments were run on a QbESDR dataset in Basque language, and the results showed that the proposed gender variability reduction technique led to a relative improvement by 17% with respect to using the original recordings.
Index Terms: query-by-example spoken document retrieval, dynamic time warping, voice conversion, variability compensation

1. Introduction
Spoken document retrieval (SDR) consists in, given a set of spoken documents, retrieving those documents where a given query appears. The availability of technologies to perform this task is of paramount importance nowadays due to the amount of multimedia contents that are part of our everyday life. SDR can be carried out using either written or spoken queries. The latter alternative, known as query-by-example SDR (QbESDR), allows a natural communication with devices while easing the access to such technologies to visually impaired users. For these reasons, this task has gained the attention of the research community, which led to the organization of evaluations in order to encourage investigation on this topic [1, 2, 3, 4, 5, 6, 7].
The most common strategies for QbESDR are based on pattern matching techniques: first a set of features is extracted from the documents and queries, and then each query-document pair is compared using an alignment algorithm usually based on dynamic time warping (DTW) [8] or any of its variants. These techniques usually rely on posteriorgram representations of the speech utterances such as phone posteriorgrams [9], which account for the probability of each phone unit in a phone decoder given a speech frame [9, 10, 11]; and Gaussian posteriorgrams, which represent the likelihood of each Gaussian in a Gaussian mixture model (GMM) given a speech frame [12, 13, 14, 15]. Gaussian posteriorgrams are computed from acoustic features extracted from the waveforms, so they can be considered a zero-resource representation for QbESDR: this is interesting since no linguistic resources are necessary to develop the system. However, the use of acoustic features for speech representation makes the system sensitive to different sources of variability such as speaker identity, gender, accent or recording channel [16].
Previous work showed that compensating the query and document variability in terms of gender leads to an improvement of QbESDR performance [16]. In those experiments, alternative queries were generated via voice conversion (VC) in order to have a female and male version of every query; then, male queries were searched within male documents and female queries within female documents. These results led to believe that transforming both queries and documents into a similar voice would reduce the acoustic variability even more. Some experiments following this research direction were presented in [14], where vocal tract length normalization (VTLN) was used to reduce the speaker-specific variation in the queries and documents: this procedure implies computing the warping factor of every recording. In this paper, a new gender variability compensation strategy is proposed that consists in, given a target gender, converting all the documents and queries of the opposite gender to that target gender, so that all the resulting recordings will be spoken by speakers of the same gender. For this purpose, a speaker-independent VC strategy [17] is used, which allows the use of the same conversion function for all the spoken utterances.
Typical VC techniques consist in a speech analysis stage followed by feature conversion and speech resynthesis in order to obtain new speech utterances with the converted speech. One of the main drawbacks of these strategies is the negative effect caused by the vocoder since it can introduce distortions that reduce the quality and intelligibility of the resulting spoken utterances [18]. Nevertheless, it would not be necessary to resynthesize the speech if the features used in the VC procedure are suitable for speech representation in the QbESDR task. In this situation, the use of Gaussian posteriorgrams for speech representation is straightforward, since it allows the modelling of the documents and queries using the converted features obtained from the VC system. Therefore, in this paper, a comparison of QbESDR results when using converted features and when extracting the same features from the converted waveforms is made in order to quantify the influence of the speech resynthesis stage.
The rest of this paper is organized as follows: Section 2 describes the proposed gender variability compensation strategy for QbESDR; the experimental framework used for validation is described in Section 3; Section 4 discusses the experimental results; and, finally, some conclusions and future work are
presented in Section 5.

Figure 1: Overview of the proposed approach for gender variability compensation.

2. Gender variability compensation in QbESDR
The gender variability compensation technique proposed in this paper is depicted in Figure 1: given a target gender, all the speech utterances that are spoken by speakers of the opposite gender are converted to the target gender via VC, while the others remain unchanged. Then, QbESDR is performed using the resulting documents and queries. In this way, all the speech utterances sound as spoken by speakers of the same gender, therefore reducing the gender variability of the recordings in the QbESDR task.
The implementation of this system requires three different tasks: gender classification, gender conversion, and search.

2.1. Gender classification
First, the gender of the documents and queries must be detected in order to find out whether a speaker belongs to the target gender or, on the contrary, the speech must be converted. In this work, a gender classifier based on the GMM log-likelihood ratio was used. Given a speech utterance to classify, its likelihoods given male and female GMMs are computed and then the log-likelihood ratio is calculated. The utterance is classified as belonging to the gender with the highest likelihood [19].
As mentioned in [16], the proposed gender classifier has shown an accuracy close to 100%, so few utterances are expected to be misclassified. Nevertheless, since the proposed approach aims at converting the gender of the utterance, the apparent gender is more relevant than the actual gender, so possible classification errors are not really important in this work.
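As an illustration of the kind of classifier described above, the following sketch trains one GMM per gender and labels an utterance by the sign of the log-likelihood ratio. It is a minimal example using scikit-learn, not the authors' implementation; the 1024-mixture models, the MFCC front-end and the voiced-frame selection reported in Section 4 are assumed to be provided by the caller.

```python
# Minimal sketch of a GMM log-likelihood-ratio gender classifier.
# feats_*: (n_frames, n_features) arrays of acoustic features
# (e.g. MFCCs restricted to voiced frames, as described in the paper).
from sklearn.mixture import GaussianMixture

def train_gender_gmms(feats_male, feats_female, n_components=64):
    """Fit one diagonal-covariance GMM per gender (component count is illustrative)."""
    gmm_m = GaussianMixture(n_components, covariance_type="diag").fit(feats_male)
    gmm_f = GaussianMixture(n_components, covariance_type="diag").fit(feats_female)
    return gmm_m, gmm_f

def classify_gender(feats_utt, gmm_m, gmm_f):
    """Return 'male' or 'female' according to the average frame log-likelihood ratio."""
    llr = gmm_m.score(feats_utt) - gmm_f.score(feats_utt)  # score() averages over frames
    return "male" if llr > 0 else "female"
```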
2.2. Gender conversion
Male and female voices are different for anatomical reasons: males usually have longer vocal tracts as well as longer and heavier vocal folds than females. This derives in differences in the fundamental frequency (F0) and formant frequencies of their voice, which are generally higher for women [20]. Hence, VC techniques can be used to transform the gender of a speaker by modifying the voice characteristics of a source speaker in order to make it sound like a target speaker. In previous work on gender conversion for QbESDR, the VC technique proposed in [17] was used since it is speaker-independent, i.e. it can be used to transform any speaker into a different (undetermined) one of the opposite gender.
The gender conversion system used in this work has three stages, as most VC strategies: speech analysis, feature conversion, and speech resynthesis, as depicted in Figure 2. Three different sets of features are commonly extracted in the first stage [21]: 40 Mel-cepstral coefficients (MCEP), fundamental frequency (F0) and band aperiodicity features (BAP).

Figure 2: Voice conversion pipeline.

The VC approach used in this work is based on frequency warping (FW) and amplitude scaling (AS), and it consists in applying an affine transformation in the cepstral domain [22]:

y = Ax + b    (1)

where x is a MCEP vector, A denotes a FW matrix, b represents an AS vector, and y is the transformed version of x.
The method proposed in [17] uses a simplified FW curve that is defined piecewise by means of three linear functions, as depicted in Figure 3: the discontinuities of the FW curve are placed at frequencies fa and fb; α is the angle between the 45-degree line and the first linear function; and β is the angle between the 45-degree line and the second linear function, defined as β = kα (0 < k < 1). Values of α greater (less) than 0 lead to higher (lower) formant frequencies, resulting in a male-to-female (female-to-male) conversion function. This strategy is similar to vocal tract length normalization but, in this case, the same parameters are used for all the speech utterances, avoiding the need to compute a warping factor for each speaker.

Figure 3: Piecewise linear approximation of a FW function.

Afterwards, the AS vector b is defined by selecting random values from a set of weighted Hanning-like bands equally spaced in the Mel-frequency scale [23] as described in [17]. Finally, the fundamental frequency is scaled proportionally to the value of α [17].
Once the MCEP and F0 features are converted (BAP features remain unchanged), speech resynthesis is performed to obtain waveforms with the converted speech. This is done using a vocoder and, depending on the goodness of the converted features and the vocoder, the resulting speech can be more or less natural and intelligible [18].
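To make the conversion step concrete, the sketch below applies Equation (1) frame by frame and shifts log-F0 by an amount proportional to α. This is an illustrative reading, not the method of [17]: the FW matrix A, the AS vector b and the exact F0 scaling rule are assumed to be provided following that reference.

```python
# Illustrative framewise application of the conversion of Eq. (1).
# mcep: (n_frames, 40) Mel-cepstral coefficients; f0: (n_frames,) F0 in Hz (0 = unvoiced)
# A, b: frequency-warping matrix and amplitude-scaling vector built as in [17] (assumed given)
import numpy as np

def convert_utterance(mcep, f0, A, b, alpha, f0_scale=1.0):
    mcep_conv = mcep @ A.T + b                  # y = A x + b, applied to every frame
    voiced = f0 > 0
    f0_conv = f0.copy()
    # One plausible proportional scaling in the log-F0 domain (assumption, not from the paper).
    f0_conv[voiced] = f0[voiced] * np.exp(f0_scale * alpha)
    return mcep_conv, f0_conv                   # BAP features are left untouched
```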
2.3. Search
The QbESDR system proposed in this work belongs to the family of pattern-matching techniques for search on speech. An overview is presented in Figure 4.

Figure 4: Overview of the search on speech system.

Previous work on QbESDR using gender conversion [16] relied on a large set of features plus feature selection for speech representation [24]. Nevertheless, as mentioned above, this paper aims to straightforwardly use the converted MCEP features for QbESDR and compare the performance with that achieved when extracting features from the converted waveforms. Therefore, there are two alternatives for speech representation: using
MCEP features or obtaining a more complex representation that makes use of these features. Since raw acoustic features do not usually exhibit a good performance in QbESDR, Gaussian posteriorgrams were used for this purpose: a Gaussian posteriorgram represents each frame of a spoken utterance by means of a vector of dimension G: each element of this vector is the posterior probability of each of the G Gaussians in a GMM given the frame. This representation was first proposed in [12] and used for QbESDR in [13, 14, 15], to cite some examples.
After feature extraction, given a query Q = {q1, . . . , qn} and a document D = {d1, . . . , dm} of n and m frames respectively, with vectors qi, dj ∈ R^G and n ≪ m, DTW finds the best alignment path between these two sequences. Subsequence DTW [25] (S-DTW) was used in this system, since it allows the partial alignment of a short sequence (the query) with a longer sequence (the document). The first step consists in computing a cumulative cost matrix M ∈ R^(n×m) for a given query and document as follows:

M_{i,j} = c(qi, dj)                   if i = 0
M_{i,j} = c(qi, dj) + M_{i-1,0}       if i > 0 and j = 0    (2)
M_{i,j} = c(qi, dj) + M*(i, j)        otherwise

where c(qi, dj) is a function that defines the cost between query vector qi and document vector dj, and

M*(i, j) = min(M_{i-1,j}, M_{i-1,j-1}, M_{i,j-1})    (3)

In this paper, the log cosine similarity was used as the cost function as in [10] since it empirically showed a superior performance compared with other metrics:

cost(qi, dj) = -log( (qi · dj) / (|qi| |dj|) )    (4)

This metric is normalized in order to turn it into a cost function defined in the interval [0, 1]:

c(qi, dj) = (cost(qi, dj) - cost_min(i)) / (cost_max(i) - cost_min(i))    (5)

where cost_min(i) = min_j cost(qi, dj) and cost_max(i) = max_j cost(qi, dj).
After computing M, the S-DTW algorithm is used to find the best alignment path between Q and D. According to this algorithm, the best alignment path ends at frame b*:

b* = argmin_{b ∈ 1,...,m} M_{n,b}    (6)

Then, it is possible to backtrack the whole alignment path that starts at frame a*.
A score must be assigned to each detection of a query Q in a document D in order to measure how likely the query is present in the document. In this system, the document is length-normalised by dividing the cumulative cost by the length of the warping path [26] and z-norm is applied afterwards [27].
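The following sketch implements the search described by Equations (2)-(6) on top of Gaussian posteriorgrams (which can be obtained, for instance, from a fitted GMM's per-frame posterior probabilities). It is a compact illustration with our own naming, not the authors' code, and the final score is simplified: it normalizes by the query length rather than by the warping-path length followed by z-norm.

```python
# Subsequence-DTW search with the normalized log-cosine cost of Eqs. (2)-(6).
# query, doc: (n, G) and (m, G) arrays of Gaussian posteriorgrams.
import numpy as np

def sdtw_search(query, doc, eps=1e-10):
    n, m = len(query), len(doc)
    # Eq. (4): log cosine distance between every query/document frame pair.
    qn = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    dn = doc / (np.linalg.norm(doc, axis=1, keepdims=True) + eps)
    cost = -np.log(np.clip(qn @ dn.T, eps, None))            # (n, m)
    # Eq. (5): normalize each query-frame row to [0, 1].
    cmin = cost.min(axis=1, keepdims=True)
    cmax = cost.max(axis=1, keepdims=True)
    c = (cost - cmin) / (cmax - cmin + eps)
    # Eq. (2): cumulative cost matrix of subsequence DTW.
    M = np.zeros((n, m))
    M[0, :] = c[0, :]                                        # the query may start anywhere
    for i in range(1, n):
        M[i, 0] = c[i, 0] + M[i - 1, 0]
        for j in range(1, m):
            M[i, j] = c[i, j] + min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])  # Eq. (3)
    # Eq. (6): best end frame, plus a rough detection score
    # (cumulative cost divided by the query length; the paper uses the path length and z-norm).
    b_star = int(np.argmin(M[n - 1, :]))
    score = -M[n - 1, b_star] / n
    return b_star, score
```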
3. Experimental framework
The evaluation framework used in this paper is that of the QbESDR task of Albayzin 2014 search on speech evaluation. It consists of a set of spoken documents extracted from TV broadcast news in Basque language under diverse background conditions [28]. The queries were recorded in an office environment, which serves to simulate a regular user querying a retrieval system via speech. Each query includes a basic and two additional examples from different speakers; in these experiments, only the basic example is used. Two different sets of queries are included in the dataset: development (dev) queries for parameter tuning and evaluation (eval) queries to assess system performance. Table 1 summarizes some statistics of the database.

Table 1: Summary of the experimental framework used in this paper.

                              Duration
Data          # recordings   Total        Min      Max      # hits
Documents     1841           3 h 11 min   3.00 s   30.12 s  -
dev queries   100            2 min 51 s   1.35 s   2.29 s   772
eval queries  100            2 min 52 s   1.31 s   2.25 s   855

The evaluation metric used in this work to assess QbESDR performance is the maximum term weighted value (MTWV) [29], in accordance with the experimental protocol defined for Albayzin 2014 search on speech evaluation. This metric was adopted instead of actual TWV in order to ignore the performance loss caused by calibration issues.

4. Experiments and results
Before presenting the experimental results, some details of the different modules of the system described in Section 2 must be mentioned. The GMMs of the gender classification system were trained using the FA sub-corpus of the Albayzin database [30], which includes around 4 hours of speech uttered by 200 different speakers (100 male, 100 female). The features used were 19 Mel-frequency cepstral coefficients (MFCCs) augmented with energy, delta and acceleration coefficients, and only voiced frames were considered. The number of mixtures of the GMMs was empirically set to 1024. The parameters fa, fb and k of the VC strategy were set to 700 Hz, 3000 Hz and 0.5, respectively, according to [17]. In the search stage, the silence intervals before and after the queries were automatically removed using the voice activity detection approach described in [31]. The number of Gaussians G of the GMM used for Gaussian posteriorgram computation was empirically set to 128. It must be noted that the GMM of each experiment was trained with the features extracted from its corresponding documents, so they are different for each experiment.
The first experiment aimed at comparing system performance when extracting MCEP features from the converted waveforms (Synthesized) and when straightforwardly using the converted MCEP features (Converted). As shown in Figure 5, using the converted features leads to clearly better results for dev queries. In addition, experiments were run with different values of |α| in order to analyze the influence of this parameter in QbESDR results. The figure shows that the best results were obtained when converting male utterances to female voices with |α| = π/30. The worst performance was achieved with |α| = π/12 since such a coarse conversion leads to more distorted speech according to [17]. Results with |α| = π/36 are worse than those obtained with |α| = π/30 because the
conversion in this case is so subtle that it barely compensates the gender variability [17].

Figure 5: Results on the dev queries when using parameters extracted from converted waveforms (Synthesized) and when using the converted parameters (Converted) for female (F) and male (M) target genders.

The experimental results on the dev queries were used to select the best target gender and |α| for the Synthesized and Converted systems: the best |α| was π/30 in both cases; the best target gender was male for the Synthesized experiment and female for the Converted experiment. Once parameter tuning was done, experiments were performed on the eval queries. Table 2 shows the results on the eval queries with the original speech (Original), with the Converted and Synthesized features, and when using an adapted version of the gender compensation strategy proposed in [16] (Queries only). The latter strategy consists in generating an opposite-gender version of each query and using the male (female) queries to search within the documents spoken by males (females). The results show that the Converted system achieves a relative improvement by 17% over the original speech. Performance was also evaluated for female queries with female documents (Fq-Fd), male queries with male documents (Mq-Md), female queries with male documents (Fq-Md) and male queries with female documents (Mq-Fd). The results show that the Converted system achieved the best performance for all the experiments. The improvement in Fq-Md and Mq-Fd experiments results from the gender variability reduction obtained with the proposed technique. Also, the results when both queries and documents were converted versions of the original ones (i.e. Mq-Md) show that the conversion is not negatively affecting the features used for speech representation. In addition, the improvement of the results when using original speech (i.e. Fq-Fd) suggests that the GMM trained with both original and converted features leads to more robust Gaussian posteriorgrams. The table shows degraded performance of the Synthesized system in all the experimental conditions caused by the quality of the features extracted from the resynthesized recordings. The Queries only system exhibits almost the same performance as the Original one.

Table 2: Eval results for all documents and queries and for different combinations of male (M) and female (F) documents (d) and queries (q) with different systems. Conversion parameters were tuned on dev queries.

                              MTWV
System        All     Fq-Fd   Mq-Md   Fq-Md   Mq-Fd
Original      0.1659  0.2753  0.1623  0.1354  0.1969
Converted     0.1937  0.2958  0.1958  0.2013  0.2249
Synthesized   0.0835  0.2042  0.0280  0.1672  0.0272
Queries only  0.1684  0.2753  0.1623  0.1490  0.1841

Further experiments were performed in order to find out whether the variability of the recordings is reduced using the proposed technique. For this purpose, an experiment was run inspired by a state-of-the-art speaker verification technique: i-vectors were extracted from the Original and Proposed spoken documents, and PLDA scoring [32] of all the pairs of documents was computed. This resulted in a mean score of -5.34 and standard deviation of 11.45 for Original documents, and a mean score of 0.84 and standard deviation of 9.71 for Proposed documents. This suggests that the speakers of the documents are more similar to each other when applying the proposed gender variability compensation strategy.

5. Conclusions and future work
This paper analyzed the effect of reducing gender variability via voice conversion for QbESDR. Given a target gender, all the spoken documents and queries of the opposite gender are converted in order to make them sound as spoken by the target gender. In addition, in order to alleviate the negative effects of resynthesis, the converted MCEP features were straightforwardly used to obtain Gaussian posteriorgrams of the queries and documents to perform QbESDR using a DTW-based approach. The experimental framework of the QbESDR task in Albayzin 2014 search on speech evaluation was used for validation, which consisted in a set of documents and queries in Basque language. The experiments showed a relative improvement by 17% with the proposed technique compared to the QbESDR results with the original documents and queries.
QbESDR results were analyzed according to the gender of the documents and queries, and the proposed method showed an improvement both in original and converted utterances. This suggests that a GMM trained with both original and converted features leads to more robust Gaussian posteriorgrams. In future work, the use of voice conversion for data augmentation in this scenario will be explored.
The experimental validation showed a clear improvement when using converted features compared to extracting new features from the converted waveforms, which can be caused by the distortion introduced by the vocoder. Recent strategies for speech generation such as WaveNet have emerged in the voice conversion field, and it would be interesting to analyze the effect of different vocoders for QbESDR in the future. Also, the use of deep learning techniques for noise robust voice conversion will be assessed.

6. Acknowledgements
This work has received financial support from i) "Ministerio de Economía y Competitividad" of the Government of Spain and the European Regional Development Fund (ERDF) under the research projects TIN2015-64282-R and TEC2015-65345-P, ii) Xunta de Galicia (projects GPC ED431B 2016/035 and GRC 2014/024), and iii) Xunta de Galicia - "Consellería de Cultura, Educación e Ordenación Universitaria" and the ERDF through the 2016-2019 accreditations ED431G/01 ("Centro singular de investigación de Galicia") and ED431G/04 ("Agrupación estratéxica consolidada").
7. References
[1] F. Metze, N. Rajput, X. Anguera, M. Davel, G. Gravier, C. V. Heerden, G. Mantena, A. Muscariello, K. Pradhallad, I. Szöke, and J. Tejedor, "The spoken web search task at MediaEval 2011," in Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5165–5168.
[2] F. Metze, E. Barnard, M. Davel, C. V. Heerden, X. Anguera, G. Gravier, and N. Rajput, "The spoken web search task," in Proceedings of the MediaEval 2012 Workshop, 2012.
[3] X. Anguera, F. Metze, A. Buzo, I. Szöke, and L. Rodriguez-Fuentes, "The spoken web search task," in Proceedings of the MediaEval 2013 Workshop, 2013.
[4] J. Tejedor, D. Toledano, P. Lopez-Otero, P. Docio-Fernandez, and C. Garcia-Mateo, "Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2016, no. 1, pp. 1–19, 2016.
[5] X. Anguera, L. Rodriguez-Fuentes, I. Szöke, A. Buzo, and F. Metze, "Query by example search on speech at MediaEval 2014," in Proceedings of the MediaEval 2014 Workshop, 2014.
[6] I. Szöke, L. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proença, M. Lojka, and X. Xiong, "Query by example search on speech at MediaEval 2015," in Proceedings of the MediaEval 2015 Workshop, 2015.
[7] J. Tejedor, D. Toledano, P. Lopez-Otero, P. Docio-Fernandez, J. Proença, F. Perdigão, F. García-Granada, E. Sanchis, A. Pompili, and A. Abad, "ALBAYZIN query-by-example spoken term detection 2016 evaluation," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, no. 2, pp. 1–25, 2018.
[8] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[9] T. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2009, pp. 421–426.
[10] L. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez, "High-performance query-by-example spoken term detection on the SWS 2013 evaluation," in Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7869–7873.
[11] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, "Phonetic unit selection for cross-lingual query-by-example spoken term detection," in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, 2015, pp. 223–229.
[12] Y. Zhang and J. Glass, "Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams," in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2009, pp. 398–403.
[13] G. Mantena, S. Achanta, and K. Prahallad, "Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 5, pp. 944–953, 2014.
[14] M. Madhavi and H. Patil, "VTLN-warped Gaussian posteriorgram for QbE-STD," in Proceedings of EUSIPCO, 2017, pp. 593–597.
[15] M. Madhavi and H. Patil, "Combining evidences from detection sources for query-by-example spoken term detection," in Proceedings of APSIPA ASC, 2017, pp. 563–568.
[16] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, "Compensating gender variability in query-by-example search on speech using voice conversion," in Proceedings of Interspeech, 2017, pp. 2909–2913.
[17] C. Magariños, P. Lopez-Otero, L. Docio-Fernandez, D. Erro, E. Banga, and C. Garcia-Mateo, "Piecewise linear definition of transformation functions for speaker de-identification," in Proceedings of First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE), 2016, pp. 1–5.
[18] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The Voice Conversion Challenge 2018: promoting development of parallel and nonparallel methods," in Proceedings of Odyssey, 2018, pp. 195–202.
[19] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[20] J. Hillenbrand and M. Clark, "The role of f0 and formant frequencies in distinguishing the voices of men and women," Attention, Perception, & Psychophysics, vol. 71, no. 5, pp. 1150–1166, 2009.
[21] F. Bahmaninezhad, C. Zhang, and J. Hansen, "Convolutional neural network based speaker de-identification," in Proceedings of Odyssey, 2018, pp. 255–260.
[22] T. Zorila, D. Erro, and I. Hernaez, "Improving the quality of standard GMM-based voice conversion systems by considering physically motivated linear transformations," Communications in Computer and Information Science (ISSN: 1865-0929), vol. 328, pp. 30–39, 2012.
[23] D. Erro, A. Alonso, L. Serrano, E. Navas, and I. Hernáez, "Interpretable parametric voice conversion functions based on Gaussian mixture models and constrained transformations," Computer Speech and Language, vol. 30, no. 1, pp. 3–15, 2015.
[24] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, "Finding relevant features for zero-resource query-by-example search on speech," Speech Communication, vol. 84, pp. 24–35, 2016.
[25] M. Müller, Information Retrieval for Music and Motion. Springer-Verlag, 2007.
[26] A. Abad, R. Astudillo, and I. Trancoso, "The L2F spoken web search system for MediaEval 2013," in Proceedings of the MediaEval 2013 Workshop, 2013.
[27] I. Szöke, L. Burget, F. Grézl, J. Černocký, and L. Ondel, "Calibration and fusion of query-by-example systems - BUT SWS 2013," in Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7899–7903.
[28] J. Tejedor, D. Toledano, L. Rodriguez-Fuentes, M. Penagarikano, A. Varona, M. Diez, and G. Bordel, "The ALBAYZIN 2014 search on speech evaluation plan," 2014. [Online]. Available: http://iberspeech2014.ulpgc.es/images/EvaluationPlanSearchonSpeech.pdf
[29] J. Fiscus, J. Ajot, J. Garofolo, and G. Doddington, "Results of the 2006 spoken term detection evaluation," in Proceedings of the ACM SIGIR Workshop "Searching Spontaneous Conversational Speech", 2007, pp. 51–56.
[30] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri, J. Mariño, and C. Nadeu, "Albayzin speech database: design of the phonetic corpus," in EUROSPEECH, vol. 1, 1993, pp. 175–178.
[31] S. Basu, "A linked-HMM model for robust voicing and speech detection," in Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, 2003, pp. 816–819.
[32] D. Garcia-Romero and C. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proceedings of Interspeech, 2011, pp. 249–252.


A Recurrent Neural Network Approach to Audio Segmentation for Broadcast Domain Data
Pablo Gimeno, Ignacio Viñals, Alfonso Ortega, Antonio Miguel, Eduardo Lleida
ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{pablogj, ivinalsb, ortega, amiguel, lleida}@unizar.es

Abstract
This paper presents a new approach for automatic audio segmentation based on Recurrent Neural Networks. Our system takes advantage of the capability of Bidirectional Long Short Term Memory Networks (BLSTM) for modeling temporal dynamics of the input signals. The DNN is complemented by a resegmentation module, gaining long-term stability by means of the tied-state concept in Hidden Markov Models. Furthermore, feature exploration has been performed to best represent the information in the input data. The acoustic features that have been included are spectral log-filter-bank energies and musical features such as chroma. This new approach has been evaluated with the Albayzín 2010 audio segmentation evaluation dataset. The evaluation requires differentiating five audio conditions: music, speech, speech with music, speech with noise and others. Competitive results were obtained, achieving a relative improvement of 15.75% compared to the best results found in the literature for this database.
Index Terms: audio segmentation, recurrent neural networks, LSTM, broadcast data

1. Introduction
Due to the increase of multimedia content and the generation of large audiovisual repositories, the need for automatic systems that can analyze, index and retrieve information in a fast and accurate way is becoming more and more important. Given an audio signal, the goal of audio segmentation is to obtain a set of labels in order to separate that signal into homogeneous regions and classify them into a predefined set of classes, e.g., speech, music or noise. This task is also important in other applications of speech technologies, such as automatic speech recognition (ASR) or speaker diarization, where an accurate labeling of audio signals can improve the performance of these systems in real-world environments.
Audio segmentation systems can be divided into two main groups depending on how the segmentation is performed: segmentation-and-classification systems and segmentation-by-classification systems.
• Segmentation-and-classification: these systems perform the segmentation task in two steps. First, boundaries that separate segments belonging to different classes are detected using a distance metric. Then the system classifies each delimited segment in a second step. Several distance metrics have been proposed in the literature: the Bayesian Information Criterion (BIC) [1] [2], the Generalized Likelihood Ratio (GLR) [3], or the Kullback-Leibler (KL) distance [4] are some examples.
• Segmentation-by-classification: in this group of systems the segmentation is produced directly as a sequence of decisions over the input signal. A set of well-known machine learning classification techniques have been used in this task with good results, such as Gaussian Mixture Models (GMM) [5], Neural Networks (NN) [6] [7], Support Vector Machines (SVM) [8], or decision trees [9]. A factor analysis approach is proposed in [10], where a compensation matrix is computed for each class.
In both approaches to audio segmentation it is usual that the original segmentation boundaries are refined by a resegmentation model. In this way, sudden changes in the labels can be prevented. Some resegmentation strategies rely on hidden Markov models [11] or smoothing filters [12].
The segmentation task is especially challenging when dealing with broadcast domain content because such documents contain different audio sequences with a very heterogeneous style. Different speech conditions and domains can be found, from telephonic quality to studio recordings to outdoor speech with different noises overlapped. Background music and a variety of acoustic noise effects are likely to appear as well. In this context, the technological evaluations Albayzín incorporated a segmentation task for broadcast news environments in 2010 [13]. This is the task we are focusing on in this paper.
The remainder of the paper is organized as follows: a theoretical background on LSTM networks and their applications is introduced in Section 2. Section 3 describes our novel RNN-based segmentation system. Section 4 briefly introduces the experimental setup, describing the Albayzín 2010 evaluation dataset, and presents all the experimental results obtained with our system. Finally, a summary and the conclusions are presented in Section 5.
This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the 2015 FPI fellowship, the project TIN2017-85854-C4-1-R and Gobierno de Aragón / FEDER (research group T36 17R).

2. LSTM networks
Neural Networks are a powerful modeling tool for non-linear dependencies that, since the early 2010s, have been increasingly applied to speech technologies [14] [15] [16]. One of the main disadvantages of traditional feed-forward neural networks when dealing with temporal series of information is that they process every example independently. However, Recurrent Neural Networks (RNNs) are able to capture temporal dependencies, introducing feedback loops between the input and the output of the neural network. Long short-term memory (LSTM) [17] networks are a special kind of RNN with the concept of memory cell. This cell is able to learn, retain and forget information [18] in long dependencies.

This capability becomes very useful to carry out long and short term analysis simultaneously. LSTM networks have been modified combining two of them in a Bidirectional LSTM (BLSTM) network. One processes the sequence in the forward direction while the other one processes the sequence backwards. This way the network is able to model causal and anticausal dependencies for the same sequence.
LSTM and BLSTM networks have been successfully applied to sequence modeling tasks in speech technologies such as ASR [19] [20], language modeling [21], speaker verification in text-dependent systems [22] or machine translation [23].

3. System description
Our proposed system is based on the use of RNNs to classify each frame in the audio signal. We have opted for a segmentation-by-classification solution, combining the RNN with a resegmentation module to get smoothed segmentation hypotheses. The three different blocks of our segmentation system (feature extraction, an RNN based classifier, and the final resegmentation module) are described below.
3.1. Feature extraction
The main features for the neural network consist of log Mel filter bank energies and the log-energy of each frame. Additionally, we combine Mel features with chroma features [24], due to their capability to capture melodic and harmonic information in music while being robust to changes in tone or instrumentation. All these features are computed every 10 ms using a 25 ms window. In order to take into account the dynamic information of the audio signal, first and second order derivatives of the features are computed. Finally, feature mean and variance normalization is applied.
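A front-end along these lines can be prototyped with librosa, as sketched below. This is an illustrative approximation rather than the authors' exact pipeline: the number of Mel bands is a parameter explored in Section 4, and the normalization here is applied per utterance.

```python
# Sketch of the front-end: log Mel filter bank + log-energy + chroma,
# 25 ms windows every 10 ms, with deltas and per-utterance normalization.
import numpy as np
import librosa

def extract_features(wav_path, n_mels=80, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                           # log Mel energies
    loge = librosa.power_to_db(mel.sum(axis=0, keepdims=True))  # frame log-energy
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    feats = np.vstack([logmel, loge, chroma])
    feats = np.vstack([feats,                                   # add dynamic information
                       librosa.feature.delta(feats),
                       librosa.feature.delta(feats, order=2)])
    # Mean and variance normalization over time.
    feats = (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + 1e-8)
    return feats.T                                              # (n_frames, n_features)
```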
3.2. Recurrent Neural Network
This is the core of our segmentation system. Its task is to classify each audio frame as belonging to one of the predefined acoustic classes. The neural architecture proposed can be observed in Fig. 1. As shown, it is composed of two stacked BLSTM layers with 256 neurons each. The outputs of the last BLSTM layer are then independently classified by a linear perceptron, which shares its values (weights and bias) for all time steps. Training and evaluation are performed with limited length sequences (3 seconds, 300 frames), limiting the delay of the dependencies taken into account. However, the neural network emits a segmentation label for every frame processed at the input (every 10 ms, in our case).

Figure 1: BLSTM based neural architecture description. Xi represents the input features for frame i. Segi is the segmentation label for frame i.

The neural network has been trained using exclusively the train subset of the Albayzín 2010 database [13], which consists of 58 hours of audio sampled at 16 kHz. However, 15% of the train subset is reserved for validation, which makes a total of 49 hours of audio for training and 9 hours for validation. The Adaptive Moment Estimation (Adam) optimizer is chosen due to its fast convergence properties [25]. Data are shuffled in each training iteration, seeking to improve model generalization capabilities.
Despite our system consisting of an additional resegmentation module, the RNN itself is able to emit a segmentation hypothesis. This way, we can say that the first two blocks of our system can fully work as a segmentation system. All the neural architectures in this paper have been evaluated using the PyTorch toolkit [26].
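The architecture of Fig. 1 can be expressed compactly in PyTorch, the toolkit used by the authors. The snippet below is a minimal sketch with our own naming; training details such as the Adam optimizer, the 3-second chunking and the data shuffling are left to the caller.

```python
# Minimal PyTorch sketch of the 2 x BLSTM(256) + per-frame linear classifier of Fig. 1.
import torch
import torch.nn as nn

class BLSTMSegmenter(nn.Module):
    def __init__(self, n_features, n_classes, hidden=256):
        super().__init__()
        # Two stacked bidirectional LSTM layers with 256 units each.
        self.blstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                             num_layers=2, bidirectional=True, batch_first=True)
        # Linear layer shared across time steps, one output per acoustic class.
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, 300 frames, n_features)
        h, _ = self.blstm(x)               # (batch, frames, 2 * hidden)
        return self.classifier(h)          # per-frame class scores

# Example: 5 acoustic classes and an assumed feature dimension
# (e.g. 80 log Mel bands + log-energy + 12 chroma, with deltas and delta-deltas).
model = BLSTMSegmenter(n_features=(80 + 1 + 12) * 3, n_classes=5)
scores = model(torch.randn(8, 300, (80 + 1 + 12) * 3))  # -> (8, 300, 5)
```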

3.3. Resegmentation module
RNNs have high capabilities to model temporal dependencies, but their output may contain high frequency transitions which are unlikely to occur in signals with a high temporal correlation such as human speech or music. With the objective of avoiding sudden changes in the segmentation process, we incorporate a resegmentation module in our system. In our case, it is implemented as a hidden Markov model where each acoustic class is modeled through a state in the Markov chain. Every state is represented by a multivariate Gaussian distribution with full covariance matrix. This block gets as input the pseudo log-likelihood for each class from the neural network and its corresponding segmentation hypotheses. No more a priori information is required because statistical distributions are estimated using the hypothesized labels for each file.
Input information is given every 10 ms, which can result in a noisy estimation of class boundaries. In order to reduce temporal resolution, scores are down-sampled by a factor L using an L order averaging zero-phase FIR filter to avoid distorting phase components [27]. First and second order derivatives of the scores are taken into account when computing the resegmentation. Additionally, each of the states in the Markov chain will consist of a left-to-right topology of a number Nts of tied states sharing the same statistical distribution. This way, by modifying Nts we can force the minimum length of a segment before a change in the acoustic class happens. Taking all this information into account, this minimum segment length forced by our resegmentation module can be computed as follows:

Tmin = Ts · L · Nts    (1)

where Ts is the sampling period of the neural network output (10 ms in our case), L is the down-sampling factor and Nts is the number of tied states used in the Markov model.
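As a quick numerical illustration of Equation (1); the Nts values below are inferred from the (L, Tmin) pairs reported later in Table 2 and are therefore our own reading, not figures stated in the paper.

```python
# Minimum segment length of Eq. (1): Tmin = Ts * L * Nts.
Ts = 0.010  # neural network output period (10 ms)
for L, n_ts in [(25, 5), (35, 4), (45, 2), (55, 1)]:
    print(f"L={L:2d}, Nts={n_ts} -> Tmin = {Ts * L * n_ts:.2f} s")
# -> 1.25 s, 1.40 s, 0.90 s, 0.55 s, matching the configurations listed in Table 2.
```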

4. Experimental setup and results

4.1. Database and metric description
The database consists of broadcast news audio in Catalan. The full database includes 87 hours of audio sampled at 16 kHz and divided in 24 files. The database was split into two parts: two thirds of the total amount of data are reserved for training, while the remaining third is used for testing. Five acoustic classes were defined for the evaluation. The classes are distributed as follows: 37% for clean speech (sp), 5% for music (mu), 15% for speech over music (sm), 40% for speech over noise (sn) and 3% for others (ot). The class "others" is not evaluated in the final test. A more detailed description of the Albayzín 2010 audio segmentation evaluation can be found in [13].
The main metric we will be using for evaluating our results is the Segmentation Error Rate (SER), inspired by the NIST metric for speaker diarization [28]. This metric can be interpreted as the ratio between the total length of the incorrectly labeled audio and the total length of the audio in the reference. Given the dataset to evaluate Ω, each document is divided into continuous segments and the segmentation error time for each segment n is defined as:

Ξ(n) = T(n) [max(N_ref(n), N_sys(n)) - N_correct(n)]    (2)

where T(n) is the duration of the segment n, N_ref(n) is the number of reference classes that are present in segment n, N_sys(n) is the number of system classes that are present in segment n and N_correct(n) is the number of reference classes that are present in segment n and were correctly assigned by the segmentation system. This way, the SER is computed as follows:

SER = Σ_{n∈Ω} Ξ(n) / Σ_{n∈Ω} (T(n) N_ref(n))    (3)

Alternatively, the metric originally proposed for the Albayzín 2010 evaluation will also be taken into account. This metric represents the relative error averaged over all the acoustic classes:

Error = mean_i [ (dur(miss_i) + dur(fa_i)) / dur(ref_i) ]    (4)

where dur(miss_i) is the total duration of all miss errors for the i-th acoustic class, dur(fa_i) is the total duration of all false alarm errors for the i-th acoustic class, and dur(ref_i) is the total duration of the i-th acoustic class according to the reference. A collar of ±1 s around each reference boundary is not scored in both cases, SER and average class error, to avoid uncertainty about when an acoustic class begins or ends, and to take into account inconsistent human annotations.
seen that configurations that perform better have a minimum
account inconsistent human annotations.
segment length between 0.5 and 1.5 seconds, which is in
the order of magnitude of the 2 seconds collar applied in the
evaluation.
4.2. Experimental results
For the experimental evaluation of our system, different front- The results on the test partition of the full segmentation
end configurations were assessed. The starting point of our system combining the BLTSMs and the resegmentation module
feature space exploration consists of a simple 32 log Mel fil- for the best front-end configuration evaluated (80 Mel + chroma
ter bank. Our next step was increasing the frequency resolution + derivatives) and for different values of the down-sampling
by using a higher number of analysis bands in the filter bank, factor, L, and the minimum segment length, Tmin , are shown
testing 64, 80 and 96 bands. Chroma features were incorpo- in Table 2 in terms of SER and the Albayzı́n evaluation
rated in order to help our system to discriminate classes that metric. If we compare the best result in this Table with the
contain music. Eventually, first and second order derivatives best result in Table 1, it can be seen that, by incorporating

the resegmentation module, the error is reduced significantly, getting a relative improvement of 21.68% in terms of SER. It can also be observed that, as long as the Tmin value used in the configuration stays in the order of magnitude of 1 second, the performance of the system is not highly affected by the variations in the down-sampling factor. The SER metric for the four parameter configurations evaluated goes from 12.46% to 12.57%, which makes an absolute difference of only 0.11% between the best and the worst case. This way, by forcing a certain amount of inertia in the output of the neural network, the system is able to achieve a SER of 12.46%, decreasing significantly the error when compared to the output of the RNN.

Figure 2: Relative improvement using the resegmentation module for the best feature configuration versus minimum segment length forced by the system.

Table 2: SER, error per class and average error for the BLSTM-HMM segmentation-by-classification system over the test partition for the best feature configuration and different values of the down-sampling factor, L, and minimum segment length, Tmin.

                     Class Error (%)
L, Tmin       SER    mu     sp     sm     sn     Avg
25, 1.25 s    12.49  14.55  21.99  19.08  24.88  20.13
35, 1.4 s     12.48  14.31  22.26  18.70  25.10  20.10
45, 0.9 s     12.46  14.19  22.14  18.82  25.04  20.05
55, 0.55 s    12.57  16.12  22.00  18.94  24.95  20.50

Finally, Table 3 shows the results obtained in the Albayzín 2010 database by different systems already presented in the literature. The winner team of the Albayzín 2010 evaluation proposed a segmentation-by-classification approach based on a hierarchical GMM/HMM including MFCCs, chroma and spectral entropy as input features [29]. The best result in this database so far uses a solution based on Factor Analysis combined with a Gaussian back-end and MFCCs with 1st and 2nd order derivatives as input features [10]. When compared with this result, it can be seen that our BLSTM-HMM system performs better in all the acoustic classes. This error reduction is equivalent to a relative improvement of 15.24% in terms of SER and a 15.75% in terms of the average class error. The difference in the class "speech over music" is especially significant (23.60% vs 18.82%), with a relative improvement of 20.25%.

Table 3: Results obtained on the Albayzín 2010 test partition for different systems proposed in the literature compared to our BLSTM-HMM system.

                        Class Error (%)
System           SER    mu     sp     sm     sn     Avg
Eval winner [29] 19.30  19.20  39.50  25.00  37.20  30.30
FA HMM [10]      14.70  18.80  23.70  23.60  29.10  23.80
BLSTM-HMM        12.46  14.19  22.14  18.82  25.04  20.05

5. Conclusions
A new approach for audio segmentation based on RNNs is presented in this paper, proving the capabilities of this kind of models in the audio segmentation task and achieving the best result so far in the Albayzín 2010 database. A segmentation-by-classification scheme has been followed, combining a classification system, which is mainly made of 2 BLSTM layers, with a smoothing back-end implemented through a Hidden Markov Model. Several front-end configurations were evaluated, proving the capabilities of chroma features for capturing musical structures when compared to a perceptual Mel filter bank. The combination of BLSTM and HMM has been proven to be appropriate, reducing significantly the system error by forcing a minimum segment length for the segmentation labels. Competitive results have been obtained with this new approach, resulting in a relative improvement of 15.75% when compared to the best result in the literature so far.
Regarding our contributions, the front-end configuration seems to have a big impact in this task, especially when classifying classes that contain music. Just by modifying the input features we have achieved a significant improvement in the performance of our system. Furthermore, the introduction of RNNs in the audio segmentation task has been proven to be successful, improving the results obtained so far with traditional statistical models such as GMM/HMM or factor analysis. In future work we intend to improve even more these results by introducing more complex neural architectures after the BLSTM layers.

6. Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

7. References [17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[1] S. Chen, P. Gopalakrishnan et al., “Speaker, environment and
channel change detection and clustering via the Bayesian Infor- [18] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget:
mation Criterion,” in Proc. DARPA broadcast news transcription Continual prediction with LSTM,” 1999.
and understanding workshop, vol. 8, 1998, pp. 127–132. [19] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech
[2] C.-H. Wu, Y.-H. Chiu, C.-J. Shia, and C.-Y. Lin, “Automatic enhancement and recognition using multi-task learning of long
segmentation and identification of mixed-language speech using short-term memory recurrent neural networks,” in Sixteenth An-
delta-BIC and LSA-based GMMs,” IEEE Transactions on audio, nual Conference of the International Speech Communication As-
speech, and language processing, vol. 14, no. 1, pp. 266–276, sociation, 2015.
2006. [20] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recogni-
[3] P. Delacourt and C. J. Wellekens, “DISTBIC: A speaker-based tion with deep bidirectional LSTM,” in Automatic Speech Recog-
segmentation for audio data indexing,” Speech communication, nition and Understanding (ASRU), 2013 IEEE Workshop on.
vol. 32, no. 1-2, pp. 111–126, 2000. IEEE, 2013, pp. 273–278.
[4] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, “Automatic seg- [21] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural net-
mentation, classification and clustering of broadcast news audio,” works for language modeling,” in Thirteenth annual conference
in Proc. DARPA speech recognition workshop, vol. 1997, 1997. of the international speech communication association, 2012.

[5] A. Misra, “Speech/nonspeech segmentation in web videos,” in [22] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end
Thirteenth Annual Conference of the International Speech Com- text-dependent speaker verification,” in Acoustics, Speech and
munication Association, 2012. Signal Processing (ICASSP), 2016 IEEE International Confer-
ence on. IEEE, 2016, pp. 5115–5119.
[6] H. Meinedo and J. Neto, “A stream-based audio segmentation,
classification and clustering pre-processing system for broadcast [23] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans-
news using ANN models,” in Ninth European Conference on lation by jointly learning to align and translate,” arXiv preprint
Speech Communication and Technology, 2005. arXiv:1409.0473, 2014.
[24] M. A. Bartsch and G. H. Wakefield, “To catch a chorus: Using
[7] K. J. Piczak, “Environmental sound classification with convolu-
chroma-based representations for audio thumbnailing,” in Appli-
tional neural networks,” in Machine Learning for Signal Process-
cations of Signal Processing to Audio and Acoustics, 2001 IEEE
ing (MLSP), 2015 IEEE 25th International Workshop on. IEEE,
Workshop on the. IEEE, 2001, pp. 15–18.
2015, pp. 1–6.
[25] S. Ruder, “An overview of gradient descent optimization algo-
[8] G. Richard, M. Ramona, and S. Essid, “Combined supervised
rithms,” arXiv preprint arXiv:1609.04747, 2016.
and unsupervised approaches for automatic segmentation of ra-
diophonic audio streams,” in Acoustics, Speech and Signal Pro- [26] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and
cessing, 2007. ICASSP 2007. IEEE International Conference on, Z. Devito, “Automatic differentiation in PyTorch,” Advances in
vol. 2, 2007, pp. II–461. Neural Information Processing Systems 30, pp. 1–4, 2017.
[9] Y. Patsis and W. Verhelst, “A [27] F. Gustafsson, “Determining the initial states in forward-
speech/music/silence/garbage/classifier for searching and backward filtering,” IEEE Transactions on Signal Processing,
indexing broadcast news material,” in Database and Expert Sys- vol. 44, no. 4, pp. 988–992, 1996.
tems Application, 2008. DEXA’08. 19th International Workshop [28] NIST, “The 2009 (RT-09) Rich Transcription Meeting Recogni-
on, 2008, pp. 585–589. tion Evaluation Plan,” (Melbourne 28-29 May 2009).
[10] D. Castán, A. Ortega, A. Miguel, and E. Lleida, “Audio [29] A. Gallardo Antolı́n and R. San Segundo Hernández, “UPM-
segmentation-by-classification approach based on factor analysis UC3M system for music and speech segmentation,” 2010.
in broadcast news domain,” EURASIP Journal on Audio, Speech,
and Music Processing, vol. 2014, no. 1, p. 34, 2014.
[11] J. Ajmera, I. McCowan, and H. Bourlard, “Speech/music segmen-
tation using entropy and dynamism features in a hmm classifica-
tion framework,” Speech communication, vol. 40, no. 3, pp. 351–
363, 2003.
[12] L. Lu, H. Jiang, and H. Zhang, “A robust audio classification and
segmentation method,” in Proceedings of the ninth ACM interna-
tional conference on Multimedia. ACM, 2001, pp. 203–211.
[13] T. Butko and C. Nadeu, “Audio segmentation of broadcast news in
the Albayzin-2010 evaluation: overview, results, and discussion,”
EURASIP Journal on Audio, Speech, and Music Processing, vol.
2011, no. 1, p. 1, 2011.
[14] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly,
V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep
neural networks for acoustic modeling in speech recognition,”
IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97,
2012.
[15] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neu-
ral network learning for speech recognition and related applica-
tions: An overview,” in Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on. IEEE, 2013,
pp. 8599–8603.
[16] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent
pre-trained deep neural networks for large-vocabulary speech
recognition,” IEEE Transactions on audio, speech, and language
processing, vol. 20, no. 1, pp. 30–42, 2012.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Improving Transcription of Manuscripts with Multimodality and Interaction


Emilio Granell, Carlos-D. Martínez-Hinarejos, Verónica Romero

Pattern Recognition and Human Language Technology Research Center,
Universitat Politècnica de València, Camí de Vera s/n, 46022, València, Spain
{egranell,cmartine,vromero}@dsic.upv.es

Abstract

State-of-the-art Natural Language Recognition systems allow transcribers to speed up the transcription of audio, video or image documents. These systems provide transcribers with an initial draft transcription that can be corrected with less effort than transcribing the documents from scratch. However, even the drafts offered by the most advanced systems based on Deep Learning contain errors. Therefore, the supervision of those drafts by a human transcriber is still necessary to obtain the correct transcription. This supervision can be eased by using interactive and assistive transcription systems, where the transcriber and the automatic system cooperate in the amending process. Moreover, the interactive system can combine different sources of information in order to improve its performance, such as text line images and the dictation of their textual contents.

In this paper, the performance of a multimodal interactive and assistive transcription system is evaluated on one Spanish historical manuscript. Although the quality of the draft transcriptions provided by a Handwriting Text Recognition system based on Deep Learning is pretty good, the proposed interactive and assistive approach reveals an additional reduction of transcription effort. Besides, this effort reduction is increased when using speech dictations over an Automatic Speech Recognition system, allowing for a faster transcription process.

Index Terms: speech recognition, human-computer interaction, handwriting recognition, assistive transcription, deep learning

1. Introduction

Many documents used every day include handwritten text and, in many cases, such as for detecting fraudulent bank checks [1], it would be interesting to recognise these text images automatically. However, state-of-the-art handwritten text recognition (HTR) systems cannot suppress the need for human supervision when high-quality transcriptions are needed [2, 3, 4, 5].

A way of taking advantage of the HTR system is to combine it with the knowledge of a human transcriber, constituting the so-called Computer Assisted Transcription of Text Images (CATTI) scenario [6]. In this framework, the HTR system and the transcriber cooperate interactively to obtain the perfect transcription of the text line images. At each interaction step, the system uses the text line image and a previously validated part (prefix) of its transcription to propose an improved output. Then, the user finds and corrects the next system error, thereby providing a longer prefix which the system uses to suggest a new, hopefully better continuation. Moreover, the accuracy of the interactive system can be improved by providing it with additional sources of information, such as the speech dictation of the handwritten text over an automatic speech recognition (ASR) system.

In previous related works, multimodal combination was used to integrate the transcriber feedback into the main stream of information for word correction, by using on-line handwriting and speech [7, 8, 9]. In this work, we explore the idea of using speech dictation for feeding the interactive system with an additional source of information of the full text to transcribe.

The rest of the paper is organised as follows: Section 2 presents our multimodal proposal; Section 3 introduces the experimental framework; Section 4 explains the performed experiments and the obtained results; finally, Section 5 offers the conclusions and future work lines.

2. Multimodal Computer Assisted Transcription of Text Images

This section presents our proposal, which is composed of two parts: multimodal recognition and interaction.

2.1. Multimodal Recognition Framework

The natural language recognition problem aims to recover the text represented in an input signal. In the case of HTR, this input signal is usually a segmented line of a digitalised handwritten document [10]. Then, given a handwritten text line image or a speech signal represented by a feature vector sequence x = (x_1, x_2, ..., x_|x|), the problem for HTR and ASR is finding the most likely word sequence ŵ [2], that is:

    ŵ = arg max_{w ∈ W} P(w | x) = arg max_{w ∈ W} P(x | w) P(w)    (1)

where W denotes the set of all permissible sentences, P(w) is the probability of w = (w_1, w_2, ..., w_|w|) approximated by the language model (usually an n-gram) [11], and P(x | w) is the probability of observing x by assuming that w is the underlying word sequence for x, evaluated by the optical or acoustical models for HTR and ASR, respectively. For both modalities, the state-of-the-art morphological models are based on deep neural networks [3, 12].

In this work, the search (or decoding) of ŵ, for both modalities, was performed by using the EESEN decoding method [13], which is based on Weighted Finite State Transducers (WFST). The main reason for using this decoding method is that it allows obtaining not only a single best hypothesis, but also a huge set of best hypotheses compactly represented in a unimodal word lattice, in systems with morphological models based on neural networks.

To obtain the multimodal final output lattice, the lattices generated by the unimodal systems can be combined by removing the total cost of all paths from the unimodal lattices and by doing a union of the reweighted lattices [14]. The architecture of our multimodal recognition framework is presented in Figure 1, which shows that the output is a word lattice (more details in Section 3.3).

10.21437/IberSPEECH.2018-20
Figure 1: Multimodal (handwriting and speech) recognition system architecture. [Diagram omitted. Legend blocks: Preprocess, Convolutional block, Columnwise Concat, Recurrent block (BLSTM), DepthConcat, Linear + Softmax, WFST decoding and lattice generation constrained to a language model, and Lattice combination. The example output is a word lattice with weighted alternatives such as "Agonía"/"Alegría" and "Hogar"/"Hostal"/"Hospital" for the dictated line.]
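The lattice combination described in Section 2.1 can be pictured on simplified n-best lists instead of full word lattices (the actual system works on WFST lattices, e.g. with Kaldi tools). In this hedged Python sketch, each unimodal output is a list of (hypothesis, log-score) pairs, the total path score is removed by log-sum-exp normalisation, and the reweighted sets are merged by union; the hypotheses and scores are made up for illustration.

import math
from collections import defaultdict

def normalize(nbest):
    # Subtract the total (log-sum-exp) score, i.e. remove the total cost of
    # all paths so that each hypothesis carries a proper posterior weight.
    m = max(score for _, score in nbest)
    total = m + math.log(sum(math.exp(score - m) for _, score in nbest))
    return [(hyp, score - total) for hyp, score in nbest]

def combine(htr_nbest, asr_nbest):
    # Union of the reweighted hypothesis sets; hypotheses shared by both
    # modalities accumulate probability mass from the two sources.
    merged = defaultdict(float)
    for hyp, logp in normalize(htr_nbest) + normalize(asr_nbest):
        merged[hyp] += math.exp(logp)
    return sorted(merged.items(), key=lambda kv: -kv[1])

htr = [("Cristo de la Agonia, procedente del Hostal: esta", -1.2),
       ("Cristo de la Agonia, procedente del Hospital: esta", -1.9)]
asr = [("Cristo de la Alegria, procedente del Hospital: esta", -0.8),
       ("Cristo de la Agonia, procedente del Hospital: esta", -1.1)]
print(combine(htr, asr)[0][0])  # the hypothesis supported by both modalities wins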

2.2. Multimodal and Interactive Framework

In the CATTI framework the transcriber is involved in the transcription process, since he/she is responsible for validating and/or correcting the system hypothesis. The system takes into account the handwritten text image and the transcriber feedback in order to improve the proposed hypotheses [6]. An example of a CATTI operation is shown in Figure 2. In this example, in the traditional post-edition approach, a transcriber would have to correct about two errors in the recognised hypothesis (Agonía and Hospital). However, using our interactive approach only one explicit user correction is necessary to get the correct transcription.

Formally, in the traditional CATTI framework [6], the system uses a given feature sequence, x_htr, representing a handwritten text line image and a user-validated prefix p of the transcription. In this work, in addition to x_htr, a sequence of feature vectors x_asr, which represents the speech dictation of the textual contents of the text line image, is used to improve the system performance. Therefore, the CATTI system should try to complete the validated prefix by searching for the most likely suffix ŝ taking into account both sequences of feature vectors. Following the assumptions presented in [15], the CATTI problem can be formulated as:

    ŝ = arg max_s P(x_htr | p, s) · P(x_asr | p, s) · P(s | p)    (2)

where the concatenation of p and s is w. As in conventional HTR and ASR, P(x_htr | p, s) and P(x_asr | p, s) can be approximated by morphological models and P(s | p) by a language model conditioned by p. Therefore, the search must be performed over all possible suffixes of p [6].

This suffix search can be efficiently carried out by using lattices [6] obtained from the combination of the HTR and ASR recognition outputs. In each interaction step, the decoder parses the validated prefix p over the lattice and then continues searching for a suffix which maximises the posterior probability according to Equation (2). This process is repeated until a complete and correct transcription of the input text line image is obtained.

  ITE-0    ŝ    Cristo de la Alegría, procedente del Hogar: esta
  ITE-1    m    ⇑
           p    Cristo de la
           ŝ    Agonía, procedente del Hogar: esta
           m    ⇑
           p    Cristo de la Agonía, procedente del
  ITE-2    ŝ    Hostal: esta
           v    Hospital
           p    Cristo de la Agonía, procedente del Hospital
           ŝ    : esta
  FINAL    v    #
           p≡t  Cristo de la Agonía, procedente del Hospital: esta

Figure 2: Example of CATTI operation using mouse-actions (MA), for a handwritten line image and its speech dictation. Starting with an initial recognised hypothesis ŝ from the combination of both modalities, the user validates its longest well-recognised prefix p, making a MA m (that is, positioning the cursor in the place where the error is), and the system emits a new recognised hypothesis ŝ. As the new hypothesis corrects the erroneous word, a new cycle starts. After the new validation, the system provides a new suffix ŝ that does not correct the mistake; thus, the user types the correct word v, generating a new validated prefix p that is used to suggest a new hypothesis ŝ. This process is repeated until the final error-free transcription t is obtained. The underlined boldface word in the final transcription is the only one which was corrected by the user.

3. Experimental Framework

This section presents the datasets, the preprocess, the models, the system setup, and the evaluation metrics used in the experiments.

3.1. Datasets

The datasets used in this work correspond to a Spanish historical manuscript, a Spanish phonetic corpus, and a set of speech samples provided by five different native Spanish speakers.

3.1.1. Historical Manuscript: The Cristo Salvador Corpus

The Cristo Salvador corpus is a 19th century Spanish manuscript provided by Biblioteca Valenciana Digital (BiValDi), and it is publicly available for research purposes on the website of the Pattern Recognition and Human Language Technology (PRHLT) research center (https://www.prhlt.upv.es). It is a single-writer book composed of 53 pages (page 41 is presented in Figure 3) that were manually divided into lines (such as the line shown at the top of Figure 4). This corpus presents some problematic image features, such as smear, background variations, differences in brightness, and bleed-through (ink that trespasses to the other surface of the sheet).

Figure 3: Page 41 of the Cristo Salvador corpus. [Image omitted.]

We followed the directives of the hard partition defined in previous works [16, 17]. The first 30 pages (662 text lines) were used for training the optical and language models, while the following 3 pages (78 text lines) were used for validation purposes. The test set was composed of the lines of page 41 (24 lines, 222 words); this page was selected for being, according to preliminary error recognition results, a representative page of the whole test set (the remaining 20 pages, 473 lines). This corpus contains 1213 lines, with a vocabulary of 3451 different words, and a set of 92 different characters, taking into account lowercase and uppercase letters, numbers, punctuation marks, special symbols, and blank spaces.

3.1.2. Speech Dataset: Albayzin and Cristo Salvador

The Spanish phonetic corpus Albayzin [18] was used for training the ASR acoustical models. This corpus consists of a set of three sub-corpora recorded by 304 speakers using a sampling rate of 16 KHz and a 16-bit quantisation. The training partition used in this work includes a set of 6800 phonetically balanced utterances, specifically, 200 utterances read by four speakers, 25 utterances read by 160 speakers, and 50 sentences read by 40 speakers, with a total duration of about 6 hours. A set of 25 acoustical classes, 23 monophones, short silence, and long silence, was estimated from this corpus.

Test data for ASR was the product of the acquisition of the dictation of the contents of the lines of page 41 by five different native Spanish speakers (i.e., a total set of 120 utterances, with a total duration of about 9 minutes) using a sample rate of 16 KHz and an encoding of 16 bits (to match the conditions of the Albayzin data).

Figure 4: Examples of an extracted text line image (top) and the result of the preprocess given to the neural network (bottom). [Images omitted.]

3.2. Preprocess and Feature Extraction

All the text line images were scaled to 64 pixels in height and a pre-processing was applied for correcting the slant and removing the background noise [19]. A text line image and the resulting image after the image preprocess are presented in Figure 4.

With respect to speech feature extraction, 39 Mel-Frequency Cepstral Coefficients, composed of the first 12 cepstrals and log frame energy with first and second order derivatives, were extracted from the audio files [20].
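As a rough illustration of this 39-dimensional parameterisation (12 cepstral coefficients plus a log-energy term, with first and second order derivatives), a minimal sketch with librosa could look as follows; the 25 ms window and 10 ms hop are assumptions, since the exact analysis parameters are not stated here.

import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    # 13 base coefficients + deltas + delta-deltas = 39 dimensions per frame.
    y, sr = librosa.load(wav_path, sr=sr)
    # n_mfcc=13: coefficient 0 acts as an overall (log-)energy term,
    # the remaining 12 are the cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # shape: (frames, 39)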

3.3. Models

Optical models are Convolutional Recurrent Neural Networks (CRNN), which consist of a convolutional block and a recurrent block [21]. The convolutional block is composed of 3 convolutional layers of 16, 32, and 48 feature maps. Each convolutional layer has kernel sizes of 3 × 3 pixels, horizontal and vertical strides of 1 pixel, LeakyReLU as activation function, and a maximum pooling layer with non-overlapping kernels of 2 × 2 pixels only at the output of the first two layers. Then, the recurrent block is composed of 3 recurrent layers. Each recurrent layer is composed of 256 Bidirectional Long Short-Term Memory (BLSTM) units. Finally, a linear fully-connected output layer is used after the recurrent block. Those models were trained using Laia [22].

Acoustical models were trained using EESEN [13]. This acoustical model is a Recurrent Neural Network (RNN) composed of 351 inputs for 9 neighbouring frames of cepstral features, 6 hidden layers with 250 BLSTM units, and an output layer with a softmax function [12].
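For illustration, the optical CRNN just described could be sketched in PyTorch roughly as below; the number of output classes (assumed here to be the 92 corpus characters plus a blank) and the flattening of the reduced image height into the recurrent input are assumptions of this sketch, not details taken from the paper.

import torch
import torch.nn as nn

class OpticalCRNN(nn.Module):
    # Sketch of a CRNN for 64-pixel-high, single-channel text line images.
    def __init__(self, num_classes=93):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=1, padding=1), nn.LeakyReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, stride=1, padding=1), nn.LeakyReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 48, 3, stride=1, padding=1), nn.LeakyReLU(),
        )
        # After two 2x2 poolings the 64-pixel height becomes 16; each image
        # column is flattened into a 48*16 vector for the recurrent block.
        self.blstm = nn.LSTM(input_size=48 * 16, hidden_size=256, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 256, num_classes)

    def forward(self, x):                      # x: (batch, 1, 64, width)
        f = self.conv(x)                       # (batch, 48, 16, width/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, width/4, 48*16)
        h, _ = self.blstm(f)
        return self.out(h)                     # per-column character scores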

The lexicon models for both modalities are in HTK lexicon format, where each word is modelled as a concatenation of characters for HTR or phonemes for ASR.

The particularities of historical manuscripts, such as writing style, epoch and subject, make it very difficult to find external resources that allow improving the models. In general, a part of the book is used to train the models that are then used to automatically transcribe the rest of the book. Therefore, the language model (LM) was estimated directly from the transcriptions of the pages included in the HTR training set using the SRILM ngram-count tool [23]. This language model is a 2-gram with Kneser-Ney back-off smoothing [24], interpolated with the whole lexicon in order to avoid out-of-vocabulary words, and it presents a perplexity of 742.8 for the test data.

3.4. System Setup

As previously stated, the decoding and lattice generation based on WFST for both modalities were implemented using the EESEN recogniser [13]; however, the multimodal lattice combination was performed using lattice-combine from Kaldi [25]. In order to optimise the presented multimodal and interactive framework, the values of the main variables were set up on a validation set, as well as the limit of mouse actions for correcting each erroneous word in the interactive transcription experiments, which was set to 3 [6].

3.5. Evaluation Metrics

The quality of the transcriptions is assessed using the Word Error Rate (WER), which allows us to obtain a good estimation of the transcriber post-edition effort. The WER is based on the Levenshtein edit distance [26] and can be defined as the minimum number of words that have to be substituted, deleted and inserted to transform the transcription into the reference text, divided by the number of words in the reference text.

The quality of the lattices can be defined as the quality of the best hypotheses contained in them, and it is known as the oracle error rate. Then, the quality of the word lattices is estimated by the oracle WER, which represents the smallest WER that can be obtained from the word sequences contained in them.

The overall interactive performance is given by the Word Stroke Ratio (WSR), which can also be computed by using the reference text. After each hypothesis proposed by the system, the longest common prefix between the hypothesis and the reference text is obtained and the first error from the hypothesis is corrected. This process is iterated until a full match is achieved. Therefore, the WSR can be defined as the number of user corrections that are necessary to produce correct transcriptions using the interactive system, divided by the total number of words in the reference text. This definition makes the WER comparable to the WSR. The relative difference between them gives us the effort reduction (EFR), which is an estimation of the reduction of the transcription effort that can be achieved by using the interactive system.

The statistical significance of the experimental results is estimated by means of p-values with a threshold of significance of α = 0.05, calculated through the Welch t-test [27] using the statistical computing tool R [28].
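As a minimal sketch of the WER metric defined above (word-level Levenshtein distance divided by the number of reference words), the following function illustrates the computation; the example strings are hypothetical.

def wer(reference, hypothesis):
    # Word Error Rate: minimum substitutions + deletions + insertions needed
    # to turn the hypothesis into the reference, over the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Cristo de la Agonia procedente del Hospital",
          "Cristo de la Alegria procedente del Hogar"))  # 2 substitutions / 7 words ~ 0.29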

4. Experimental Results

Table 1 presents the obtained experimental results. As can be observed, in the post-edition results the quality of the lattices offered by the handwritten text recognition system is pretty good; concretely, it presents a WER equal to 8.9% and an oracle WER equal to 1.8%. In this case, speech recognition does not seem to be a good substitute for handwriting recognition: the quality of the lattices obtained by the speech recognition system presents a WER equal to 31.4% and an oracle WER equal to 8.5%.

Table 1: Experimental Results.

  Experiment    Post-edition              CATTI
                WER      Oracle WER       WSR      EFR       P-value
  HTR           8.9%     1.8%             4.1%     53.9%     0.0511
  ASR           31.4%    8.5%             10.4%    −16.9%    0.3898
  Multimodal    10.6%    0.8%             1.8%     79.8%     0.0004

Regarding multimodality, the quality of the lattices obtained from the lattice combination of both modalities presents a WER equal to 10.6%. However, these multimodal lattices present an oracle WER equal to 0.8%. Even though the combination technique does not improve the unimodal HTR WER, it allows reducing the oracle WER substantially. Therefore, an outstanding effect on interactive transcription can be expected, since the oracle WER is related to the quality of the alternatives offered by the interactive and assistive system (the lower the oracle WER, the better the alternatives).

Concerning the CATTI results, 4.1% of estimated interactive human effort (WSR) was required for obtaining the perfect transcription from the HTR lattices, which represents 53.9% of relative effort reduction (EFR) over the HTR baseline (WER equal to 8.9%, p = .051). On the other side, no effort reduction can be considered when only ASR is used at the input of the interactive system. However, as expected, the multimodal combination not only represents 56.1% of relative improvement on the estimated interactive human effort (1.8% over 4.1%, p = .091), but these improvements are statistically significant when compared with the HTR baseline (EFR equal to 79.8%, p < .001).

5. Conclusions

In this paper, the use of multimodal combination for improving the CATTI system presented in previous works has been studied. Multimodal combination allows us to provide additional sources of information to the assistive transcription system, such as the speech dictation of the textual contents of the document to transcribe.

The obtained results show that the combination technique used, even though it does not improve the best hypothesis offered by the unimodal HTR system, may produce new bigrams that increase the search alternatives. Moreover, the adjustment of the word posterior probabilities can increase the probabilities of the correct words, reaching better hypotheses that allow the assistive transcription system to provide an additional and significant reduction of the human effort.

In future work, we will study the use of other combination techniques, the use of sentences in the handwritten text corpus instead of lines (in order to make multimodality more natural), and the use of the context information given by the previous lines.

6. Acknowledgments

This work was partially supported by the following research projects: READ - 674943 (European Union's H2020) and CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO / FEDER).

7. References [15] E. Granell, V. Romero, and C.-D. Martı́nez-Hinarejos, “Mul-
timodality, interactivity, and crowdsourcing for document tran-
[1] S. Chhabra, G. Gupta, M. Gupta, and G. Gupta, “Detecting Fraud- scription,” Computational Intelligence, vol. 34, no. 2, pp. 398–
ulent Bank Checks,” in Proc. of the 15th IFIP International Con- 419, 2018.
ference on Digital Forensics, 2017, pp. 245–266.
[2] A. H. Toselli, A. Juan, D. Keysers, J. González, I. Salvador, [16] V. Romero, A. H. Toselli, L. Rodrı́guez, and E. Vidal, “Computer
H. Ney, E. Vidal, and F. Casacuberta, “Integrated Handwriting Assisted Transcription for Ancient Text Images,” in Image Anal-
Recognition and Interpretation using Finite-State Models,” Int. ysis and Recognition, ser. Lecture Notes in Computer Science,
Journal of Pattern Recognition and Artificial Intelligence, vol. 18, M. Kamel and A. Campilho, Eds. Springer Berlin Heidelberg,
no. 4, pp. 519–539, 2004. 2007, vol. 4633, pp. 1182–1193.
[3] T. Bluche, H. Ney, and C. Kermorvant, “A Comparison of [17] V. Alabau, V. Romero, A. L. Lagarda, and C. D. Martı́nez-
Sequence-Trained Deep Neural Networks and Recurrent Neu- Hinarejos, “A Multimodal Approach to Dictation of Handwrit-
ral Networks Optical Modeling for Handwriting Recognition,” in ten Historical Documents.” in Proc. 12th Interspeech, 2011, pp.
Proc. of the 2nd SLSP, 2014, pp. 199–210. 2245–2248.
[4] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, [18] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri, J. B.
and J. Schmidhuber, “A Novel Connectionist System for Uncon- Mariño, and C. Nadeu, “Albayzin speech database: design of the
strained Handwriting Recognition,” IEEE Transaction on PAMI, phonetic corpus,” in Proc. of EuroSpeech, 1993, pp. 175–178.
vol. 31, no. 5, pp. 855–868, 2009. [19] M. Villegas, V. Romero, and J. A. Sánchez, “On the modifica-
[5] S. España-Boquera, M. Castro-Bleda, J. Gorbe-Moya, and tion of binarization algorithms to retain grayscale information for
F. Zamora-Martı́nez, “Improving offline handwriting text recon- handwritten tet recognition,” in Proc. of IbPRIA, 2015, pp. 208–
gition with hybrid HMM/ANN models,” IEEE Transactions on 215.
Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. [20] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition.
767–779, 2011. Prentice Hall, 1993.
[6] V. Romero, A. H. Toselli, and E. Vidal, Multimodal Interactive
Handwritten Text Transcription, ser. Series in Machine Perception [21] J. Puigcerver, “Are Multidimensional Recurrent Layers Really
and Artificial Intelligence (MPAI). World Scientific Publishing, Necessary for Handwritten Text Recognition?” in Proc. of the
2012. 14th ICDAR, vol. 1. IEEE, 2017, pp. 67–72.
[7] E. Granell, V. Romero, and C. D. Marı́nez-Hinarejos, “An Interac- [22] J. Puigcerver, D. Martin-Albo, and M. Villegas, “Laia: A
tive Approach with Off-line and On-line Handwritten Text Recog- deep learning toolkit for HTR,” 2016. [Online]. Available:
nition Combination for Transcribing Historical Documents,” in https://github.com/jpuigcerver/Laia/
Proc. of the 12th IAPR-DAS, 2016, pp. 269–274. [23] A. Stolcke, “SRILM-an extensible language modeling toolkit.” in
[8] C.-D. Martı́nez-Hinarejos, E. Granell, and V. Romero, “Com- Proc. of the 3rd Interspeech, 2002, pp. 901–904.
paring different feedback modalities in assisted transcription of [24] R. Kneser and H. Ney, “Improved backing-off for m-gram lan-
manuscripts,” in Proc. of the 13th IAPR-DAS, 2018, pp. 115–120. guage modeling,” in Proc. of ICASSP, vol. 1, 1995, pp. 181–184.
[9] A. Toselli, V. Romero, M. Pastor, and E. Vidal, “Multimodal inter-
[25] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
active transcription of text images,” Pattern Recognition, vol. 43,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz,
no. 5, pp. 1824–1825, 2010.
J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech
[10] V. Romero, J. A. Sanchez, V. Bosch, K. Depuydt, and J. de Does, Recognition Toolkit,” in Proc. of ASRU, 2011.
“Influence of text line segmentation in handwritten text recogni-
tion,” in Proc. of the 13th ICDAR, 2015, pp. 536–540. [26] V. I. Levenshtein, “Binary codes capable of correcting deletions,
insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8,
[11] F. Jelinek, Statistical Methods for Speech Recognition. MIT
pp. 707–710, February 1966.
Press, 1998.
[12] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition [27] B. L. Welch, “The Generalization of ‘Student’s’ Problem when
with deep recurrent neural networks,” in Proc. of ICASSP, 2013, Several Different Population Variances are Involved,” Biometrika,
pp. 6645–6649. vol. 34, no. 1/2, pp. 28–35, 1947.
[13] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end [28] R Core Team, R: A Language and Environment for Statistical
speech recognition using deep RNN models and WFST-based de- Computing, R Foundation for Statistical Computing, Vienna, Aus-
coding,” in Proc. of ASRU, 2015, pp. 167–174. tria, 2017, https://www.R-project.org/. Last access: May 2017.
[14] H. Xu, D. Povey, L. Mangu, and J. Zhu, “Minimum bayes risk
decoding and system combination based on a recursion for edit
distance,” Computer Speech & Language, vol. 25, no. 4, pp. 802–
828, 2011.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Improving Pronunciation of Spanish as a Foreign Language for L1 Japanese Speakers with Japañol CAPT Tool
Cristian Tejedor-García¹, Valentín Cardeñoso-Payo¹, María J. Machuca², David Escudero-Mancebo¹, Antonio Ríos², Takuya Kimura³

¹ Department of Computer Science, University of Valladolid, Spain
² Department of Spanish Philology, Autonomous University of Barcelona, Spain
³ Department of Spanish Language and Literature, Seisen University, Japan
cristian@infor.uva.es

Abstract

Availability and usability of mobile smart devices and speech technologies ease the development of language learning applications, although many of them do not include pronunciation practice and improvement. A key to success is to choose the correct methodology and provide a sound experimental validation assessment of their pedagogical effectiveness. In this work we present an empirical evaluation of Japañol, an application designed to improve pronunciation of Spanish as a foreign language targeted to Japanese people. A structured sequence of lessons and a quality assessment of pronunciations before and after completion of the activities provide experimental data about learning dynamics and level of improvement. Explanations have been included as corrective feedback, comprising textual and audiovisual material to explain and illustrate the correct articulation of the sounds. Pre-test and post-test utterances were evaluated and scored by native experts and automatic speech recognition, showing a correlation over 0.86 between both predictions. Sounds [s], [fl], [R] and [s], [fR], [T] explain the most frequent perception and production failures, respectively, which can be exploited to plan future versions of the tool, including gamified ones. Final automatic scores provided by the application highly correlate (r>0.91) with expert evaluation and a significant pronunciation improvement can be measured.

Index Terms: Computer-Assisted Pronunciation Training, corrective feedback, pronunciation training, second language learning, speech recognition, speech synthesis, minimal pairs

1. Introduction

Steady advances in automatic speech recognition (ASR) and speech synthesis (TTS) lead to an increase of their use in Computer-Assisted Pronunciation Training (CAPT) tools [1]. Availability and usability of mobile smart devices motivates the development of foreign language learning applications [2, 3], some of which include pronunciation practice and improvement as the main focus. Although the existing literature on CAPT indicates that these systems tell us how good learners perform and how to improve their pronunciation [4, 5], choosing the correct methodology and elaborating a sound experimental validation of the assessment of the pedagogical effectiveness of these tools are essential to further develop the field [6, 7].

Japañol¹ is a CAPT tool we have specifically designed for native Japanese learners of Spanish as a foreign language. Recent studies suggest that these learners mispronounce vowels and consonants [8, 9], one of the most frequent errors being the substitution of the Spanish round back high vowel [u] by the Japanese unrounded central-back vowel [W]. Consonant mispronunciations usually stem from different phonological processes in both languages: phoneme substitutions between allophonic variants or changes in the place of articulation are the most common errors. For instance, [l] and [R] are allophones of the same rhotic phoneme /r/ in Japanese. By contrast, both sounds are different phonemes in Spanish. Therefore, the Spanish trill phoneme /r/ is often pronounced by Japanese speakers as a flap ([R]) or even as a lateral ([l]) [10].

¹ This study has been partially supported by the Ministerio de Economía y Empresa (MINECO) and the European Regional Development Fund FEDER (TIN2014-59852-R) and by Consejería de Educación of Junta de Castilla y León (VA050G18).

The most frequent mistakes are found in fricatives: [f] is often confused with [x], especially when these sounds are followed by [u]. In this way, two different Spanish words, "fuego" (fire) and "juego" (play), can be pronounced in the same way by Japanese speakers (/f/ is realised as labiodental and /x/ as velar) and neither of these sounds exists in Japanese. The only similar fricative phoneme is /h/, and its realization depends on the vocalic context; for example, [ç] appears before a palatal vowel or semivowel (/i/ or /y/) and [F] before the velar one [W]. [T] is another fricative sound which causes pronunciation problems. The substitution of this phoneme by [s] is a very common mistake, a phenomenon known as Asian "seseo" (lisp) [11], although it should not be considered a characteristic of non-nativeness, since this pronunciation is accepted in Spanish American variants [12]. Finally, according to phonotactic rules, Spanish allows consonant followed by consonant in onset clusters (e.g., 'prado') and coda ones (e.g., 'inscripción'). The combination of fricative /f/ and /l/ or /r/ in onset also triggers mispronunciation. This problem is not only due to the fact that these consonants exhibit errors in simple onset position, but also because most Japanese syllables have a consonant-followed-by-vowel structure.

In previous works, we presented the development of serious games for training foreign pronunciation based on minimal pairs tasks [13, 14]. We were able to assess users' pronunciation level in an L2; that is, those with a certified higher level consistently reached better scores in the game [15]. Besides, we also discussed that the introduction of corrective feedback [16] allowed us to confirm that there was pronunciation improvement among users after the first stages of use. However, continued use of the tool seemed to invariably lead to stagnation. Finally, while the freedom of movement of game-oriented tools leads users to boost their score by repeating those tasks they found easy, learning-oriented tools insist on users' difficulties, offering guided and corrective feedback and achieving better effectiveness and efficiency pedagogical results [17].

10.21437/IberSPEECH.2018-21
In this paper we present an empirical evaluation of the effectiveness of pronunciation training for Japanese students of Spanish as a foreign language. Corrective feedback in pronunciation training of second language acquisition can be implicit or explicit, in terms of whether or not the learner is informed of the corrected form of the error [18]. Since it is recognized as an essential component of these kinds of systems, an explanation mode has been included, comprising textual and audiovisual material which provides explanations and illustrations of the correct articulation of the sounds included in each lesson. We also provide a guided path for users to follow in order to complete all activities, based on their results. The preferable characteristics of corrective feedback are: unambiguous, understandable, detectable, short, and it should preferably take account of learner characteristics, both proficiency and literacy level [19]. We provide different types of feedback depending on the tasks and results (see subsection 2.2).

In section 2, the experimental procedure is described, which includes the participants and protocol stages, the speech material processing and the Japañol description. The Results section shows users' interaction degree with the CAPT system, the most difficult Spanish sounds found in perception and production tasks, and users' human-rater and ASR score consistency, correlation and improvement. We end the paper with a discussion about the relevance of the results and the conclusions and future work.

2. Experimental procedure

2.1. Protocol description

A total of eight native Japanese students between 20 and 22 years old were selected as participants for our experiment. All participants qualified for the same Spanish as a foreign language for beginners (A2-B1) course at the Language Center of the University of Valladolid. In this way we guaranteed (1) that all students had the same initial level of Spanish and (2) that our experiment realistically reproduced the variety of persons that attend these types of courses.

Systematization of Spanish mispronunciations produced by Japanese speakers was the first step to choose minimal pairs in Spanish [20]. A set of 56 words, chosen according to the pairs to be worked out along the game and selected according to their phonetic difficulty, were spoken by each of the 8 participants, both before (pre-test) and after (post-test) the training sessions. Participants exclusively used the CAPT system in the three 45-minutes-maximum training sessions, with a delay of at least 48 hours. A total of 84 minimal pairs corresponding to mispronunciations were presented to participants, 12 in each of the seven lessons. Students were asked not to complete more than three lessons per session. The software application was installed under an Android emulator (NOX App player, official website: https://www.bignox.com/, last visited June 24th, 2018) in the computers of the Language Center multimedia laboratory. At the beginning of the first session, students were instructed on how to use the software. Then, they had to work individually and did not have any communication with either instructors or classmates. Each participant worked inside an individual cubicle equipped with a headset with microphone.

Once the post-test was finished, a manual revision and isolation of the recorded words of the pre- and post-test was carried out in order to elaborate a perceptual listening test. Five expert phoneticians and native speakers assigned a correct/incorrect value to each word, plus some extra annotations about the pronunciation which are not used in this work. All data was presented to human raters randomly and without user association. The audio files were also processed with the same ASR used in the application (Google ASR). Human raters were asked to concentrate on the specific sound of each word which should be generated correctly, ignoring the bad pronunciations of the rest of the sounds.

Figure 1: Standard flow to complete a lesson in Japañol. [Diagram omitted.]

2.2. Software tool of the CAPT system

Figure 1 shows the regular sequence of steps in order to complete a lesson in Japañol. After user authentication (step 1), seven lessons are presented at the main menu of the application (step 2). Each lesson includes a pair of Spanish sound contrasts and users achieve a particular score, expressed as a percentage. Lessons are divided into five main training modes: Theory, Exposure, Discrimination, Pronunciation and Mixed modes (step 3), in which each one proposes several task-types with a fixed number of mandatory task-tokens. The final lesson score is the mean score of the last three modes. Users are guided by the system in order to complete all training modes of a lesson. When reaching a score below 60% in the Discrimination, Pronunciation or Mixed modes, users are recommended to go back to Exposure mode as a feedback resource and then return to the failed mode. Besides, the next lesson is enabled when users reach a minimum score of 60%.

The first training mode is Theory (step 4). A brief and simple video describing the target contrast of the lesson is presented to the user as a first contact with feedback. At the end of the video, the next mode becomes available, but users may choose to review the material as many times as they want. Exposure (step 5) is the second mode. Users strengthen the lesson contrast experience previously introduced in Theory mode, in order to support their assimilation. Three minimal pairs are displayed to the user. In each one of them, both words are synthetically produced by Google TTS five times (highlighting the current word), alternately and slowly. After that, users must record themselves at least one time per word and listen to their own and the system's sound. Words are represented with their orthographic and phonemic forms. A replay button allows listening to the specified word again. Step 6 refers to Discrimination mode, in which ten minimal pairs are presented to the user consecutively. In each one of them, one of the words is synthetically produced, at random. The challenge of this mode consists of identifying which word is produced. As feedback elements,

words have their orthographic and phonetic transcription representations. Users can also request to listen to the target word again with a replay button. Speed varies alternately between slow and normal speed rates. Finally, the system changes the word color to green (success) or red (failure) with a chime sound. Pronunciation is the fourth mode (step 7), whose aim is to produce as well as possible both words, separately, of the five minimal pairs presented with their phonetic transcription. Google's ASR determines automatically and in real time acceptable or non-acceptable inputs. In each production attempt the tool displays a text message with the recognized speech, plays a right/wrong sound and changes the word's color to green or red. The maximum number of attempts per word is five in order not to discourage users. However, after three consecutive failures, the system offers the user the possibility of requesting a word synthesis as explicit feedback as many times as they want with a replay button. Mixed mode is the last mode of each lesson (step 8). Nine production and perception tasks alternate at random in order to further consolidate the obtained skills and knowledge.

Table 1: Events by user with the CAPT tool in the whole experiment. THE, EXP, DIS, PRO and MIX correspond to Theory, Exposure, Discrimination, Pronunciation and Mixed modes, respectively. n, m and M are the mean, minimum and maximum values. The Time (min) row represents the time spent in minutes per person in each mode in the whole experiment. #Tries is the number of times a mode is practiced by each user. Mand. and Req. mean mandatory and requested TTS listenings. Productions use ASR.

                      THE                  EXP                   DIS                  PRO                   MIX
                      n      m      M      n       m      M      n      m      M      n       m      M      n      m      M
  Time (min)          16.96  11.13  20.75  18.73   12.77  21.90  7.26   6.03   8.93   39.82   22.43  72.85  17.35  9.78   21.43
  #Tries              8.25   7      10     11.38   10     14     10.16  8      15     10.38   8      14     7.38   4      10
  #Mand.listenings    -      -      -      308.75  220    390    94.75  80     134    -       -      -      22.13  12     30
  #Req.listenings     -      -      -      97.5    70     126    35.0   0      82     43.13   0      99     27.25  9      52
  #Recordings         -      -      -      64.6    47     81     -      -      -      -       -      -      -      -      -
  #Discriminations    -      -      -      -       -      -      94.75  80     134    -       -      -      22.13  12     30
  #Productions        -      -      -      -       -      -      -      -      -      166.5   124    242    80.63  38     113

3. Results

Table 1 shows users' interaction degree with Japañol. After all training sessions, Japanese learners spent an average of 100.12 minutes performing the proposed tasks. Users consumed 83.06% of the mean effective time carrying out interactive exercises in the EXP, DIS, PRO and MIX modes. As a mean term, users listened to the TTS system 628.51 times and used the ASR system 247.13 times, giving a rate of 8.74 uses of the TTS/ASR per minute. Table 1 also shows important differences in the use of the tool depending on the user. For instance, the user who performed the tasks of the PRO mode in the fastest way spent 22.43 minutes, and the one who spent more time, 72.85 minutes. This contrast can also be observed in the rest of the modes and in the number of times they interacted with the tool. The inter-user differences affect both the number of times the users make use of the ASR (162 minimum vs. 355 maximum) and the number of times they requested the use of TTS (79 vs. 359 times).

Tables 2 and 3 display the confusion matrices between the sounds of the minimal pairs in perception and production events. The most confused pair in discrimination tasks is [l]-[R] (54 times) and the least confused one is [T]-[f] (5 times). The sounds with the lowest pronunciation success rate are [s] and [fl] (both < 71%), and those with the highest pronunciation success rate are [f] and [T] (both > 92%), corresponding to the lowest number of requested listenings (5 and 20).

Table 2: Confusion matrix of discrimination task-tokens (diagonal: right discrimination task-tokens). The rows are the sounds expected by the tool and the columns are the sounds selected by the user. success refers to the success rate of the corresponding sound row. #Lis is the number of requested listenings to the sound row. [Matrix values not reproduced here.]

Table 3 shows pronunciation event results at first and last attempt per word utterance. A produced word is correct if the ASR results provide the desired word in first place. Users can try a maximum of five wrong productions per word. We can observe a success improvement from first to last attempt (last column), the highest ones being the [fl] (48.4%) and [fR] (27.7%) sounds. At first attempt, the most confused pair in production tasks is [fl]-[fR] (88 times) and the least confused one is [l]-[rr] (20 times). The sounds with the lowest discrimination success rate are [fl] and [s] (both < 35%), and those with the highest discrimination success rates are [l] and [R] (both > 75%). At last attempt, the most confused pair in production tasks is [s]-[T] (102 times) and the least confused one is [l]-[R] (3 times). The sounds with the lowest production success rate are [s] and [fR] (both < 65%). A higher number of requested listenings (first column) appears for last-attempt pronunciation at lower success rates. The sounds with the highest production success rate are [l] and [R] (both > 96%).

Table 3: Confusion matrix of pronunciation task-tokens at first and last attempt per word sequence (diagonal: right pronunciation task-tokens at first and last attempt per word sequence). The rows are the sounds expected by the tool and the columns are the sounds produced by the user. success refers to the success rate of the corresponding sound row. #Lis is the number of requested listenings to the sound row. [Matrix values not reproduced here.]

Table 4 presents the scores for each user at each of the given stages of the experiment (pre-test, CAPT tool, post-test, and a delta score of pre- and post-test). EXP and ASR scores refer to the learners' qualifications in both tests by human raters and Google ASR, respectively. These scores are computed by summing up the number of correct words per speaker and normalizing the result to the range [0,10]. The JAP score is computed from the number of correct and incorrect task-tokens while doing the required task-types of the training modes (Discrimination, Pronunciation and Mixed modes), as a qualification to rank the participants.

Table 4: Different user scores in pre-test, post-test and game by experts, ASR and game, on a scale of [0, 10]. ID, EXP, ASR and JAP refer to user identifier, experts, Google ASR and Japañol, respectively.

        Pre-test         Game    Post-test        ∆ (Pre/Post)
  ID    EXP     ASR      JAP     EXP     ASR      EXP     ASR
  07    9.8     5.1      9.4     9.4     7.5      -0.4    2.4
  05    8.6     4.5      9.1     8.8     4.6      0.2     0.2
  01    8.1     2.9      8.0     7.9     4.6      -0.3    1.7
  02    7.3     3.7      8.4     7.8     4.8      0.4     1.2
  03    6.9     1.9      6.3     7.6     2.7      0.8     0.8
  08    6.8     2.8      6.5     7.1     3.4      0.4     0.6
  06    5.8     0.8      5.9     6.5     3.2      0.8     2.4
  04    5.4     2.1      6.9     7.3     2.9      1.9     0.8

Concerning the EXP values, a consistency test among human raters based on the Fleiss' Kappa statistical indicator was carried out both for pre-test and post-test evaluations. For logistic reasons, the post-test evaluation was carried out first and a moderate agreement [21] was found (Kappa value = 0.50). There is a considerable increment for the pre-test, reaching a substantial agreement (Kappa value = 0.63). Comparing subjective and objective scores, delta score values (last two columns) are positive in almost all cases both in EXP and ASR (only two of them decrease in EXP, both in the top three). They also show a fair correlation with pre-test expert scoring (r=-0.856) and post-test expert scores with ASR ones (r=-0.735). Pre-test scores assigned by experts (EXP) show a reasonable linear regression correlation with those obtained by applying the ASR in the same test (r=0.890). A similar correlation is found for post-test EXP and ASR (r=0.834). Game scores (JAP) show good correlation with EXP post-test results (r=0.912), clearly over the correlation found between JAP and pre-test human rater results (r=0.867).
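As an illustration of how these scores and correlations can be computed, the sketch below maps per-speaker correct-word counts onto the [0, 10] scale and obtains a Pearson correlation with SciPy. The 56-word divisor follows the pre/post-test protocol of Section 2.1, and since the Table 4 values are rounded to one decimal, recomputing the coefficient from them only approximates the reported figures.

from scipy.stats import pearsonr

def score_0_10(n_correct, n_total=56):
    # Number of correctly produced words mapped to the [0, 10] scale of Table 4.
    return 10.0 * n_correct / n_total

# Pre-test columns of Table 4 (EXP = expert raters, ASR = Google ASR).
exp_pre = [9.8, 8.6, 8.1, 7.3, 6.9, 6.8, 5.8, 5.4]
asr_pre = [5.1, 4.5, 2.9, 3.7, 1.9, 2.8, 0.8, 2.1]
r, p_value = pearsonr(exp_pre, asr_pre)
print(round(r, 3))  # approximates the r = 0.890 reported for pre-test EXP vs ASR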
4. Discussion and Conclusion

We have described the experiment results with a CAPT system for native Japanese learners of Spanish as a foreign language, which includes a software tool, Japañol, that we believe is worth taking into account as a possible teaching complement. The results have shown that Japanese learners of Spanish have difficulty with [f] in onset cluster position, in both perception (Table 2) and production (Table 3). [s]-[T] present similar results: they tend to substitute [T] by [s], but this pronunciation is accepted in Latin American Spanish. Regarding liquid consonants, Japanese speakers are more successful at phonetically producing [l] and [R] than at discriminating these phonemes. Japanese speakers have already acquired these sounds since they are allophones of the same liquid phoneme in Japanese. For this reason, it does not seem to be necessary to distinguish them in Japanese, whereas it is in Spanish.

The applied methodology has proved coherent and efficient because easier tasks (exposure, discrimination) were presented before tougher ones (pronunciation) in time and repetition terms (Table 1). It has also led to better final scores (Table 4). Participants consistently resorted to TTS models when faced with difficulties, both in perception and production modes (#Lis column of Tables 2 and 3 and #Req.listenings row of Table 1). The fact that our CAPT tool makes use of ASR and TTS systems is the principal pedagogical and operational concern. The quality of the synthetic voice in the rendering of minimal pairs appears to have been adequate. Judging from the gathered data, the TTS system employed seems to be beneficial for students [22, 23]: the rate of success significantly increases after undertaking the exposure activities imposed by feedback. The role of ASR is even more crucial as it offers diagnosis to users as real-time automatic feedback. Nowadays, ASR systems have some limitations, such as isolated or infrequent words, adaptation to the technology, and L1-transferred pronunciation utterances recognized as correct L2 words (false positives). However, [24] demonstrated the effectiveness of an ASR-based CAPT tool for training users in the production of decontextualized isolated words, and [25] reported L2 French vowel /y/ production improvement after training with a mobile ASR system.

Human raters' post-test scores fairly correlate with ASR and game ones, which will be useful in the future to evaluate a large amount of users while reducing human costs (Table 4). The users with the lowest pre-test scores improved more than the best ones. However, they did not reach better results than the top ones. This is due to the fact that the tool does not give extra activities when the limit is reached. As future work, we are collaborating with Seisen University with a bigger group of students divided into experimental, control and in-classroom groups. We are also considering the possibility of applying the methodology to other more exotic and minority languages. Finally, a comparison between a game-oriented version and a learning-oriented one could be our next step.
5. References

[1] R. I. Thomson and T. M. Derwing, “The effectiveness of L2 pronunciation instruction: A narrative review,” Applied Linguistics, vol. 36, no. 3, p. 326, 2014.
[2] A. Kukulska-Hulme, “Language learning defined by time and place: A framework for next generation designs,” in Left to My Own Devices: Learner Autonomy and Mobile Assisted Language Learning, ser. Innovation and Leadership in English Language Teaching, J. E. Díaz-Vera, Ed. Bingley, UK: Emerald Group Publishing Limited, 2012, vol. 6, pp. 1–13. [Online]. Available: http://oro.open.ac.uk/30756/
[3] A. Pareja-Lora, C. Calle-Martínez, and P. Rodríguez-Arancón, New perspectives on teaching and working with languages in the digital era. Research-publishing.net, 2016.
[4] W. Li and D. Mollá-Aliod, “Computer processing of oriental languages. Language technology for the knowledge-based economy,” Lecture Notes in Computer Science, vol. 5459, 2009.
[5] M. Carranza, “Diseño de aplicaciones para la práctica de la pronunciación mediante dispositivos móviles y su incorporación en el aula de ELE,” El español entre dos mundos: Estudios de ELE en Lengua y Literatura, pp. 279–297, 2014.
[6] A. Neri, C. Cucchiarini, and H. Strik, “The effectiveness of computer-based speech corrective feedback for improving segmental quality in L2 Dutch,” ReCALL, vol. 20, no. 2, pp. 225–243, 2008.
[7] J. Lee, J. Jang, and L. Plonsky, “The effectiveness of second language pronunciation instruction: A meta-analysis,” Applied Linguistics, vol. 36, no. 3, p. 345, 2015.
[8] M. Carranza, “Errores y dificultades específicas en la adquisición de la pronunciación del español LE por hablantes de japonés y propuestas de corrección,” Nuevos enfoques en la enseñanza del español en Japón, pp. 51–78, 2012.
[9] A. Blanco-Canales and M. Nogueroles-López, “Descripción y categorización de errores fónicos en estudiantes de español/L2. Validación de la taxonomía de errores AACFELE,” Logos: Revista de Lingüística, Filosofía y Literatura, vol. 23, no. 2, pp. 196–225, 2013.
[10] A. R. Bradlow, D. B. Pisoni, R. Akahane-Yamada, and Y. Tohkura, “Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production,” The Journal of the Acoustical Society of America, vol. 101, no. 4, pp. 2299–2310, 1997.
[11] J. M. Cabezas Morillo, “Las creencias de los estudiantes japoneses sobre la pronunciación española: un análisis exploratorio,” Trabajo de Máster, Universidad Pablo de Olavide, 2009. [Online]. Available: http://www.mecd.gob.es/dam/jcr:5ebaa0b9-7b77-4a28-a732-b743be08dd06/2010-bv-11-05cabezas-pdf.pdf
[12] M. Carranza, C. Cucchiarini, J. Llisterri, M. J. Machuca, and A. Ríos, “A corpus-based study of Spanish L2 mispronunciations by Japanese speakers,” in Edulearn14 Proceedings, 6th International Conference on Education and New Learning Technologies. IATED Academy, 2014, pp. 3696–3705. [Online]. Available: http://liceu.uab.cat/~joaquim/publicacions/Carranza et al 14 Corpus Spanish L2.pdf
[13] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas, C. González-Ferreras, and D. Escudero-Mancebo, “Measuring pronunciation improvement in users of CAPT tool TipTopTalk!,” Interspeech, pp. 1178–1179, September 2016.
[14] C. Tejedor-García, D. Escudero-Mancebo, E. Cámara-Arenas, C. González-Ferreras, and V. Cardeñoso-Payo, “Improving L2 production with a gamified computer-assisted pronunciation training tool, TipTopTalk!,” IberSPEECH 2016: IX Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop events, pp. 177–186, 2016.
[15] D. Escudero-Mancebo, E. Cámara-Arenas, C. Tejedor-García, C. González-Ferreras, and V. Cardeñoso-Payo, “Implementation and test of a serious game based on minimal pairs for pronunciation training,” SLaTE, pp. 125–130, 2015.
[16] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas, C. González-Ferreras, and D. Escudero-Mancebo, “Playing around minimal pairs to improve pronunciation training,” IFCASL 2015, 2015.
[17] C. Tejedor-García, D. Escudero-Mancebo, C. González-Ferreras, E. Cámara-Arenas, and V. Cardeñoso-Payo, “Evaluating the efficiency of synthetic voice for providing corrective feedback in a pronunciation training tool based on minimal pairs,” SLaTE, 2017.
[18] Y. Sheen and R. Ellis, “Corrective feedback in language teaching,” Handbook of Research in Second Language Teaching and Learning, vol. 2, pp. 593–610, 2011.
[19] B. Penning de Vries, C. Cucchiarini, H. Strik, and R. van Hout, “The role of corrective feedback in second language learning: New research possibilities by combining CALL and speech technology,” Proceedings of SLaTE 2010, Tokyo, Japan, Nov. 2010.
[20] C. Tejedor-García and D. Escudero-Mancebo, “Uso de pares mínimos en herramientas para la práctica de la pronunciación del español como lengua extranjera,” Revista de la Asociación Europea de Profesores de Español. El español por el mundo, no. 1, pp. 355–363, 2018.
[21] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, pp. 159–174, 1977.
[22] G. Smith, W. Cardoso, and C. G. Fuentes, “Text-to-speech synthesizers: Are they ready for the second language classroom?,” Concordia University, 2015.
[23] D. Liakin, W. Cardoso, and N. Liakina, “The pedagogical use of mobile speech synthesis (TTS): focus on French liaison,” Computer Assisted Language Learning, vol. 30, no. 3–4, pp. 325–342, 2017.
[24] A. Neri, O. Mich, M. Gerosa, and D. Giuliani, “The effectiveness of computer assisted pronunciation training for foreign language learning by children,” Computer Assisted Language Learning, vol. 21, no. 5, pp. 393–408, 2008.
[25] D. Liakin, W. Cardoso, and N. Liakina, “Learning L2 pronunciation with a mobile speech recognizer: French /y/,” CALICO Journal, vol. 32, no. 1, p. 1, 2015.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Exploring E2E speech recognition systems for new languages

Conrad Bernath¹, Aitor Álvarez¹, Haritz Arzelus¹, Carlos-D. Martínez-Hinarejos²

¹ Human Speech and Language Technology Group, Vicomtech, Spain
² Pattern Recognition and Human Language Technologies Research Center, Universitat Politècnica de València, Spain

[cbernath,aalvarez,harzelus]@vicomtech.org, cmartine@dsic.upv.es
Abstract

Over the last few years, advances in both machine learning algorithms and computer hardware have led to significant improvements in speech recognition technology, mainly through the use of Deep Learning paradigms. As has been amply demonstrated in different studies, Deep Neural Networks (DNNs) have already outperformed traditional Gaussian Mixture Models (GMMs) at acoustic modeling in combination with Hidden Markov Models (HMMs). More recently, new attempts have focused on building end-to-end (E2E) speech recognition architectures, especially in languages with many resources like English and Chinese, with the aim of overcoming the performance of LSTM-HMM and more conventional systems.

The aim of this work is, first, to present the different techniques that have been applied to enhance state-of-the-art E2E systems for American English using publicly available datasets. Secondly, we describe the construction of E2E systems for Spanish and Basque, and explain the strategies applied to overcome the limited availability of training data, especially for Basque as a low-resource language. In the evaluation phase, the three E2E systems are also compared with LSTM-HMM based recognition engines built and tested on the same datasets.

Index Terms: speech recognition, deep learning, end-to-end speech recognition, recurrent neural networks

1. Introduction

Automatic Speech Recognition (ASR) systems have historically employed Hidden Markov Models (HMMs) to capture the time variability of the speech signal and Gaussian Mixture Models (GMMs) to model the HMM state probability distributions. Relevant advances in both machine learning and computational capacity have led to significant improvements in the field, mainly by means of Deep Learning algorithms. Numerous works have thereby shown that Deep Neural Networks (DNNs) in combination with HMMs can outperform GMM-HMM systems at acoustic modeling on a variety of datasets [1].

Recently, new attempts have focused on building End-to-End (E2E) ASR architectures, especially in languages with many resources like English and Chinese [2]. These new architectures intend to benefit from the vast amount of newly available data and to make training a much more efficient process by allowing a better optimization of the system as a whole unit [3]. Besides, one of their most remarkable advantages is the ability to build ASR engines without the need for a phoneme set or a pronunciation lexicon of the language. This definitely reduces the human effort required to manually construct resources that help the system estimate its components. This novel approach allows the system to extract the best features and to employ a single optimization criterion that leads to a better overall performance.

The interest in building E2E ASR systems that directly map the input speech signal to grapheme/character sequences and that jointly train the acoustic, pronunciation and language models has grown exponentially in recent years [3, 4, 5, 6].

Two main approaches have been employed to build E2E ASR models. Connectionist Temporal Classification (CTC) is probably the most widely used criterion for systems based on characters [2, 7, 8], sub-words [9] or words [10]. It employs Markov assumptions and dynamic programming to efficiently solve sequential problems [2, 3, 7]. On the other hand, attention-based methods use an attention mechanism to perform the alignment between acoustic frames and characters [4, 5, 6]. Unlike CTC, they do not require several conditional independence assumptions to obtain the label sequence probabilities, allowing extremely non-sequential alignments, as in the case of machine translation.

Several techniques have been applied in the literature to enhance the performance of E2E ASR engines. These techniques are mainly focused on compensating for the lack of training data, adapting the characteristics of the system to some specific domains, or gaining robustness in different acoustic environments. Data augmentation aims to extend the training material by generating new synthetic data through noise injection or by augmenting the audios using Vocal Tract Length Perturbation (VTLP), tempo modification or speed alteration [11]. Transfer Learning improves the learning of a new task by taking advantage of the knowledge obtained from a previously learned related task [12]. For E2E ASR models, this technique consists of using the weights of a model previously trained on one language as initial weights for the training of a new target language. Fine Tuning is commonly applied when existing E2E models have to be adapted to some particular conditions; it can be seen as a subtype of Transfer Learning, but without freezing layers or making changes over them.

A number of enhancement techniques have also been employed to improve the performance of these systems under acoustically noisy conditions. In addition to front-end methods applied at the feature level [13, 14] or training on noisy data with a range of SNR values [15], other methods such as Dropout [16] or Curriculum Learning [17] have also been proven to improve robustness against noise.

In this work, the construction and evaluation of several E2E systems are presented for English, Spanish and Basque, considering the latter a low-resource language. In addition, several of the modeling techniques explained above were applied with the aim of outperforming baseline E2E engines and exploring further possibilities within more challenging acoustic domains and languages. Finally, the results achieved were compared with those obtained by LSTM-HMM based systems through the use of the same training and evaluation datasets.

This paper is organized as follows. Section 2 describes the main architecture of the E2E systems developed within this work. Section 3 describes the training and evaluation data employed, whilst Section 4 presents the baseline ASR systems built for each language, including both E2E and LSTM-HMM architectures. The experiments performed are described in Section 5 and, finally, Section 6 draws conclusions and looks at future work.
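Since the CTC criterion mentioned above is central to the systems described in this paper, the following minimal sketch shows how the loss can be computed with PyTorch's built-in implementation; the tensor shapes and the blank index are illustrative assumptions, not the authors' configuration.

import torch
import torch.nn as nn

# Dummy shapes: T frames, N utterances per batch, C output symbols (blank at index 0).
T, N, C = 200, 8, 30
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # per-frame log-probabilities
targets = torch.randint(1, C, (N, 25), dtype=torch.long)                  # character indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 25, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients reach whatever acoustic model produced log_probs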
2. System overview

All the E2E systems presented in this work were developed following the Deep Speech 2 architecture [2]. The core of the system is basically an RNN model which ingests speech spectrograms and provides text transcriptions as output. Although the Long Short-Term Memory (LSTM) is widely used as the RNN model, in this architecture Gated Recurrent Units (GRU) [18] are employed, since they have been proven to train more rapidly and to be less likely to diverge [19].

A sequence of 2 layers of 2D convolutional neural networks (CNN) is employed as a spectral feature extractor from the spectrograms. The first layer is composed of 1 input and 32 output channels and uses filters of size 41 × 11 and a stride of size 2 × 2. The second layer takes as input the output of the first layer, composed of 32 channels, and its output incorporates 32 channels as well. This second layer employs a filter dimension of 21 × 11 and a stride of size 2 × 1. A 2D batch normalization function is applied to the output of both layers, in addition to a hard tanh activation function.

The E2E systems are set up using 5 bidirectional GRU layers. Each hidden layer is composed of 800 hidden units. After the bidirectional recurrent layers, a fully connected layer is applied as the last layer of the whole model. The output corresponds to a softmax function which computes a probability distribution over characters at each timestep. The size of this output layer is equal to the total number of characters to predict.

During the training process, the CTC loss function is computed to measure the error of the predictions, whilst the gradient is computed using the backpropagation through time algorithm with the aim of updating the network parameters. The optimizer is Stochastic Gradient Descent (SGD).

In addition, external language models (LMs) were integrated during the decoding of the E2E systems with the aim of improving on the initial results. To this end, modified Kneser-Ney smoothed n-gram models of several orders were estimated using the KenLM toolkit [20].

3. Corpora description

In this section, the acoustic and text data used to train and evaluate both the E2E and LSTM-HMM systems are presented for each language.

3.1. English

The freely available LibriSpeech corpus [21], a read speech corpus based on audio-books from LibriVox (https://librivox.org/), was used as the dataset. The training, development and testing subsets were maintained as in the original. These partitions are detailed in Table 1.

Table 1: Training, development and test subsets for English.

subset        hours
train-clean   464.2
train-other   496.7
dev-clean     5.4
test-clean    5.4
dev-other     5.3
test-other    4.1
test-noisy    5.4

The clean partitions correspond to the pools with a lower WER in the whole corpus, whilst the other ones contain the a priori most difficult audios. Test-noisy was artificially created within this work using synthesis by superposition [22] of noise samples onto the test-clean subset. The noise samples correspond to audios from different acoustic environments selected from the YouTube platform.

Regarding the text data used to train the LMs, they were composed of 22,000 books from the Project Gutenberg repository (https://www.gutenberg.org/), totalling 803 million tokens and 900,000 unique words.

3.2. Spanish

The Spanish subset of the SAVAS corpus [23] was used as the main dataset. It is composed of broadcast news content from the Basque Country's public broadcast corporation EiTB (Euskal Irrati Telebista), and includes audios in both clear (studio) and noisy (outside) conditions. This media dataset was then transferred through both land- and mobile-lines using different combinations, generating new telephone domain subsets, as summarized in Table 2.

Table 2: Training, development and test subsets for Spanish.

subset              hours
train-media         132.5
train-land-mobile   397.5
dev-media           4
test-media-clean    4
test-media-noisy    4
dev-land            4
test-land           4.6
dev-mobile          4
test-mobile         4.6

Concerning the text data, they were obtained by merging transcriptions of the training audios and generic-domain news crawled from the Internet. The texts summed up to a total of 320 million words.

3.3. Basque

The Basque training data was also composed of the Basque subset of the SAVAS corpus, and the audios were likewise gathered from Basque broadcast news programs. No telephone domain partition was generated in this case. Table 3 describes the main characteristics of the SAVAS Basque corpus.

The text data were also obtained by merging transcriptions and generic news crawled from the Internet. In total, 180 million words were employed for the LM estimation.

Table 3: Training, development and test subsets for Basque.

subset            hours
train-media       142.25
dev-media         4
test-media-clean  4
test-media-noisy  4
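To make the architecture of Section 2 concrete, the following is a minimal, hypothetical PyTorch sketch of a Deep Speech 2-style model with the reported layer sizes (two strided 2D convolutions over the spectrogram, five bidirectional GRU layers of 800 units, and a character-level softmax output). The padding values, the hard tanh clipping range and the single stacked nn.GRU (without batch normalization between recurrent layers) are simplifying assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    def __init__(self, n_freq_bins: int, n_chars: int):
        super().__init__()
        # Convolutional front-end with the filter sizes and strides reported in Section 2.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),   # clipping range assumed
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        # Frequency bins remaining after the two strided convolutions.
        f1 = (n_freq_bins + 2 * 20 - 41) // 2 + 1
        f2 = (f1 + 2 * 10 - 21) // 2 + 1
        self.rnn = nn.GRU(input_size=32 * f2, hidden_size=800,
                          num_layers=5, bidirectional=True)
        self.fc = nn.Linear(2 * 800, n_chars)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq, time)
        x = self.conv(spectrogram)                              # (batch, 32, freq', time')
        b, c, f, t = x.size()
        x = x.view(b, c * f, t).permute(2, 0, 1).contiguous()   # (time', batch, features)
        x, _ = self.rnn(x)                                      # bidirectional GRU stack
        x = self.fc(x)                                          # (time', batch, n_chars)
        return x.log_softmax(dim=-1)                            # suitable input for nn.CTCLoss

model = DeepSpeech2Like(n_freq_bins=161, n_chars=29)   # illustrative sizes, not the paper's exact values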
4. Baseline E2E and LSTM-HMM systems

Using the corpora described above, the E2E and LSTM-HMM based baseline systems were trained in order to be compared to the evolved ASR systems presented in Section 5. The characteristics of these baseline systems are described in the following subsections.

4.1. Baseline E2E models

Each E2E model was trained per language over the corresponding corpus described above, with a default setup and without applying any enhancement technique. Linear-scale spectrograms were employed as input. The English E2E baseline models were trained using the train-clean and train-other partitions for 15 epochs and a batch size of 10, whilst the Spanish and Basque baselines were built with the train-media partition for 25 epochs and a batch size of 20, because of the corpus characteristics.

4.2. LSTM-HMM models

These models were estimated with the Kaldi toolkit [24]. The acoustic models corresponded to a hybrid LSTM-HMM implementation, where unidirectional LSTMs were trained to provide posterior probability estimates for the HMM states. Besides, modified Kneser-Ney smoothed 3-gram and 5-gram models were used for decoding and re-scoring of the lattices, respectively. Both LMs were estimated with the KenLM toolkit.

5. Experiments

5.1. Linear- and Mel-scale based spectrograms

The original Deep Speech 2 architecture employs linear spectrograms as the main audio parametrization method. This experiment focused on analyzing the use of Mel-scale spectrograms as input data to the E2E model, as was done in [25], since the Mel scale highlights the information that is most relevant to human hearing.

Two models were compared for each of the three languages under evaluation: the baseline E2E models presented above, trained using linear spectrograms, and new models estimated with the same configuration and training set but using Mel-based spectrograms for audio parametrization.

The results in Table 4 show a noticeable improvement in all the experiments except one when using Mel-scale spectrograms. Focusing on the results without an LM (No LM), a relative improvement of 3% was reached for English on the clean test set, whilst these enhancements were of 8.7% and 10% for Spanish and Basque. This positive behavior was also maintained over the noisy test set, obtaining relative improvements of 1.4%, 7.0% and 6.7% for each corresponding language. The results using an LM followed the same tendency, with remarkable differences obtained especially for Basque over the test-media-clean set, where an absolute improvement of 4 percentage points was reached (from 12.9 to 8.9).

Table 4: WER (%) results for linear- and Mel-based systems for each language.

                Test(-media-)clean       Test(-media-)noisy
                No LM      3-gram        No LM      3-gram
English
Baseline-EN     10.6       5.6           35.3       23.6
Mel-scale-EN    10.2       5.4           34.8       22.2
Spanish
Baseline-ES     24.0       10.3          39.5       19.2
Mel-scale-ES    21.9       10.3          36.7       18.9
Basque
Baseline-EU     23.8       12.9          38.8       17.3
Mel-scale-EU    21.2       8.9           36.2       19.2
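The two input representations compared in Section 5.1 can be obtained with standard audio tooling. The following sketch uses librosa; the analysis parameters (frame sizes, number of Mel bands) and the input file are illustrative assumptions, not the paper's exact configuration.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input utterance

# Linear-frequency (STFT magnitude) spectrogram, as in the original Deep Speech 2 setup.
linear_spec = np.abs(librosa.stft(y, n_fft=320, hop_length=160))   # 20 ms window, 10 ms hop

# Mel-scale spectrogram, emphasising the frequency bands most relevant to human hearing.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=320,
                                          hop_length=160, n_mels=80)

# A logarithmic compression is commonly applied before feeding either representation to the network.
log_linear = np.log1p(linear_spec)
log_mel = librosa.power_to_db(mel_spec)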
5.2. Training a mixed model for media and telephone

Telephone speech is usually limited by the bandwidth at which the audio is transmitted through the telephone channel, with sampling rates as low as 8 kHz (4 kHz Nyquist). This commonly makes it necessary to train separate acoustic models per domain.

In this work, a mixed E2E model for the media and telephone domains was built, and a fine-tuning technique was then applied over each domain training set to generate domain-adapted individual models. This was only performed for the Spanish language, and the performance of the resulting systems was compared to the Spanish baseline system. Following the results obtained in the previous experiment, and with the aim of losing as little of the valuable telephone speech information as possible, Mel-scale spectrograms were employed for audio parametrization. In addition, the Spanish train-media dataset was 3-fold augmented through speed perturbation to balance the quantity of data in both domains.

The results achieved over the media test set and the telephone test set are presented in Tables 5 and 6 respectively.

Table 5: WER results for the baseline, mixed (Mixed-ES) and fine-tuned (Mixed-FT-Media and Mixed-FT-Phone) models over the Spanish media test set.

                 Test-media-clean         Test-media-noisy
                 No LM      3-gram        No LM      3-gram
Baseline-ES      24.0       10.3          39.5       19.2
Mixed-ES         16.5       8.8           21.3       11.5
Mixed-FT-Media   15.7       8.5           20.5       10.9
Mixed-FT-Phone   18.6       9.6           22.4       11.5

Table 6: WER results for the baseline, mixed (Mixed-ES) and fine-tuned (Mixed-FT-Media and Mixed-FT-Phone) models over the Spanish telephone test set.

                 Test-mobile              Test-land
                 No LM      3-gram        No LM      3-gram
Baseline-ES      59.5       33.8          51.7       25.9
Mixed-ES         22.4       12.8          18.4       9.9
Mixed-FT-Media   23.4       13.0          18.7       10.0
Mixed-FT-Phone   22.2       12.1          17.2       9.3

Looking at the results obtained for the media test set in clean conditions, absolute improvements of 7.5 and 8.3 WER points can be observed for the Mixed-ES and the fine-tuned Mixed-FT-Media models with respect to the baseline. Besides, improvements of 18.3% were achieved by the fine-tuned Mixed-FT-Media model on the noisy media test set when compared to the baseline system.

Regarding the results on the telephone domain presented in Table 6, the three models under evaluation show better performance than the Baseline-ES model. As expected, the Mixed-FT-Phone model shows the best results, with an absolute improvement of 37.3 WER points with respect to the baseline over the Test-mobile set, composed of audios transferred through the mobile channel. This improvement reaches 34.5 points in the case of the Test-land set. The Mixed-ES model also shows notable enhancements, with slightly worse WER than the Mixed-FT-Phone model. The Mixed-FT-Media model performs satisfactorily as well, but shows somewhat lower performance than the others.

Finally, it can also be observed that the Mixed-ES model performs almost as well as the fine-tuned models, even without any fine-tuning technique having been applied.

5.3. Best E2E models compared to LSTM-HMM systems

The aim of this task was to select the best E2E models presented above and compare them to (1) the state-of-the-art models in the literature, where available, and (2) LSTM-HMM based systems developed on top of the Kaldi toolkit. For this experiment, the selected models for each language were named Evolved-EN, Evolved-ES and Evolved-EU.

• Evolved-EN. It corresponds to the English baseline model evolved through fine-tuning and speed-based data augmentation applied for 10 more epochs. The decoding was performed using an external 3-gram LM and a beam size of 1000.

• Evolved-ES. This model refers to the Spanish fine-tuned Mixed-FT-Media. It was decoded with two configurations: (1) using an external 3-gram LM and a beam size of 600, and (2) with a 5-gram LM and a beam size of 1000.

• Evolved-EU. The selected Basque model was the Mel-scale-EU model presented in subsection 5.1. It was decoded using external 3-gram and 5-gram LMs with beam sizes of 600 and 1000 respectively (the external LM integration is sketched below).
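As a rough illustration of how an external KenLM n-gram model can be combined with beam-search hypotheses in these configurations, the sketch below rescores candidate transcriptions with the kenlm Python module. The model file, the interpolation weights and the example hypotheses are assumptions for illustration, not the actual decoding setup used in this paper.

# The ARPA model would be estimated offline with KenLM, e.g.:  lmplz -o 5 < corpus.txt > lm_5gram.arpa
import kenlm

lm = kenlm.Model("lm_5gram.arpa")     # hypothetical 5-gram model file
alpha, beta = 0.8, 1.0                # LM weight and word-insertion bonus (assumed values)

def rescore(acoustic_logprob: float, hypothesis: str) -> float:
    """Combine the acoustic (CTC) score with the LM log-probability and a length bonus."""
    return acoustic_logprob + alpha * lm.score(hypothesis, bos=True, eos=True) + beta * len(hypothesis.split())

beam = [(-4.2, "hola a todos"), (-4.5, "ola a todos")]   # (CTC log-prob, candidate text) examples
best_score, best_text = max((rescore(s, h), h) for s, h in beam)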
In Table 7, a comparison between the evolved model, the Kaldi-based LSTM-HMM model, and the results obtained by reference systems in the literature is presented for the English language. Since Test-noisy was created for this work, the results for this subset are only given for the first two models.

Table 7: WER comparison between our best model, the LSTM-HMM model and reference systems for the English language.

                   Test-clean   Test-other   Test-noisy
Evolved-EN         4.9          15.4         21.0
LSTM-HMM           6.0          15.2         26.3
DeepSpeech-1 [2]   7.8          21.7         -
DeepSpeech-2 [2]   5.3          13.2         -
Human [2]          5.8          12.6         -
Wav2Letter [8]     6.9          -            -

As can be observed in Table 7, the evolved E2E system outperformed the LSTM-HMM system and all the reference systems over the clean test. This shows the effect of applying Mel-scale parametrization and techniques like fine-tuning and speed-based data augmentation. It should also be remarked that the WER obtained is 0.9 percentage points lower than that achieved by human manual transcription.

On the contrary, for the Test-other partition, the LSTM-HMM model shows a better performance, presenting an error rate 0.2 percentage points lower than the proposed E2E model. With respect to the reference systems, the Evolved-EN model obtains a better performance than the Deep Speech 1 model, but still an error rate 2.2 points higher than Deep Speech 2. It has to be considered that the Deep Speech 2 model was trained with 12,000 hours (including the LibriSpeech corpus), more than ten times the amount used for the Evolved-EN system (960 h).

Table 8: WER comparison of Kaldi vs E2E for SAVAS Spanish and Basque.

             Test clean               Test noisy
             3-gram     5-gram        3-gram     5-gram
Spanish
Evolved-ES   8.5        7.2           10.9       9.3
LSTM-HMM     7.9        7.7           11.9       10.8
Basque
Evolved-EU   8.9        6.6           19.2       15.9
LSTM-HMM     7.8        6.0           10.8       8.2

Table 8 shows how the best Spanish E2E model overcame all the results obtained by the LSTM-HMM based system, with the exception of the clean test when a 3-gram was applied. This shows the robustness of the fine-tuned mixed model trained with data from different acoustic domains. Finally, the scarcity of training data for Basque benefited the LSTM-HMM model against the E2E model, which was estimated using Mel spectrograms as the only enhancement technique.

6. Conclusions and Future Work

In this work, E2E ASR systems for English, Spanish and Basque have been developed and evaluated against reference LSTM-HMM architectures. Besides, the positive impact of applying several enhancement techniques has been demonstrated through different experimental evaluations.

As main conclusions, it can be stated that using Mel-scale spectrograms outperforms linear-scale ones, as was proven in all the experiments. Moreover, it was shown that robust hybrid E2E models that perform almost as well as in-domain models can be generated for acoustically different environments. With regard to the training data, as in the case of Basque compared to English, it was clear that more data leads to better results. However, when the resources are limited, using an external LM of a high order can improve the performance, as was shown for all the languages under evaluation.

As future work, one important task will be the generation of new Spanish and Basque E2E models with more training data, since these models were still weak without an LM. Moreover, the current n-grams will be replaced by RNN-based LMs, especially for Basque as an agglutinative language. Finally, following novel studies, new research efforts will be made to develop methodologies for semi-supervised learning within E2E architectures.
7. References

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
[3] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning, ser. ICML’14. JMLR.org, 2014, pp. II-1764–II-1772.
[4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[6] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5060–5064, 2016.
[7] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167–174.
[8] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2Letter: An end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[9] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-CTC: Automatic unit selection and target decomposition for sequence labelling,” arXiv preprint arXiv:1703.00096, 2017.
[10] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct acoustics-to-word models for English conversational speech recognition,” arXiv preprint arXiv:1703.07754, 2017.
[11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[12] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
[13] Y. Fujita, R. Takashima, T. Homma, R. Ikeshita, Y. Kawaguchi, T. Sumiyoshi, T. Endo, and M. Togami, “Unified ASR system using LGM-based source separation, noise-robust feature extraction, and word hypothesis selection,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 416–422.
[14] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5200–5204.
[15] D. Palaz, R. Collobert et al., “Analysis of CNN-based speech recognition system using raw speech as input,” in Proceedings of INTERSPEECH, no. EPFL-CONF-210029, 2015.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[17] S. Braun, D. Neil, and S.-C. Liu, “A curriculum learning method for improved noise robustness in automatic speech recognition,” in Signal Processing Conference (EUSIPCO), 2017 25th European. IEEE, 2017, pp. 548–552.
[18] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[19] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in International Conference on Machine Learning, 2015, pp. 2342–2350.
[20] K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[22] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[23] A. del Pozo, C. Aliprandi, A. Álvarez, C. Mendes, J. P. Neto, S. Paulo, N. Piccinini, and M. Raffaelli, “SAVAS: Collecting, annotating and sharing audiovisual language resources for automatic subtitling,” in LREC, 2014, pp. 432–436.
[24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[25] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter-based speech recognition with gated convnets,” CoRR, vol. abs/1712.09444, 2017. [Online]. Available: http://arxiv.org/abs/1712.09444
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Listening to Laryngectomees: A Study of Intelligibility and Self-reported Listening Effort of Spanish Oesophageal Speech

Sneha Raman, Inma Hernaez, Eva Navas, Luis Serrano

AHOLAB Signal Processing Laboratory, University of the Basque Country (UPV/EHU), Spain
sneha.raman@ehu.eus, inma.hernaez@ehu.eus, eva.navas@ehu.eus, lserrano@aholab.ehu.eus
Abstract

Oesophageal speakers face a multitude of challenges, such as difficulty in basic everyday communication and inability to interact with digital voice assistants. We aim to quantify the difficulty involved in understanding oesophageal speech (in human-human and human-machine interactions) by measuring intelligibility and listening effort. We conducted a web-based listening test to collect these metrics. Participants were asked to transcribe and then rate the sentences for listening effort on a 5-point Likert scale. Intelligibility, calculated as Word Error Rate (WER), showed significant correlation with user-rated effort. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. Listeners familiar with oesophageal speech did not have any advantage over non-familiar listeners in correctly understanding oesophageal speech. However, they reported less effort in listening to oesophageal speech compared to non-familiar listeners. Additionally, we calculated speaker-wise mean WERs, and they were significantly lower than those of an automatic speech recognition system.

Index Terms: Spoken language understanding, Speech intelligibility, Speech and voice disorders, Pathological speech and language, Speech perception

1. Introduction

Oesophageal speech is an equipment-free speech production method used by people whose larynx has been surgically removed (laryngectomees). In spite of the absence of vocal folds, they can still utter intelligible speech using alternative vibrating elements. In the production of oesophageal speech, the pharyngo-oesophageal segment is used as a substitutive vibrating element. Air is swallowed from the mouth and introduced into the oesophagus, after which it is expelled in a controlled way, thereby producing vibration. This generation mechanism introduces acoustic artefacts and makes oesophageal speech difficult and effortful to understand [1] [2], which greatly affects communication, interpersonal relationships and, hence, quality of life. Moreover, these less intelligible voices are problematic for Automatic Speech Recognition (ASR) systems that are becoming ubiquitous in human-computer interaction technologies. The aim of the work presented in this paper is to quantify the difficulty in understanding oesophageal speech by measuring intelligibility and listening effort. Intelligibility is quantified in both Human Speech Recognition (HSR) and ASR contexts.

Several studies have devised systems for the analysis of pathological speech which also include intelligibility measurements [3] [4] [5]. In [3] the authors use a Hidden Markov Model based Automatic Speech Recognition (ASR) system to measure the objective intelligibility of laryngectomees, and also of cleft lip and palate speech. A similar tool for Dutch pathological speech intelligibility calculation is proposed in [5]. In [4] some intelligibility measurement techniques that do not use ASR are described. The main advantage of ASR-based measures is that they are objective and therefore easy to replicate and implement. However, they only evaluate machine intelligibility, not how intelligible the speech is to humans. They also do not consider other important factors like pleasantness, acceptability or listening effort.

Some studies have been conducted on measuring the intelligibility of Spanish oesophageal speech. In [6] the voice intelligibility characteristics of Spanish oesophageal and tracheoesophageal speech are reported. This study was conducted with two-syllable words. Another study [7] showed that the formant frequencies were higher and the duration of vowels longer for laryngectomees as compared to healthy speech. The work in [8] describes a real-time recognition system for vowel segments of Spanish oesophageal voice.

The above-mentioned studies focus on the micro level of words and vowels. Sentence-level HSR studies on the intelligibility of Spanish oesophageal voice are a less traversed area of investigation. In this study, sentences are used as our stimuli.

The downside of intelligibility measurements is that they indicate only how many words have been correctly identified, but not how difficult it was to identify them. This does not do justice to the problem of how effortful the listening was, especially in adverse listening conditions such as pathological speech. A review of listening effort and various methods of measuring it is presented in [9].

Some research has focussed on measuring the processing load associated with oesophageal speech. In [10] the authors measured the acceptability of oesophageal, electro-laryngeal and healthy speech. They found that healthy speech was the most acceptable, followed by superior oesophageal speech and then artificial larynx speech. In [11] highly intelligible tracheoesophageal speech was played to listeners, who were asked to rate the effort of listening as well as the acceptability of each sample, and an inverse correlation between listening effort and acceptability was found. Another observation from this study was that even highly intelligible speech can have varying listener effort. In this study we attempt to explore this processing load phenomenon in addition to the ASR and HSR intelligibility measurements. Firstly, we investigate whether the intelligibility (both ASR and HSR) of healthy and disordered speech is comparable. Secondly, we are interested in seeing if intelligibility and listening effort are correlated.

In [12], the idea of intelligibility differences between experienced and inexperienced listeners of oesophageal speech was explored, and the findings stated that oesophageal speech was ranked similarly for intelligibility by both experienced and inexperienced listeners. Following this thread, we were interested in investigating whether the same result was observed for our dataset for listeners that are familiar and unfamiliar with oesophageal speech. We consider friends, family and close relatives of oesophageal speakers as familiar listeners.
In short, the hypotheses of the experiment described in this paper are:

• Intelligibility or Word Error Rate measurement is correlated with user-rated listening effort.

• Healthy voices are more intelligible and less effortful, compared to oesophageal voices.

• Listeners familiar with oesophageal speech find it less effortful to process, compared to listeners that are not.

• ASR performs worse than HSR for healthy voices, but even more so for oesophageal voices.

We begin by describing the methodology, corpus and details of the listening test. This is followed by the analysis methods used and the results. Finally, the conclusions and future work are presented.

2. Methodology

2.1. Experimental Design

The main task for this experiment was the word recall and transcription task: participants listened to a sentence and then wrote what they had understood. According to [13], the strengths of sentence repetition tasks are that they are "fairly simple cognitive tasks" and that they are "consistent throughout the age span" in the area of neurophysiological tests. Moreover, sentence transcription tasks have been widely used for subjective intelligibility measurements. The work in [14] reports the agreement of sentence transcription tasks with a wide range of intelligibility quantification techniques, and in [15] the method is described as "human speech recognition". Therefore, we chose this approach to calculate WER and consequently the intelligibility.

We were also interested in knowing the listening effort of these utterances. We had the participants rate the sentences for listening effort on a 5-point Likert scale. The options were 'very little', 'a little', 'some', 'quite' and 'a lot'.

To avoid priming and sentence order bias, the sentences were played only once and in a random order.

2.2. Corpus and Stimuli

The parallel data used for this task is 100 phonetically balanced sentences selected from a bigger corpus [16], recorded by 35 healthy speakers and 32 oesophageal speakers. The recordings of the oesophageal speakers were made in an acoustically isolated room with a studio microphone (Neumann TLM 103). The recordings of the healthy speakers have variable sources because they were acquired through an online platform [17]. However, some of them were made in the aforementioned acoustically isolated room, although with a different microphone.

2.2.1. Selection of Speakers

For this experiment we chose oesophageal speakers based on two criteria: proficiency and accessibility. Proficient speakers were those who had undergone laryngectomy and had begun training to speak at least two years prior to the recording. Additionally, an oesophageal voice quality assessment tool [18], based on the factors (speaking rate, regularity, etc.) of the A4S scale of [19], was used as a guide to assess proficiency. Accessibility of speakers was considered because the opportunity to obtain follow-up recordings could be useful for future research. Based on these criteria, we chose 4 speakers, three male and one female, making the selection gender inclusive (there are only 4 women in the whole database and only 2 of them fulfilled the two criteria of proficiency and accessibility).

The criteria for choosing healthy speakers were quality of recording as well as gender balance. One male and one female healthy speaker were chosen.

2.2.2. Selection of Sentences

A pilot listening test was conducted within the lab to assess the feasibility of this corpus for the sentence transcription task. The participants chosen for this pilot study were unfamiliar with the sentences of the corpus and thus not subject to priming. After the pilot test, the participants reported that some sentences were too long to remember and hence effortful to transcribe. Additionally, although semantically and syntactically correct, the sentences were rich in content and contained words that are difficult to guess, often including proper names, dates, etc.

This led us to reconsider the length of the sentences, and we decided to choose a subset of shorter sentences, which would make them suitable for sentence transcription. We used the CorpusCRT tool [20], which generates a phonetically balanced subset of sentences based on the provided phonetic criteria. In this case, the criterion we used was a maximum of 40 phonemes, and this gave us a set of 30 sentences, each of which had a maximum of 10 words. Some examples of the sentences are the following: '¿Qué diferencia hay entre el caucho y la hevea?' (What is the difference between rubber and hevea?), 'Unos días de euforia y meses de atonía.' (A few days of euphoria and months of atony.)

All the selected sentences (both from oesophageal and healthy speakers) were normalised to a common peak value (0.8) to achieve a homogeneous and comfortable level of loudness.

2.3. The Listening Test

We created six mutually exclusive sets of sentences such that each set contained 30 different sentences and exactly 5 sentences from each speaker. As a result, all 180 sentences (30 sentences from six speakers) were covered after every sixth participant. This ensured equal coverage of all sentences and speakers. Each participant was assigned one of these sets and listened to the sentences in a random order.

The participants were asked to use headphones for the study unless impossible. They were assured that it was not a test of hearing and that the test was being conducted to obtain their honest and uninhibited response. They were told that each sentence could be played only once and that they should pay close attention and type what they heard. If they missed some portions or were unsure of what they heard, they could put three dots (...) in that place. Additionally, they were asked to mark a response for the amount of effort they experienced for that sample on the aforementioned Likert scale. The first couple of sentences presented were practice sentences (one healthy and one oesophageal), to familiarise the participant with the task. These sentences were sampled from the same corpus [21] but were different from the ones that appear in the actual test.

We collected the following information from the participants: age, presence of hearing impairment, the kind of audio equipment used (good quality headphones, normal quality headphones, good loudspeakers and bad equipment) and whether the listener had close contact with laryngectomees.

The listening test (https://aholab.ehu.eus/users/sneha/Listening test.php) was web based and it was therefore possible to reach out to a wide range of participants. However, this also meant differences in audio equipment, and the effects of this on the responses are reported in the results section.
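The amplitude normalisation mentioned in Section 2.2.2 is a simple preprocessing step. A minimal sketch (file names are assumed for illustration) that scales every stimulus so that its maximum absolute amplitude equals 0.8 could look as follows:

import numpy as np
import soundfile as sf

def normalise_to_peak(in_path: str, out_path: str, peak: float = 0.8) -> None:
    """Scale the waveform so that max(|x|) equals the requested peak value."""
    audio, sr = sf.read(in_path)
    max_abs = np.max(np.abs(audio))
    if max_abs > 0:
        audio = audio * (peak / max_abs)
    sf.write(out_path, audio, sr)

normalise_to_peak("sentence_raw.wav", "sentence_norm.wav")   # hypothetical file names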
2.4. Automatic Speech Recognition

To have an objective measure of the intelligibility (WER), we prepared an ASR system for Spanish using the Kaldi toolkit [22]. This approach was chosen because it allowed us to control the processing operations followed during recognition, as well as basic aspects of the recognition process such as the lexicon and the language model. It is implemented following the s5 recipe for the Wall Street Journal database. The acoustic features used are 13 Mel-Frequency Cepstral Coefficients (MFCCs), to which cepstral mean and variance normalization (CMVN) is applied to mitigate the effects of the channel. The details of the training procedures are described in [23].

The audio material used to train the Spanish recogniser was healthy laryngeal speech, as described in [24]. However, due to the characteristics of the sentences used for the evaluation, some modifications were made to this ASR system. Although the acoustic models were maintained, a new lexicon was created from the 100-sentence corpus used in the experiment (701 words). This was done because, using the original lexicon (with 37,632 entries), as much as 23% of the words were out-of-vocabulary (OOV) words. This is due to the fact that the sentences are phonetically balanced, and many sentences containing proper names and unusual words were chosen to maximize the variability of the phonetic content. Together with this reduced lexicon, a unigram language model with equally probable words was used.

Although the final WER numbers obtained in this way are not comparable to a realistic ASR situation, the procedure serves our purpose of evaluating the intelligibility of the sentences, comparing the performance of healthy and oesophageal speakers, and establishing a baseline reference for future developments in the field (such as evaluating the improvements of speech modification algorithms).

3. Analysis and Results

We had 57 native Spanish participants in this test, 15 of whom had close contact with laryngectomees and hence were familiar with oesophageal speech. The age of the listeners ranged from 21 to 70 and the mean age was 36.6.

Prior to calculating WER, an initial clean-up was performed on the data. This included removing any punctuation or special characters, and some typing errors (accented vowels, use of upper and lower case, spelling of proper or foreign names, etc.). The WER was obtained after correcting these transcription errors.

WER was calculated [25] using the Levenshtein distance between the reference sentence and the hypothesis sentence (the sentence transcribed by the listener). This method calculates the distance by counting the insertions, deletions and substitutions observed in the hypothesis sentence when compared to the reference sentence.

Self-reported listening effort responses were assigned numeric values ranging from 1 to 5, with 1 corresponding to 'very little' and 5 to 'a lot'.

We performed an ANOVA analysis on the dataset using the JASP tool [26] to quantify the effects of speaker type and familiarity of the listener with oesophageal speech. The audio device used by the listener had no effect on the HSR WER results (F(3,1256)=0.707, p=0.548) or on listening effort (F(3,1256)=0.705, p=0.549).

In addition, we present the WER results from the ASR system for all speakers.
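The word-level Levenshtein computation just described can be written in a few lines. The following sketch (not the MATLAB implementation cited as [25]) counts substitutions, insertions and deletions with dynamic programming and divides by the number of reference words:

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example with one of the test sentences and a hypothetical listener transcription:
print(wer("qué diferencia hay entre el caucho y la hevea",
          "qué diferencia hay entre el coche y la idea"))   # 2 errors over 9 words -> 0.222...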
Table 1: Mean WER and effort.

                              Oesophageal   Healthy
Effort        Familiar        2.61          1.25
              Not familiar    3.54          1.26
              Total mean      3.07          1.255
WER (in %)    Familiar        17.39         7.42
              Not familiar    18.35         4.85
              Total mean      17.87         6.16

3.1. Word Error Rates from HSR

Table 1 presents the mean WERs and Figure 1 shows the speaker-wise WERs for familiar and unfamiliar listeners. OM, OF, HM and HF are acronyms for Oesophageal Male, Oesophageal Female, Healthy Male and Healthy Female respectively. Mean WER is always higher for oesophageal speech compared to healthy speech, as expected. There is no major difference in the WER for familiar and unfamiliar listeners in the case of oesophageal speech. This result corroborates the conclusions in [12]. For healthy speech there is a slight difference of around 3 points in the mean WER, but as can be seen in Figure 1 the difference is not meaningful.

The ANOVA results show that familiarity with oesophageal speech had no effect on WER (F(1,1590)=0.360, p=0.548). On the other hand, speaker type has a strong effect on WER (F(1,1590)=129.552, p<0.001).

Figure 1: Mean speaker-wise Word Error Rates for oesophageal (OM1, OM2, OM3, OF1) and healthy (HM1, HF1) speakers. Error bars show 95% confidence intervals.

3.2. Self-reported Listening Effort

Mean self-reported listening effort values are given in Table 1, and Figure 2 shows the speaker-wise values. As expected, effort is higher for oesophageal speech compared to healthy speech. However, when listening to oesophageal speech, the perceived effort is significantly lower for familiar listeners than for not familiar listeners. Indeed, the ANOVA analysis shows that familiarity with oesophageal speech has an effect on effort (F(1,1590)=84.94, p<0.001) and speaker type has a strong effect on effort (F(1,1590)=1243.94, p<0.001).
Figure 2: Mean speaker-wise self-reported effort values for oesophageal (OM1, OM2, OM3, OF1) and healthy (HM1, HF1) speakers. 1 corresponds to least effortful and 5 to most effortful. Error bars show 95% confidence intervals.

3.3. Correlation of Intelligibility and Listening Effort

The correlation between intelligibility (WER) and self-reported effort is 0.479 (Pearson's r, p<0.001). This is a weak but significant correlation, which indicates that sentences with more transcription errors are perceived as more effortful. This relationship between WER and self-reported effort is illustrated as a box-plot in Figure 3.

Figure 3: Correlation between WER and user-rated listening effort.

3.4. Word Error Rates from ASR

The ASR experiment was performed using all the 100 available sentences for each speaker, and not only the subset used for the human intelligibility measurements. This was convenient in order to obtain a reliable WER measure. It can be observed from Figure 4 that the ASR performs poorly for both healthy and oesophageal speech. The fact that the system uses a unigram language model contributes greatly to this poor performance. As expected, the WER for oesophageal speakers is significantly higher than for healthy speakers.

Figure 4: Word Error Rates for HSR and ASR.

Figure 4 also shows the HSR results. We can observe that HSR and ASR perform differently for the different speakers. However, the number of speakers in this experiment is too small to draw any reliable conclusion about the variation of ASR and HSR across speakers.

4. Conclusions and Future Work

Healthy voices are on average three times as intelligible as oesophageal voices. The mean self-reported effort was also three times larger for oesophageal speech compared to healthy voices. There was a significant correlation between intelligibility and effort. Speaker type had an effect on both intelligibility and effort. Listeners familiar with oesophageal speech fared the same for intelligibility as people who were not; however, they reported less effort in listening to oesophageal speech than the not familiar listeners. The ASR system we chose for this task had poorer WER for oesophageal voice compared to healthy voice.

The listening effort obtained through this study is based on the listener's own interpretation of 'effort involved in listening'. This will provide us with a reference for comparison when we perform objective listening effort measurements in the future using physiological methods such as EEG and pupillometry. If these subjective measures are found to be correlated with the physiological measurements, that opens the possibility of using the less cumbersome self-report strategy to achieve our purpose of evaluation.

Both HSR intelligibility and ASR intelligibility play different but important roles in oesophageal speech evaluation. While improved HSR would enable better human-human interactions, an improved ASR performance would enable better human-machine interactions (e.g. digital voice assistants). Lower listening effort would also contribute towards better communication with humans.

Our main future work is to build an oesophageal voice restoration system aimed at better ASR and HSR intelligibility and low listening effort.

5. Acknowledgements

This project has received funding from the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).

The work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (MINECO/FEDER, UE) (RESTORE project, TEC2015-67163-...).
6. References

[1] B. Weinberg, “Acoustical properties of esophageal and tracheoesophageal speech,” Laryngectomee Rehabilitation, pp. 113–127, 1986.
[2] T. Most, Y. Tobin, and R. C. Mimran, “Acoustic and perceptual characteristics of esophageal and tracheoesophageal speech production,” Journal of Communication Disorders, vol. 33, no. 2, pp. 165–181, 2000.
[3] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M. Schuster, and E. Nöth, “PEAKS: A system for the automatic evaluation of voice and speech disorders,” Speech Communication, vol. 51, no. 5, pp. 425–437, 2009.
[4] C. Middag, T. Bocklet, J.-P. Martens, and E. Nöth, “Combining phonological and acoustic ASR-free features for pathological speech intelligibility assessment,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[5] C. Middag, J.-P. Martens, G. Van Nuffelen, and M. De Bodt, “DIA: A tool for objective intelligibility assessment of pathological speech,” in 6th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications. Firenze University Press, 2009, pp. 165–167.
[6] J. L. Miralles and T. Cervera, “Voice intelligibility in patients who have undergone laryngectomies,” Journal of Speech, Language, and Hearing Research, vol. 38, no. 3, pp. 564–571, 1995.
[7] T. Cervera, J. L. Miralles, and J. González-Àlvarez, “Acoustical analysis of Spanish vowels produced by laryngectomized subjects,” Journal of Speech, Language, and Hearing Research, vol. 44, no. 5, pp. 988–996, 2001.
[8] A. Mantilla, H. Pérez-Meana, D. Mata, C. Angeles, J. Alvarado, and L. Cabrera, “Recognition of vowel segments in Spanish esophageal speech using hidden Markov models,” in Computing, 2006. CIC’06. 15th International Conference on. IEEE, 2006, pp. 115–120.
[9] R. McGarrigle, K. J. Munro, P. Dawes, A. J. Stewart, D. R. Moore, J. G. Barry, and S. Amitay, “Listening effort and fatigue: What exactly are we measuring? A British Society of Audiology Cognition in Hearing Special Interest Group white paper,” International Journal of Audiology, 2014.
[10] S. Bennett and B. Weinberg, “Acceptability ratings of normal, esophageal, and artificial larynx speech,” Journal of Speech, Language, and Hearing Research, vol. 16, no. 4, pp. 608–615, 1973.
[11] K. F. Nagle and T. L. Eadie, “Listener effort for highly intelligible tracheoesophageal speech,” Journal of Communication Disorders, vol. 45, no. 3, pp. 235–245, 2012.
[12] W. L. Cullinan, C. S. Brown, and P. D. Blalock, “Ratings of intelligibility of esophageal and tracheoesophageal speech,” Journal of Communication Disorders, vol. 19, no. 3, pp. 185–195, 1986.
[13] J. Meyers, K. Volkert, and A. Diep, “Sentence repetition test: Updated norms and clinical utility,” vol. 7, pp. 154–9, Feb. 2000.
[14] K. M. Yorkston and D. R. Beukelman, “A comparison of techniques for measuring intelligibility of dysarthric speech,” Journal of Communication Disorders, vol. 11, no. 6, pp. 499–512, 1978.
[15] R. P. Lippmann, “Speech recognition by machines and humans,” Speech Communication, vol. 22, no. 1, pp. 1–15, 1997.
[16] I. Sainz, D. Erro, E. Navas, I. Hernáez, J. Sanchez, I. Saratxaga, and I. Odriozola, “Versatile speech databases for high quality synthesis for Basque,” in 8th International Conference on Language Resources and Evaluation (LREC), 2012, pp. 3308–3312. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2012/pdf/126 Paper.pdf
[18] N. Tits, “Exploring the parameters describing the quality and intelligibility of alaryngeal voices,” University of Mons, 2017.
[19] T. Drugman, M. Rijckaert, C. Janssens, and M. Remacle, “Tracheoesophageal speech: A dedicated objective acoustic assessment,” Computer Speech & Language, vol. 30, no. 1, pp. 16–31, 2015.
[20] A. Sesma and A. Moreno, “CorpusCRT 1.0: Diseño de corpus orales equilibrados,” UPC, Tech. Rep., Dec. 2000.
[21] D. Erro, I. Hernáez, E. Navas, A. Alonso, H. Arzelus, I. Jauk, N. Q. Hy, C. Magarinos, R. Pérez-Ramón, M. Sulír et al., “ZureTTS: Online platform for obtaining personalized synthetic voices,” Proceedings of eNTERFACE, pp. 1178–1193, 2014.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[23] S. P. Rath, D. Povey, K. Veselý, and J. Černocký, “Improved feature processing for deep neural networks,” in Interspeech, 2013, pp. 109–113.
[24] L. Serrano, D. Tavarez, I. Odriozola, I. Hernaez, and I. Saratxaga, “Aholab system for Albayzin 2016 search-on-speech evaluation,” in IberSPEECH, 2016, pp. 33–42.
[25] E. Polityko, “Word error rate.” [Online]. Available: https://www.mathworks.com/examples/matlab/community/19873-word-error-rate, access date: 20th February 2018.
[26] JASP Team, “JASP (Version 0.8.6) [Computer software],” 2018. [Online]. Available: https://jasp-stats.org/, access date: 20th February 2018.
[17] D. Erro, I. Hernáez, A. Alonso, D. García-Lorenzo, E. Navas,
J. Ye, H. Arzelus, I. Jauk, N. Hy, C. Magariños, R. Pérez-Ramón,
M. Sulı́r, X. Tian, and X. Wang, “Personalized synthetic voices
for speaking impaired: Website and app,” in Proceedings of the
Annual Conference of the International Speech Communication
Association, INTERSPEECH, 2015.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Towards an Automatic Evaluation of the Prosody of Individuals with Down Syndrome

Mario Corrales-Astorgano1, Pastora Martínez-Castilla2, David Escudero-Mancebo1, Lourdes Aguilar3, César González-Ferreras1, Valentín Cardeñoso-Payo1

1 Department of Computer Science, University of Valladolid, Spain
2 Department of Developmental and Educational Psychology, UNED, Spain
3 Department of Hispanic Philology, Universitat Autònoma de Barcelona, Spain

mcorrales@infor.uva.es, pastora.martinez@psi.uned.es, descuder@infor.uva.es, lourdes.aguilar@uab.cat, cesargf@infor.uva.es, valen@infor.uva.es

Abstract

Prosodic skills may be powerful to improve the communication of individuals with intellectual and developmental disabilities. Yet, the development of technological resources that consider these skills has received little attention. One reason that explains this gap is the difficulty of including an automatic assessment of prosody that considers the high number of variables and the heterogeneity of such individuals. In this work, we propose an approach to predict prosodic quality that will serve as a baseline for future work. A therapist and an expert in prosody judged the prosodic appropriateness of speech samples of individuals with Down syndrome collected with a video game. The judgments of the expert were used to train an automatic classifier that predicts the quality by using acoustic information extracted from the corpus. The best results were obtained with an SVM classifier, with a classification rate of 79.30%. The difficulty of the task is evidenced by the high inter-human rater disagreement, justified by the speakers' heterogeneity and the evaluation conditions. Although only 10% of the oral productions judged as correct by the referees were classified as incorrect by the automatic classifier, a specific analysis with bigger corpora and reference recordings of people with typical development is necessary.

Index Terms: Prosody, Automatic Classification, Down syndrome, Educational Video Games

1. Introduction

The collective of individuals with Down syndrome shows a series of cognitive, learning and attentional limitations. All the areas of language are altered, but not to the same degree, as described in [1]. Although lexical acquisition is delayed, morphology and syntax appear to be more affected than vocabulary [2]. Regarding pragmatics, individuals with Down syndrome show difficulties when producing and understanding questions and emotions, signaling turn-taking, or keeping topics in conversation, and the study in [3] demonstrated that children with Down syndrome are impaired relative to norms from typically developing children in all areas of pragmatics. At the phonological level, speech intelligibility is seriously damaged by the presence of errors in producing some phonemes, the loss of consonants and the simplification of syllables [4].

Concerning prosody, [5] report disfluencies (stuttering and cluttering) and impairments in the perception, imitation and spontaneous production of prosodic features; the authors of [6] have connected some of the speech errors with difficulties in the identification of boundaries between words and sentences. Nevertheless, characterizing prosodic impairments in populations with developmental disorders is a hard task [7]. To fulfill such an aim, prosody assessment procedures appropriate for use with individuals with intellectual and/or developmental disabilities need to be employed. The Profiling Elements of Prosody in Speech-Communication (PEPS-C) test has proved to be successful in this respect [8, 9]. When used with English-speaking children with Down syndrome, lower performance than expected by chronological age is observed in all prosody tasks [10]. After comparisons with typically developing children matched for mental age, impairments are also found for the discrimination and imitation of prosody [10].

There are technological tools focused on language therapy [11, 12]. However, the difficulty of separating the effects of each of the suprasegmental features on communication, together with the multiplicity of right possibilities to arrive at the same intonational meaning, explains that little attention has been paid to the development of technological resources that specifically consider the learning of prosody in students with special needs, specifically in those with Down syndrome. To advance in the line of developing specific resources to minimize the limitations concerning prosody and pragmatics in individuals with Down syndrome, we have developed an educational video game to train prosody, PRADIA: Mystery in the city [13, 14].1

1 The activities of Down syndrome speech analysis continue (1/2018-12/2020) in the project funded by the Ministerio de Ciencia, Innovación y Universidades and the European Regional Development Fund FEDER (TIN2017-88858-C2-1-R) and in the project funded by Junta de Castilla y León (VA050G18).

Although the video game was designed with the aim of training prosody in individuals with Down syndrome, it became a tool to collect their oral productions and thus to construct a prosodic corpus. These aims are achieved thanks to the fact that the main way of interaction of the player with the game is through the voice. To advance in the game, the player must give an adequate oral response in different communicative circumstances, where prosodic features are the most relevant to achieve a correct pragmatic interpretation. In its current version, the video game needs the constant presence of a person (ideally a therapist) who guides the gamer throughout the adventure and who evaluates the success in the resolution of the production activities. The assistance of the therapist has been proved crucial to motivate individuals with Down syndrome. Even so, it would be desirable to improve their autonomy and to help trainers in their therapies with new functionalities by including a module of automatic assessment of prosodic quality.

If we turn to the field of automatic assessment, the automatic classification of different speech dimensions is well researched, but focused on specific aspects or reduced populations. Some works focus on the speech intelligibility of people with aphasia [15] or on speech intelligibility in general [16]. Others try to identify speech disorders in children with cleft lip and palate [17]. In addition, speech emotion and autism spectrum disorder recognition have been investigated [18]. The point is that all these works include a subjective evaluation done by experts as a gold standard to train the classification systems.

In this work, we analyse the difficulties of automatically predicting the quality of the prosody of an oral production and propose a new approach that will serve as a baseline for future work. Recordings of individuals with Down syndrome collected in different sessions of use of the educational video game PRADIA: Mystery in the city were used to obtain information about the relevant features needed to make an automatic classification of the productions. The speech corpus obtained over the course of the game was judged by a therapist, who evaluated the quality of the oral productions in real time, and by a prosody expert, who did an offline evaluation. The difference in the experimental procedure will be used to investigate whether an automatic system can rely only on prosodic variables to judge the oral productions of the players (offline evaluation), or whether other features related to the game dynamics should also be incorporated in the system. The judgments of the expert are used to train an automatic classifier that predicts quality by using acoustic information extracted from the audios of the corpus.

In section 2, the experimental procedure is described, which includes the procedure for corpus collection, the processing of the speech material and the classification of the samples. The results section shows the effectiveness of the procedure although, at the same time, the difference in the evaluation procedures highlights the need to carefully define the selection of features in the classification process. We end the paper with a discussion about the relevance of the results and the conclusions and future work section.

2. Experimental procedure

2.1. Corpus description

The three subcorpora were recorded using the video game, but the version of the video game and the recording context were different. The complete description of each subcorpus can be seen in Table 1.

Table 1: Corpus description. Concerning the therapist decision (made in real time), Cont.R (Continue Right) means that the activity was rightly resolved, Cont. (Continue) means that the activity was satisfactorily resolved and Rep. (Repeat) means that the activity was faultily resolved. Concerning the expert judgment (made offline), Right means that the recording was rightly produced and Wrong means that the recording was wrongly produced.

Speaker  #Utterances  Cont.R  Cont.  Rep.  Right  Wrong  Corpus
S01      120          70      33     17    87     33     C1
S02      106          90      16     0     81     25     C1
S03      97           93      3      1     78     19     C1
S04      131          19      51     61    75     56     C1
S05      151          21      54     76    77     74     C1
S06      30           x       x      x     19     11     C2
S07      34           x       x      x     13     21     C2
S08      28           x       x      x     23     5      C2
S09      43           x       x      x     20     23     C2
S10      33           x       x      x     29     4      C2
S11      57           x       x      x     31     26     C3
S12      12           x       x      x     7      5      C3
S13      7            x       x      x     2      5      C3
S14      11           x       x      x     3      8      C3
S15      33           x       x      x     19     14     C3
S16      10           x       x      x     6      4      C3
S17      8            x       x      x     5      3      C3
S18      11           x       x      x     6      5      C3
S19      10           x       x      x     6      4      C3
S20      10           x       x      x     6      4      C3
S21      9            x       x      x     1      8      C3
S22      7            x       x      x     3      4      C3
S23      8            x       x      x     3      5      C3
Total    966          293     157    155   465    302

To build the subcorpus C1, five young adults with DS (mean age 198 months) were recruited from a local Down syndrome Foundation located in Madrid (Spain). To account for the variability often found in individuals with Down syndrome and to get measurements of different developmental variables, all of the participants were administered the following tests. The Peabody Picture Vocabulary Scale-III [19] was used to assess verbal mental age, the forward digit-span subtest included in the Wechsler Intelligence Scale for Children-IV [20] was employed to evaluate verbal short-term memory, and Raven's Coloured Progressive Matrices [21] served as a means to measure non-verbal cognitive level. Descriptive characteristics and the scores obtained are shown in Table 2. The full PEPS-C battery in its Spanish version [22] was also administered to participants to have specific measurements of prosody level. The mean percentage of success in perception and production PEPS-C tasks is also presented in Table 2. Once these assessments were completed, participants were administered the PRADIA video game. Each participant used PRADIA for a total duration of 4 hours, distributed in 4 sessions of 1 hour per week. Participants were supported by a speech and language therapist who knew them in advance and was an expert at working with individuals with Down syndrome. The therapist explained the game, helped participants when needed, and took notes about how each session developed. Importantly, the therapist also assessed participants' speech productions and thus monitored their rhythm of progress within the video game.

The C2 subcorpus was also recorded using the PRADIA software. These recordings were obtained through the video game within one session of software testing with real users. This test session was done in a school of special education located in Valladolid (Spain). Five adults with Down syndrome, aged 18 to 25, participated in this test. The judgments obtained during this game session were discarded for this work because the speech productions were not evaluated by a therapist. The oral productions were judged in an offline mode by the expert in prosody.

The C3 subcorpus was recorded using an older version of the PRADIA software, the Magic Stone [23], with fewer types of production activities. Eighteen young adults with Down syndrome participated in the different game sessions, which focused on how these users interacted with the video game. Five of these eighteen speakers also participated in the recordings of the C2 subcorpus, so their productions were discarded from the C3 subcorpus. As in the C2 subcorpus, the judgments made by the assistant who helped players complete the adventure were not considered in the classifications. Instead, the oral productions were judged in an offline mode by the expert in prosody.

2.2. Corpus evaluation

During the game sessions, a speech therapist sits next to the player and evaluates the production activities in real time. Consequently, in the C1 corpus, the therapist adapted her judgments to both the general developmental level of participants and their emotional and motivational level.

The video game allows the result of the oral activities to be evaluated by typing a concrete key on the computer keyboard where the game is installed. If the evaluation is Cont.R (Continue with right result) or Cont. (Continue but the oral activity could be better), the video game advances to the next activity. If the evaluation is Rep. (Repeat), the game offers a new attempt in which the player has to repeat the activity. For each activity, there is a predetermined number of attempts: when the attempts finish, the video game goes to the next screen to avoid frustration on the player, even if the activity has not been successfully completed (and the therapist continues judging with Rep.).

On the other hand, an expert in prosody evaluated the three subcorpora of oral productions of 23 speakers with Down syndrome in an offline mode. Due to the difficulty of the task and the different context of the evaluation, the prosody expert used a reduced evaluation system (Right or Wrong production). The judgments were made relying on a purely auditory basis, without any acoustic analysis of the sentences, and the focus was on the intonational and prosodic structure. Related to this, factors of intelligibility, quality in pronunciation or adjustment to the expected sentence were not taken into account. Even in the case of speakers with a low cognitive level and serious problems of intelligibility, the main criterion was whether they had modeled prosody with certain success, even if the message was not understood. Following the categories of intonational phonology [24] and the learning objectives included in PRADIA [14], criteria concerning intonation, accent and prosodic organization were used to judge if the sentence was Right or Wrong: in short, adjustment to the expected modality; respect for the difference between lexical stress and accent (tonal prominence); and adjustment to the organization in prosodic groups, relying mainly on the distinction between function and content words.

Table 2: Description of the C1 subcorpus. For each speaker, this table shows Chronological age (CA), Verbal mental age (VA), Short-term verbal memory (STVM), and Non-verbal cognitive level (NVCL). Ages are expressed in months. In addition, the mean percentage of success in perception (MPercT) and production (MProdT) PEPS-C tasks are included.

Speaker  Gender  CA   VA  STVM      NVCL  MPercT  MProdT
S01      f       195  84  94        17    69.79%  48.30%
S02      m       204  99  134       18    76.04%  72.10%
S03      f       178  96  78        20    73.96%  74.65%
S04      m       190  60  below 74  10    60.42%  49.76%
S05      m       223  69  below 74  13    56.25%  54.84%
2.3. Feature extraction

The openSMILE toolkit [25] was used to extract acoustic features from each recording of the C1, C2 and C3 subcorpora. The GeMAPS feature set [26] was selected due to the variety of acoustic and prosodic features contained in this set. This set contains frequency related features, energy related features, spectral features and temporal features. The arithmetic mean and the coefficient of variation were calculated on these features. Furthermore, 4 additional temporal features were added: the silence and sounding percentages, silences per second and the mean silences. The complete description of these features can be found in previous research [27]. In this work, only prosodic features (frequency, energy and temporal) have been used, because spectral features improve speaker identification, and classifiers can then be adapted to each speaker in the classification process. In total, 34 prosodic features were employed.
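As a rough illustration of this step, the sketch below uses the opensmile Python wrapper to obtain GeMAPS functionals for one recording; the file name and the exact feature-set version are assumptions, and the extra silence-based features described above are not computed here.

```python
# Illustrative sketch of GeMAPS functional extraction (the paper used the
# openSMILE toolkit directly; this wrapper call is an assumed equivalent).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,        # GeMAPS acoustic parameter set
    feature_level=opensmile.FeatureLevel.Functionals,   # per-utterance statistics
)
feats = smile.process_file("S01_utterance_001.wav")     # hypothetical file name
print(feats.shape)  # one row of functionals for the utterance
# The silence/sounding percentages, silences per second and mean silences
# described above would be computed separately and appended to this vector.
```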
2.4. Automatic classification

As explained in section 2.2, the recordings were evaluated by the therapist and the prosody expert. Since the final aim of the module is to decide if the gamer can continue the game or should repeat the activity (without considering degrees of failure), the evaluation of the expert was used to build the classifier. According to this, the output of the different classifiers is Right (R) or Wrong (W), based on the prosody expert scoring. The Weka machine learning toolkit [28] was used and three different classifiers were compared: the C4.5 decision tree (DT), the multilayer perceptron (MLP) and the support vector machine (SVM). In addition, the results of using the recordings of the three corpora, as well as all combinations of these corpora, were compared.

Furthermore, the stratified 10-fold cross-validation technique was used to create the training and testing datasets. We also used feature selection before training the classifiers: the features were selected by measuring the information gain on the training set and discarding the ones for which the information gain equals zero (column Feat. in Table 3).
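The paper's pipeline was built in Weka; the following scikit-learn sketch reproduces the same idea (stratified 10-fold cross-validation, information-gain-style feature selection, SVM) with placeholder data, so the exact filter and parameters are assumptions rather than the authors' configuration.

```python
# Rough scikit-learn analogue of the Weka setup described above.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(200, 34)             # placeholder: 34 prosodic features per utterance
y = np.random.randint(0, 2, size=200)   # placeholder labels: 1 = Right, 0 = Wrong

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=21),  # keep only informative features
    SVC(kernel="rbf"),
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```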
3. Results

Table 1 and Table 2 show a high difference between speakers related to their developmental level and prosodic skills. S04 and S05 have the lowest scores in verbal mental age (60 and 69, respectively), short-term verbal memory (below 74 for both speakers) and non-verbal cognitive level (10 and 13, respectively). In addition, both of them have the lowest mean percentage of success in the perception PEPS-C tasks (60.42% and 56.25%, respectively) and a lower mean percentage of success in the production PEPS-C tasks (49.76% and 54.84%, respectively). These low scores are related to the quality of the productions, with a higher percentage of W assignments from the prosody expert (42.75% and 49%, respectively) and a higher percentage of Rep. from the therapist (47% and 50%, respectively).

The classification results highly depend on the corpus and the classifier used (Table 3). The SVM classifier works better with all corpora and the worst results are obtained using the DT classifier (best case is 79.3% vs a 64.94% baseline). The best results are obtained in Cases A and D by using any of the three classifiers (UAR 0.83 with the SVM classifier). The classification accuracy decreases when the C3 corpus is entered (Cases C, E, F and G) as the number of speakers substantially increases. Moreover, when the same features are used to identify speakers instead of the quality of the utterance (column SR rate of Table 3), the scenarios of Cases C, E and G are the worst ones and those of Cases A and B are the best. In order to see the influence of the speaker on the classification results, we present results per speaker in Table 4.

We focus on Case D to present results per speaker in Table 4. Only the samples of corpus C1 are analyzed because they were evaluated by the two evaluators. Comparing the R-W judgments of the expert with the classifier predictions, there is a high recall in the R-R case for all speakers (S01 83.91%, S02 87.65%, S03 97.44%, S04 94.67%, S05 87.01%). The coincidence in the W-W case is lower: while S02 and S05 present a reasonable classification rate (72% and 70.27%, respectively), the result for S03 goes down to 26.32%. Concerning this result, we note that most of the utterances judged as wrong by the expert were rated as right by the therapist (100% in cell W-Cont.R for S03). On average, we obtain only 10.05% of false negatives. This will be discussed in the next section as a positive result for real time situations.

Table 3: Classification results depending on the corpus and the classifier used. The prosody expert judgments were used to train the classifiers. BL means the performance baseline of each group of samples (number of samples of the most populated class divided by all the samples). DT means decision tree, SVM means support vector machine and MLP means multilayer perceptron. CR means the classification rate, AUC means the area under the curve and UAR means the unweighted average recall. The number of samples (#Utt.), the number of features (#Feat.), the number of speakers (#SPK) and the speaker classification rate using SVM (SR rate) are presented. The output of the different classifiers is Right or Wrong, based on the prosody expert scoring.

                          DT                     SVM                    MLP
        Corpora   BL      CR      AUC   UAR      CR      AUC   UAR     CR      AUC   UAR     #Utt.  #Feat.  #SPK  SR rate
Case A  C1        65.79%  69.57%  0.68  0.74     78.49%  0.74  0.83    73.23%  0.70  0.79    605    21      5     69.92%
Case B  C2        61.90%  60.26%  0.58  0.61     72.68%  0.70  0.79    68.49%  0.67  0.73    168    16      5     88.01%
Case C  C3        50.78%  65.76%  0.66  0.66     61.58%  0.62  0.69    63.71%  0.64  0.64    193    7       13    30.05%
Case D  C1+C2     64.94%  70.77%  0.68  0.75     79.30%  0.76  0.83    72.57%  0.70  0.78    773    21      10    64.94%
Case E  C1+C3     62.16%  66.29%  0.65  0.69     72.31%  0.70  0.79    67.17%  0.65  0.74    798    20      18    52.26%
Case F  C2+C3     55.96%  60.94%  0.60  0.64     66.47%  0.66  0.75    64.00%  0.63  0.69    361    13      18    64.27%
Case G  C1+C2+C3  62.11%  66.88%  0.66  0.71     74.32%  0.71  0.81    69.37%  0.66  0.76    996    20      23    59.21%
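For reference, the metrics reported in Table 3 can be computed from predictions as follows (illustrative arrays only; scikit-learn's balanced accuracy equals the UAR).

```python
# Sketch of the evaluation metrics used in Table 3 (placeholder data).
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                    # 1 = Right, 0 = Wrong
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]                    # hard decisions
y_score = [0.9, 0.8, 0.3, 0.4, 0.2, 0.6, 0.7, 0.1]   # classifier scores

print("CR  =", accuracy_score(y_true, y_pred))           # classification rate
print("UAR =", balanced_accuracy_score(y_true, y_pred))  # unweighted average recall
print("AUC =", roc_auc_score(y_true, y_score))           # area under the ROC curve
```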

Concerning the therapist judgments, the Cont.R decision could be identified as a Right assignment in a high percentage of cases for the S01, S02 and S03 speakers (69%, 85% and 95% respectively). They are the participants with a higher developmental level, according to Table 2. Among these three participants, the first one (with the lowest inter-judge agreement) showed the lowest prosodic level from the outset. In general, the correspondence between real time decisions and expert judgment is not straightforward, with a high variety in the contingency table. Concerning the therapist Rep. decision, it is clear that the highest percentages of agreement are obtained for the S04 and S05 speakers (62.5% and 72.97%, respectively), who are the least qualified speakers in Table 2.

Table 4: Percentage of coincidence between therapist decision, classifier (SVM in Case D) and prosody expert per speaker. Concerning the classifier, R represents the utterances classified as Right by the classifier and W represents the utterances classified as Wrong by the classifier. Each row percentage is relative to the number of utterances of each type in the prosody expert evaluation.

                 Expert judgment       Classified as         Therapist decision
Speaker  #Total  type   #utt           R        W            Cont.R   Cont.    Rep.
S01      120     R      87             83.91%   16.09%       68.97%   24.14%   6.90%
                 W      33             57.58%   42.42%       30.30%   36.36%   33.33%
S02      106     R      81             87.65%   12.35%       85.19%   14.81%   0.00%
                 W      25             28.00%   72.00%       84.00%   16.00%   0.00%
S03      97      R      78             97.44%   2.56%        94.87%   3.85%    1.28%
                 W      19             73.68%   26.32%       100.0%   0.00%    0.00%
S04      131     R      75             94.57%   5.33%        21.33%   44.00%   34.67%
                 W      56             41.07%   58.93%       5.36%    32.14%   62.5%
S05      151     R      77             87.01%   12.99%       20.78%   50.65%   28.57%
                 W      74             29.73%   70.27%       6.76%    20.27%   72.97%
Total    605     R      398            89.96%   10.05%       80.20%   68.79%   35.48%
                 W      207            41.06%   58.94%       19.80%   31.21%   64.52%

4. Discussion and Conclusions

The study shows some of the variables that contribute to account for the difficulties of conducting an automatic evaluation of prosody in Down syndrome. As shown in Table 2, the chronological age of the participants for whom both the therapist and prosody expert evaluations were available was similar. However, their skills for reasoning, recalling auditory verbal material and understanding vocabulary were clearly different. Phenotype variability is common in Down syndrome [3] and needs to be considered if prosody is to be evaluated. When the developmental level is low, the quality of the prosodic productions is also low. As a result, the likelihood of human agreement as to the appropriateness of the output decreases. This shows the difficulties inherent to the task being carried out. Furthermore, even in the cases of higher cognitive level, variability in the linguistic profile can also play a role. Thus, levels of vocabulary are not necessarily paired with those of prosody perception and production. Differences in the evaluation context also explain the variability between the expert's and therapist's judgments. While the former only based her decisions on intonational criteria, the latter also took into consideration the progress of the player within the video game. In doing so, avoiding frustration was a priority; therefore, levels of frustration tolerance and number of failures influenced the therapist's decisions.

In our video game, not evaluating a right utterance as wrong is very important; otherwise, frustration may arise. This is even more important when individuals with Down syndrome are the players, since they can be particularly prone to this feeling [29]. Therefore, one of the main aims of the video game is to engage and motivate the users. For this, the percentage of false positives must be as low as possible. Table 4 shows that only 10% of the samples evaluated as Right by the expert are classified as Wrong by the classifier. It is future work to reduce the rate of false positives in order to obtain the most reliable evaluation system possible.

The differences between the therapist and prosody expert evaluations highlight the importance of evaluation contexts. If the automatic evaluation module aims to be included in a real time video game, aspects different from prosody should be considered, in the line of what the therapist did in her evaluation. The player profile (among other features related to the progress in the game) should also be incorporated in the system. In addition, the evaluation scale can be improved by adding more dimensions to be scored by the experts. Instead of having a global score of the prosody of a recording, the experts could assign a different score to different prosodic dimensions (intonation, pauses, rhythm), with the aim of making a more precise classification.

The high variability of the speech of individuals with Down syndrome has been evidenced in the experimental results. Further research should compile a bigger and more balanced corpus of the speech of individuals with Down syndrome and record a reference corpus of people with typical development. Nevertheless, inter-speaker variability should be considered as an intrinsic feature of the voices of individuals with Down syndrome, so that both the reference of correctness and the particular limitations of the speaker must be taken into account to attain an effective automatic prosodic assessment.

5. References [20] S. Corral, D. Arribas, P. Santamarı́a, M. Sueiro, and J. Pereña,
“Escala de inteligencia de Wechsler para niños-IV,” Madrid: TEA
[1] G. E. Martin, J. Klusek, B. Estigarribia, and J. E. Roberts, “Lan- Ediciones, 2005.
guage characteristics of individuals with down syndrome,” Topics
in language disorders, vol. 29, no. 2, p. 112, 2009. [21] J. Raven, J. C. Raven et al., Test de matrices progresivas:
manual/Manual for Raven’s progessive matrices and vocabulary
[2] P. A. Eadie, M. Fey, J. Douglas, and C. Parsons, “Profiles of
scalesTest de matrices progresivas. Paidós,, 1993, no. 159.9.
grammatical morphology and sentence imitation in children with
072.
specific language impairment and down syndrome,” Journal of
Speech, Language, and Hearing Research, vol. 45, no. 4, pp. 720– [22] P. Martı́nez-Castilla and S. Peppé, “Developing a test of prosodic
732, 2002. ability for speakers of iberian spanish,” Speech Communication,
vol. 50, no. 11-12, pp. 900–915, 2008.
[3] E. Smith, K.-A. B. Næss, and C. Jarrold, “Assessing pragmatic
communication in children with down syndrome,” Journal of [23] C. González-Ferreras, D. Escudero-Mancebo, M. Corrales-
communication disorders, vol. 68, pp. 10–23, 2017. Astorgano, L. Aguilar-Cuevas, and V. Flores-Lucas, “Engaging
adolescents with down syndrome in an educational video game,”
[4] G. Laws and D. V. Bishop, “Verbal deficits in down’s syndrome
International Journal of Human–Computer Interaction, vol. 33,
and specific language impairment: a comparison,” International
no. 9, pp. 693–712, 2017.
Journal of Language & Communication Disorders, vol. 39, no. 4,
pp. 423–451, 2004. [24] D. R. Ladd, Intonational phonology. Cambridge University
Press, 2008.
[5] R. D. Kent and H. K. Vorperian, “Speech impairment in down
syndrome: A review,” Journal of Speech, Language, and Hearing [25] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent devel-
Research, vol. 56, no. 1, pp. 178–210, 2013. opments in opensmile, the Munich open-source multimedia fea-
[6] B. Heselwood, M. Bray, and I. Crookston, “Juncture, rhythm and ture extractor,” in Proceedings of the 21st ACM international con-
planning in the speech of an adult with down’s syndrome,” Clini- ference on Multimedia. ACM, 2013, pp. 835–838.
cal Linguistics & Phonetics, vol. 9, no. 2, pp. 121–137, 1995. [26] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg,
[7] S. J. Peppé, “Why is prosody in speech-language pathology so E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S.
difficult?” International Journal of Speech-Language Pathology, Narayanan et al., “The Geneva minimalistic acoustic parameter
vol. 11, no. 4, pp. 258–271, 2009. set (GeMAPS) for voice research and affective computing,” IEEE
Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202,
[8] P. Martı́nez-Castilla, M. Sotillo, and R. Campos, “Prosodic abil- 2016.
ities of spanish-speaking adolescents and adults with Williams
syndrome,” Language and Cognitive Processes, vol. 26, no. 8, [27] M. Corrales-Astorgano, D. Escudero-Mancebo, and C. González-
pp. 1055–1082, 2011. Ferreras, “Acoustic characterization and perceptual analysis of the
relative importance of prosody in speech of people with down syn-
[9] S. Peppé, J. McCann, F. Gibbon, A. O’Hare, and M. Rutherford, drome,” Speech Communication, vol. 99, pp. 90–100, 2018.
“Receptive and expressive prosodic ability in children with high-
functioning autism,” Journal of Speech, Language, and Hearing [28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and
Research, vol. 50, no. 4, pp. 1015–1028, 2007. I. H. Witten, “The weka data mining software: an update,” ACM
SIGKDD explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[10] V. Stojanovik, “Prosodic deficits in children with down syn-
drome,” Journal of Neurolinguistics, vol. 24, no. 2, pp. 145–155, [29] J. Grieco, M. Pulsifer, K. Seligsohn, B. Skotko, and A. Schwartz,
2011. “Down syndrome: Cognitive and behavioral functioning across
the lifespan,” in American Journal of Medical Genetics Part C:
[11] O. Saz, S.-C. Yin, E. Lleida, R. Rose, C. Vaquero, and W. R. Seminars in Medical Genetics, vol. 169, no. 2. Wiley Online
Rodrı́guez, “Tools and technologies for computer-aided speech Library, 2015, pp. 135–149.
and language therapy,” Speech Communication, vol. 51, no. 10,
pp. 948–967, 2009.
[12] W. R. Rodrı́guez, O. Saz, and E. Lleida, “A prelingual tool for
the education of altered voices,” Speech Communication, vol. 54,
no. 5, pp. 583–600, 2012.
[13] “Pradia,” http://www.pradia.net, accessed: 2018-07-18.
[14] L. Aguilar and Gutiérrez-González, “Aprendizaje prosódico en un
videojuego educativo dirigido a personas con sı́ndrome de down:
definición de objetivos y diseño de actividades,” Revista de Edu-
cación Inclusiva, under revision.
[15] D. Le, K. Licata, C. Persad, and E. M. Provost, “Automatic as-
sessment of speech intelligibility for individuals with aphasia,”
IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, vol. 24, no. 11, pp. 2187–2199, 2016.
[16] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner,
M. Schuster, and E. Nöth, “Peaks–a system for the automatic eval-
uation of voice and speech disorders,” Speech Communication,
vol. 51, no. 5, pp. 425–437, 2009.
[17] A. Maier, F. Hönig, C. Hacker, M. Schuster, and E. Nöth, “Au-
tomatic evaluation of characteristic speech disorders in children
with cleft lip and palate,” in Ninth Annual Conference of the In-
ternational Speech Communication Association, 2008.
[18] H.-y. Lee, T.-y. Hu, H. Jing, Y.-F. Chang, Y. Tsao, Y.-C. Kao, and
T.-L. Pao, “Ensemble of machine learning and acoustic segment
model techniques for speech emotion and autism spectrum disor-
ders recognition.” in INTERSPEECH, 2013, pp. 215–219.
[19] L. Dunn, L. Dunn, and D. Arribas, “Test de vocabulario en
imágenes peabody,” Madrid: TEA, 2006.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

Santiago Pascual1, Antonio Bonafonte1, Joan Serrà2, Jose A. Gonzalez3

1 Universitat Politècnica de Catalunya, Barcelona, Spain
2 Telefónica Research, Barcelona, Spain
3 Universidad de Málaga, Málaga, Spain

santi.pascual@upc.edu

Abstract

Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises the speech signal, with machine learning techniques that predict prosodic information. In contrast, this paper describes an end-to-end neural approach for estimating a fully-voiced speech waveform from whispered alaryngeal speech. By adapting our previous work in speech enhancement with generative adversarial networks, we develop a speaker-dependent model to perform whispered-to-voiced speech conversion. Preliminary qualitative results show effectiveness in re-generating voiced speech, with the creation of realistic pitch contours.

Index Terms: pitch restoration, whispered speech, generative adversarial networks, alaryngeal speech.

1. Introduction

Whispered speech refers to a form of spoken communication in which the vocal folds do not vibrate and, therefore, there is no periodic glottal excitation. This can be intentional (e.g., speaking in whispers), or a result of disease or trauma (e.g., patients suffering from aphonia after a total laryngectomy). The lack of pitch reduces the expressiveness and naturalness of the voice. Moreover, it can be a serious impediment for speech intelligibility in tonal languages [1] or in the presence of other interfering sources (i.e., the cocktail party problem [2]). The conversion from whispered to voiced speech, either by reconstructing partially existent pitch contours or by generating completely new ones, is an area of research that not only has relevant practical and real-world applications, but also fosters the development of advanced speech conversion systems.

In general, existing methods for whispered-to-voiced speech conversion either follow a data-driven or an analysis-by-synthesis approach. In the data-driven approach, machine learning is used to estimate the pitch from the available speech parameters (e.g., mel frequency cepstral coefficients; MFCCs). Then, a vocoder is used to synthesize speech from those by, for instance, predicting fundamental frequencies and voiced/unvoiced decisions from frame-based spectral information of the whispered signal, using Gaussian mixture models (GMMs) [3, 4, 5] or deep neural networks [6]. The analysis-by-synthesis approach follows a similar methodology to code-excited linear prediction [7, 8, 9]. To estimate pitch parameters, a common strategy is to derive those from other parameters available in the whispered signal, such as estimated speech formants [10].

A key application of whisper-to-voiced speech conversion is to provide individuals with aphonia with a more naturally sounding voice. People who have their larynx removed as a treatment for cancer inevitably lose their voice. To speak again, laryngectomees can resort to a number of methods, such as the voice valve, which produces an unnatural, whispered sound, or the electrolarynx, a vibrating device placed against the neck that generates the lost glottal excitation but, nonetheless, produces a robotic voice due to its constant vibration. In recent years, whisper-to-speech reconstruction methods [3, 4, 5, 11], or silent speech interfaces [12, 6, 13], in which an acoustic signal is synthesised from non-audible speech-related biosignals such as the movements of the speech organs, have started to be investigated to provide laryngectomees with a better and more naturally sounding voice.

In this paper, and in contrast to previous approaches, we present a speaker-dependent end-to-end model for voiced speech generation based on generative adversarial networks (GANs) [14]. With an end-to-end model directly performing the conversion between waveforms, we avoid the explicit extraction of spectral information, the error-prone prediction of intermediate parameters like pitch, and the use of a vocoder to synthesize speech from such intermediate parameters. With a GAN learning the mapping from whispered to voiced speech, we avoid the specification of an error loss over raw audio and make our model truly generative, thus being able to produce new realistic pitch contours. Our results show this novel pitch generation as an implicit process of waveform restoration. To evaluate our proposal, we compared the pitch contour distributions it predicts with those obtained by a regression model, observing that our approach attains more natural pitch contours, including a more realistic variance factor that relates to more expressiveness.

The remainder of the paper is structured as follows. In section 2 we describe the method used to restore the voiced speech with GANs. The experimental setup is described in section 3, which includes descriptions of the dataset, an RNN baseline and the hyper-parameters for our GAN model. Finally, sections 4 and 5 contain the discussion of results and the conclusions, respectively.

2. Generative Adversarial Networks for Voiced Speech Restoration

The proposed model is an improvement over our previous work on speech enhancement using GANs (SEGAN) [15, 16] in order to handle speech reconstruction tasks.

SEGAN was designed as a speaker- and noise-agnostic model to generate clean/enhanced versions of aligned noisy speech signals. From now on, we change signal names for the new task, so we rather work with natural, voiced (i.e. restored) and whispered speech signals. To adapt the architecture to the task of voiced speech restoration, we decide to remove the audio alignment requirement, as the data we use has slight misalignments between input and output speech (see Section 3.1 for more details). In addition, we introduce a number of improvements that consistently stabilize and facilitate its training after direct regularization over the waveform is removed. These modifications also refine the generated quality at the generator output when regression is removed.

2.1. SEGAN

We now outline the most basic aspects of SEGAN, specifically highlighting the ones that have been subject to change. For the sake of brevity we refer the reader to the original paper and code [15] for more detailed explanations on the old architecture and setup. The SEGAN generator network (G) embeds an input noisy waveform chunk into the latent space via a convolutional encoder. Then, the reconstruction is made in the decoder by 'deconvolving' back the latent signals into the time domain. G features skip connections with constant factors (acting as identity functions) to promote that low-level features can escape a potentially unnecessary compression from the encoder. Such skip connections also improve training stability, as they allow gradients to flow better across the deep structure of G, which has a total of 22 layers. In the denoising setup, an L1 regularization term helped center output predictions around 0, discouraging G from exploring bizarre amplitude magnitudes that could make the discriminator network (D) converge to easy discriminative solutions for the fake adversarial case.

Figure 1: Generator network architecture. Skip connections with learnable a_l are depicted with purple boxes. These are summed to each intermediate activation of the decoder. Encoder and decoder are like in the original SEGAN [15], but with half the amount of layers and doubled pooling per layer.

2.2. Adapted SEGAN

The SEGAN architecture has been adapted to cope with misalignments in the input/output signals as mentioned before, as well as to achieve a more stable architecture and to produce better quality outputs. In the current setup, similarly to the original SEGAN mechanism, we inject whisper data into G, which compresses it and then recovers a version of the utterance with prosodic information. To cope with misalignments, we get rid of the L1 regularization term, as this was forcing a one-to-one correspondence between audio samples, assuming input and output had the same phase. In its place we use a softer regularization which works in the spectral domain, similar to the one used in the parallel Wavenet [17]. We use a non-averaged version of this loss though, as we work with large frames during training (16,384 samples per sequence), and averaging the spectral frames over this large span could be ineffective. Moreover, we calculate the loss as an absolute distance in decibels between the generated speech and the natural one.
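A minimal sketch of such a spectral regularization term is given below, assuming PyTorch tensors of shape (batch, samples); the frame sizes and the exact dB formulation are illustrative choices, not the authors' implementation.

```python
# Sketch: absolute distance in dB between generated and natural magnitude spectra.
import torch

def spectral_db_loss(generated, natural, n_fft=512, hop=128, eps=1e-7):
    win = torch.hann_window(n_fft, device=generated.device)
    g_mag = torch.stft(generated, n_fft, hop, window=win, return_complex=True).abs()
    n_mag = torch.stft(natural, n_fft, hop, window=win, return_complex=True).abs()
    g_db = 20.0 * torch.log10(g_mag + eps)
    n_db = 20.0 * torch.log10(n_mag + eps)
    return (g_db - n_db).abs().sum()   # summed, i.e. non-averaged over frames

# total generator loss would be: adversarial_loss + lambda_ * spectral_db_loss(voiced, natural)
```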
The spectral regularization is added to the adversarial loss coming from D with a weighting factor λ. In SEGAN, D is a learnable comparative loss function between natural or voiced signals and whispered ones. This means we have a (natural, whispered) paired input as a real batch sample and (voiced, whispered) as a fake batch sample. In contrast, G has to make (voiced, whispered) true, this being the adversarial objective. In the current setup, we add an additional fake signal in D that will enforce the preservation of intelligibility when we forward data through G: the (natural, random natural shuffle) pair. This pair tries to send messages to G about a bad behavior whenever the content between both chunks, the one coming from G and the reference one, changes. Note that we are using the least-squares GAN form (LSGAN) in the adversarial component, so our loss functions, for D and G respectively, become

\min_D V(D) = \tfrac{1}{3}\,\mathbb{E}_{x,\tilde{w} \sim p_{\mathrm{data}}(x,\tilde{w})}\big[(D(x,\tilde{w}) - 1)^2\big] + \tfrac{1}{3}\,\mathbb{E}_{z \sim p_z(z),\, \tilde{w} \sim p_{\mathrm{data}}(\tilde{w})}\big[D(G(z,\tilde{w}),\tilde{w})^2\big] + \tfrac{1}{3}\,\mathbb{E}_{x, x_r \sim p_{\mathrm{data}}(x)}\big[D(x, x_r)^2\big]

\min_G V(G) = \mathbb{E}_{z \sim p_z(z),\, \tilde{w} \sim p_{\mathrm{data}}(\tilde{w})}\big[(D(G(z,\tilde{w}),\tilde{w}) - 1)^2\big],

where \tilde{w} ∈ R^T is the whispered utterance, x ∈ R^T is the natural speech, x_r ∈ R^T is a randomly chosen natural chunk within the batch, G(z, \tilde{w}) ∈ R^T is the enhanced speech, and D(x, \tilde{w}), D(G(z, \tilde{w}), \tilde{w}) and D(x, x_r) are the discriminator decisions for each input pair. All of these signals are vectors of length T samples except for the D outputs, which are scalars. T is a hyper-parameter fixed during training but it is variable during test inference.
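The following PyTorch sketch mirrors the two objectives above; D is assumed to take a pair of waveforms and return one scalar per example, and G to map (z, whispered) to a voiced waveform, so the function signatures are illustrative assumptions.

```python
# Sketch of the three-term LSGAN objective described above.
import torch

def d_loss(D, G, x, w, x_r, z):
    real = D(x, w)                      # (natural, whispered)            -> target 1
    fake = D(G(z, w).detach(), w)       # (voiced, whispered)             -> target 0
    mismatch = D(x, x_r)                # (natural, shuffled natural)     -> target 0
    return (((real - 1) ** 2).mean() + (fake ** 2).mean() + (mismatch ** 2).mean()) / 3.0

def g_loss(D, G, w, z):
    return ((D(G(z, w), w) - 1) ** 2).mean()
```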
After removing the regularization factor L1, the generator output can explore large amplitudes whilst adapting to mimic the speech distribution. As a matter of fact, this collapsed the training whenever the tanh activation was placed in the output layer of G to bound its output to [−1, 1], because the amplitude grew quickly with aggressive gradient updates and tanh would not allow G to properly update anymore due to saturation. The way to correct this was bounding the gradient of D by applying spectral normalization as proposed in [18]. The discriminator does not have any batch normalization technique in this implementation, and its architecture is the same as in our previous work.
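In PyTorch this corresponds to wrapping the discriminator's convolutions with spectral normalization; the channel sizes and activation below are assumptions for illustration, not the paper's exact layer configuration.

```python
# Sketch: spectrally normalized discriminator convolutions, as in [18].
import torch.nn as nn
from torch.nn.utils import spectral_norm

disc_front = nn.Sequential(
    spectral_norm(nn.Conv1d(2, 64, kernel_size=31, stride=4, padding=15)),   # pair of waveforms in
    nn.LeakyReLU(0.3),
    spectral_norm(nn.Conv1d(64, 128, kernel_size=31, stride=4, padding=15)),
    nn.LeakyReLU(0.3),
)
```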

Figure 2: From left to right: natural speech, whispered speech (input to G) and output from G as voiced signal.

The new design of G is shown in Figure 1. It remains a fully convolutional encoder-decoder structure with skip connections, but with two changes. First, we reduce the number of layers by augmenting the pooling factor from 2 to 4 at every encoder-decoder layer. This is in line with preliminary experiments on the denoising task, where increasing pooling has been effective to improve objective scores for that task. Second, we introduce learnable skip connections, and these are now summed instead of concatenated to the decoder feature maps. We thus now have learnable vectors a_l which multiply every channel of their corresponding shuttle layer l by a scalar factor α_{l,k}. These factors are all initialized to one. Hence, at the j-th decoder layer input we have the addition of the l-th encoder layer response

h_j = h_{j-1} + a_l ⊙ h_l,

where ⊙ is an element-wise product along channels.
To evaluate the performance of our technique, a clinical appli-
cation involving the generation of audible speech from captured
3.2. SEGAN Setup
movement of the speech articulators is tested. More details
about the experimental setup in terms of dataset, baseline and We use the same kernel widths of 31 as we had in [15], both
hyper-parameters for our proposed approach are given below. when encoding and decoding and for both G and D networks.
The feature maps are incremental in the encoder and decremen-
3.1. Task and Dataset tal in the decoder, having {64, 128, 256, 512, 1024, 512, 256,
In our previous work [6, 13], a silent speech system aimed at 128, 64, 1} in the generator and {64, 128, 256, 512, 1024}
helping laryngectomy patients to recover their voices was de- in the discriminator convolutional structures. The discrimina-
scribed. The system comprised an articulator motion capture tor has a linear layer at the end with a single output neuron, as
device [19], which monitored the movement of the lips and in the original SEGAN setup. The latent space is constructed
T
tongue by tracking the magnetic field generated by small mag- with the concatenation of the thought vector c ∈ R 1024 ×1024
T ×1024
nets attached to them, and a synthesis module, which generated with the noise vector z ∈ R 1024 , where z ∼ N (0, I).
speech from articulatory data. To generate speech acoustics, re- Both networks are trained with Adam [21] optimizer, with the
current neural networks (RNNs) trained on parallel articulatory two-timescale update rule (TTUR) [22], such that D will have
and speech data were used. The speech produced by this system a four times faster learning rate to virtually emulate many iter-
had a reasonable quality when evaluated on normal speakers, ations in D prior to updating G. This way, we have D learn-
but it was not completely natural owing to limitations when es- ing rate 0.0004 and G learning rate 0.0001, with β1 = 0 and
timating the pitch (i.e., the capturing device did not have access β2 = 0.9, which are the same schedules based on recent suc-
to any information about the glottal excitation). cessful approaches to faster and stable convergent adversarial
In this work, we are interested on determining whether the training [23]. All signals processed by the GAN, either in
proposed adapted SEGAN could improve those signals by gen- the input/output of G or the input of D, are pre-emphasized
erating more natural and realistic prosodic contours. To evaluate with a 0.95 factor, as it proved to help coping with some high-
this, we have articulatory and speech data available, recorded frequency artifacts in the de-noising setup. When we generate
simultaneously for 6 healthy British subjects (2 females and voiced data out of G we de-emphasize it with the same factor to
4 males). Each speaker has recorded a random subset of the get the final result.

Figure 3: Histograms of pitch values in Hertz per utterance for male speaker (left) and female speaker (right). The three systems
appearing are natural signals; RNN baseline voiced predictions with vocoder features; and voiced speech using SEGAN.

Figure 4: Section of pitch contour of a test utterance of the male speaker calculated with Ahocoder from 4 different sources: Natural
data (blue); RNN baseline (orange); Voiced with seed 100 (green) and voiced with seed 200 (red). Here it is shown how changing the
seed indeed creates different plausible contours.

3.3. Baseline

To assess the performance of SEGAN in this task we have as reference the RNN-based articulatory-to-speech system from our previous work [13] and the natural data for each modeled speaker. The recurrent model is used to predict both the spectral (i.e., MFCCs) and pitch parameters (i.e., fundamental frequency, aperiodicities and unvoiced-voiced decision) from the articulatory data, so the source is articulatory data and not whispered speech in that case. The STRAIGHT vocoder [24] is then employed to synthesise the waveform from the predicted parameters.
the potential reasons of the effectiveness of using pre-emphasis.
4. Results We refer the reader to the audio samples to have a feeling of the
current quality of our system 1 .
We analyze the statistics of the generated pitch contours for
the RNN, SEGAN and natural data. Figure 3 depicts the his- 5. Conclusions
tograms of all contours extracted from predicted/natural wave-
forms. Ahocoder [25] was used to extract logF0 curves, which We presented a speaker-dependent end-to-end generative ad-
are then converted to Hertz scale. Then, all voiced frames were versarial network to act as a post-filter of whispered speech
selected and concatenated per each of the three systems. We to deal with a pathological application. We adapted our pre-
come up with a long stream for each system and for the two vious speech enhancement GAN architecture to overcome mis-
genders. It can seen that, for both genders, voiced histograms alignment issues and still obtained a stable GAN architecture
(corresponding to SEGAN) have a broader variance than RNN to reconstruct voiced speech. The model is able to generate
ones, closer to the natural signal shape. This is understandable novel pitch contours by only seeing the whispered version of
if we consider that the RNN was trained with a regression crite- the speech at its input. The method generates richer curves than
rion that optimizes its output towards the mean of the pitch dis- the baseline, which sounds monotonic in terms of prosody. Fu-
tribution. This ends up producing a monotonic prosody effect, ture lines of work include an even more end-to-end approach by
normally manifested as a robotic sounding that can be heard going sensor-to-speech. Also, further study is required to alle-
in the audio samples referenced below. This indicates that the viate intrinsic high frequency artifacts provoked by the type of
adversarial procedure can generate more natural pitch values. decimation-interpolation architecture we base our design on.
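The pooled-histogram comparison above can be reproduced with a few lines of code. The sketch below is only an illustration of the procedure, not the authors' script: it assumes that per-utterance logF0 contours have already been extracted (e.g., with Ahocoder) into plain-text files with one value per frame, and that unvoiced frames are marked by a large negative value; the directory layout and the unvoiced marker are assumptions.

import glob
import numpy as np

def pooled_voiced_f0(pattern, unvoiced_marker=-1e5):
    """Concatenate the voiced F0 values (in Hz) of every contour matching pattern."""
    chunks = []
    for path in sorted(glob.glob(pattern)):
        lf0 = np.loadtxt(path)               # one logF0 value per frame
        voiced = lf0[lf0 > unvoiced_marker]  # drop unvoiced frames
        chunks.append(np.exp(voiced))        # logF0 -> Hz
    return np.concatenate(chunks)

# Hypothetical directory layout: one folder of contours per system.
for system in ("natural", "rnn_baseline", "segan"):
    f0 = pooled_voiced_f0(f"lf0/{system}/*.lf0")
    print(f"{system}: mean = {f0.mean():.1f} Hz, std = {f0.std():.1f} Hz")

A broader standard deviation for SEGAN than for the RNN baseline, close to that of the natural data, is the numerical counterpart of the histogram shapes discussed above.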
5. Conclusions

We presented a speaker-dependent end-to-end generative adversarial network that acts as a post-filter of whispered speech for a pathological application. We adapted our previous speech enhancement GAN architecture to overcome misalignment issues and still obtained a stable GAN that reconstructs voiced speech. The model is able to generate novel pitch contours by only seeing the whispered version of the speech at its input. The method generates richer curves than the baseline, which sounds monotonic in terms of prosody. Future lines of work include an even more end-to-end approach by going sensor-to-speech. Further study is also required to alleviate the intrinsic high-frequency artifacts caused by the decimation-interpolation architecture our design is based on.

6. Acknowledgements

This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).

1 http://veu.talp.cat/whispersegan/
7. References

[1] J. Chen, H. Yang, X. Wu, and B. C. Moore, "The effect of F0 contour on the intelligibility of speech in the presence of interfering sounds for Mandarin Chinese," The Journal of the Acoustical Society of America, vol. 143, no. 2, pp. 864-877, 2018.
[2] S. Popham, D. Boebinger, D. P. Ellis, H. Kawahara, and J. H. McDermott, "Inharmonic speech reveals the role of harmonicity in the cocktail party problem," Nature Communications, vol. 9, no. 1, p. 2122, 2018.
[3] T. Toda, A. W. Black, and K. Tokuda, "Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model," Speech Communication, vol. 50, no. 3, pp. 215-227, 2008.
[4] K. Nakamura, M. Janke, M. Wand, and T. Schultz, "Estimation of fundamental frequency from surface electromyographic data: EMG-to-F0," in Proc. of ICASSP. IEEE, 2011, pp. 573-576.
[5] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, vol. 54, no. 1, pp. 134-146, 2012.
[6] J. A. Gonzalez, L. A. Cheah, A. M. Gomez, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Direct speech reconstruction from articulatory sensor data by machine learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2362-2374, 2017.
[7] R. W. Morris and M. A. Clements, "Reconstruction of speech from whispers," Medical Engineering and Physics, vol. 24, no. 7, pp. 515-520, 2002.
[8] F. Ahmadi, I. V. McLoughlin, and H. R. Sharifzadeh, "Analysis-by-synthesis method for whisper-speech reconstruction," in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on. IEEE, 2008, pp. 1280-1283.
[9] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, "Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec," IEEE Transactions on Biomedical Engineering, vol. 57, no. 10, pp. 2448-2458, 2010.
[10] J. Li, I. V. McLoughlin, and Y. Song, "Reconstruction of pitch for whisper-to-speech conversion of Chinese," in Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on. IEEE, 2014, pp. 206-210.
[11] A. K. Fuchs and M. Hagmüller, "Learning an artificial F0-contour for ALT speech," in Proc. of Interspeech, 2012.
[12] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, "Silent speech interfaces," Speech Communication, vol. 52, no. 4, pp. 270-287, 2010.
[13] J. A. Gonzalez, L. A. Cheah, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Evaluation of a silent speech interface based on magnetic sensing and deep learning for a phonetically rich vocabulary," in Proc. of Interspeech, 2017, pp. 3986-3990.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672-2680.
[15] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. of Interspeech, 2017, pp. 3642-3646.
[16] S. Pascual, M. Park, J. Serrà, A. Bonafonte, and K.-H. Ahn, "Language and noise transfer in speech enhancement generative adversarial network," in Proc. of ICASSP. IEEE, 2018, pp. 5019-5023.
[17] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv preprint arXiv:1711.10433, 2017.
[18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[19] M. J. Fagan, S. R. Ell, J. M. Gilbert, E. Sarrazin, and P. M. Chapman, "Development of a (silent) speech recognition system for patients following laryngectomy," Medical Engineering & Physics, vol. 30, no. 4, pp. 419-425, 2008.
[20] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004, pp. 223-224.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6626-6637.
[23] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," arXiv preprint arXiv:1805.08318, 2018.
[24] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.
[25] D. Erro, I. Sainz, E. Navas, and I. Hernaez, "Harmonics plus noise model based vocoder for statistical parametric speech synthesis," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 184-194, Apr. 2014.
[26] C. Donahue, J. McAuley, and M. Puckette, "Synthesizing audio with generative adversarial networks," arXiv preprint arXiv:1802.04208, 2018.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

LSTM based voice conversion for laryngectomees


Luis Serrano, David Tavarez, Xabier Sarasola,
Sneha Raman, Ibon Saratxaga, Eva Navas, Inma Hernáez

Aholab
University of the Basque Country (UPV/EHU), Bilbao, Spain
{lserrano, david, xsarasola, sneha, ibon, eva, inma}@aholab.ehu.eus

Abstract

This paper describes a voice conversion system designed with the aim of improving the intelligibility and pleasantness of oesophageal voices. Two different systems have been built, one to transform the spectral magnitude and another one for the fundamental frequency, both based on DNNs. Ahocoder has been used to extract the spectral information (mel cepstral coefficients), and a specific pitch extractor has been developed to calculate the fundamental frequency of the oesophageal voices. The cepstral coefficients are converted by means of an LSTM network. The conversion of the intonation curve is implemented through two different LSTM networks, one dedicated to the voiced/unvoiced detection and another one to the prediction of F0 from the converted cepstral coefficients. The experiments described here involve conversion from one oesophageal speaker to a specific healthy voice. The intelligibility of the signals has been measured with a Kaldi-based ASR system, and a preference test has been carried out to evaluate the subjective preference for the converted voices compared with the original oesophageal voice. The results show that spectral conversion improves ASR, while restoring the intonation is preferred by human listeners.

Index Terms: voice conversion, speech and voice disorders, alaryngeal voices, speech intelligibility

1. Introduction

Laryngectomees are persons whose larynx has been surgically removed. The larynx is a fundamental organ in the production of speech, since the vocal folds are located inside it. In spite of this, it is still possible to utter intelligible speech using alternative vibrating elements. The acquired new voice is called alaryngeal. There are three main ways to produce the alaryngeal voice: oesophageal (ES), electrolaryngeal (EL) and tracheoesophageal (TES) speech. The experiments described in this paper use only oesophageal speech.

Unlike the other two methods, the production of ES does not require any device. This kind of speech is learned with the help of a speech therapist. In this method the pharyngo-oesophageal segment is used as a substitute vibrating element for the vocal folds. Due to the nature of the intervention, the air used to create the vibration of the oesophagus cannot come from the lungs and the trachea, as happens during normal speech production. Instead, the air is swallowed from the mouth and introduced into the oesophagus, being then expelled in a controlled way while producing the vibration.

These huge differences in the production mechanisms lead to a diminution of naturalness and intelligibility [1, 2, 3]. As a consequence, communication with others is hindered. Moreover, these less intelligible voices are an added problem for the automatic speech recognition algorithms that are becoming ubiquitous in human-computer interaction technologies. The work presented in this paper aims at improving the quality and intelligibility of ES, with the final purpose of contributing to a better life for the laryngectomee.

There have been different approaches to enhance the quality and the intelligibility of alaryngeal voices. Some research works use the source-filter analysis of the pathological signal and focus on modifications of the source, the filter or both. An example of this approach can be found in [4], where an adaptive gain equalizer algorithm is used to modify the source of ES, or in [5], where a reconstruction of normal-sounding speech for laryngectomy patients is attempted through a modified CELP codec. Another approach is to work with the prosodic elements. In [6], the pitch information extracted from an electroglottograph (EGG) is used to create a synthetic glottal signal, reducing jitter and shimmer; additionally, spectral smoothing and tilt correction were applied. These modifications reduced the harshness and breathiness of the TE speech. The same authors describe in [7] a method for repairing the durations of the pathological phonemes. In the same line, [8] presents a system where the concatenation of randomly chosen healthy reference patterns replaces the pathological excitation, adjusting the short-, medium- and long-term variability of the pitch.

A different approach to the problem is to make use of a voice conversion (VC) system. In a VC system the pursued goal is to transform utterances from a given source speaker into a specific target speaker, i.e., to apply techniques so that the sentences are perceived as uttered by the specific target speaker. For the purpose of enhancing alaryngeal voices, a healthy speaker is chosen as target speaker. Different examples of techniques based on statistical voice conversion can be found in [9], [10] or [11], where the characteristics of the target speaker can be tuned to obtain a more personalized converted voice.

The VC process can be divided into two stages. Firstly, a training phase is needed in order to learn the correspondences between source and target acoustic features; these learned relationships are then stored in the form of a conversion function. The second step is the conversion itself: the conversion function is applied to transform new input utterances from the source speaker. Although the identity of the speaker is also contained in the suprasegmental (prosody) and even linguistic features, VC research has focused mostly on mapping spectral features [12, 13, 14].

VC is a field that has been researched for a long time; an exhaustive recent review can be found in [15]. A wide variety of approaches have been proposed to obtain the conversion function: from codebooks [16, 17] and hidden Markov models [18, 19, 20] to Gaussian Mixture Models (GMMs) [21, 12, 22, 23] or Gaussian processes [24]. In the last years, special focus has been given to shallow/deep neural network (S/DNN) solutions [25, 26, 14, 27, 28].

10.21437/IberSPEECH.2018-26
In this paper we propose to take advantage of the recent advances in machine learning techniques to train a deep learning system that converts the audio of an oesophageal speaker into that of a healthy speaker. Although the literature focuses especially on spectral conversion, the importance of prosody in oesophageal speech cannot be left out. A good intonation is important for the utterances to be perceived as more natural and pleasant. Therefore, the proposed system will not only convert the spectral features but will also estimate f0. The spectral and f0 conversion contributions have then been evaluated separately by means of word error rate (WER) and a perceptual test.

2. Voice conversion system


The proposed architecture uses two parallel DNN-based systems to convert separately the spectral envelope and the fundamental frequency curve f0. They are described in detail in the following subsections.

The acoustic analysis is performed using Ahocoder [29]. This vocoder performs a harmonic analysis of the speech audio files every 5 ms, extracting the Mel-cepstral (MCEP) representation of the spectral envelope, logf0 and the maximum voiced frequency (MVF). The MVF is related to the harmonicity of the signal and is not converted (the MVF value obtained from the source signal is used). To extract f0 from the alaryngeal signals, the default method of Ahocoder has been replaced by a more specific method designed to cope with the irregularities present in these signals. This strategy mainly consists in applying the autocorrelation method for pitch extraction over the residual signal of the PSIAIF analysis [30]. In this work, 25 MCEP coefficients in combination with their first-order derivatives are used for spectral conversion. In the experiment described here, the 0th coefficient (related to the energy) is not converted but directly copied from the source to the target. Similarly, the source MVF is not modified and is directly copied to the target. Other alternatives, such as using a constant MVF, were also tried but showed no significant differences in informal tests. For pitch prediction, 40 MCEP coefficients are used. This number was chosen due to quality reasons observed during the training stages.

For the network architecture, the chosen approach was a Long Short-Term Memory (LSTM) architecture. This type of neural network allows the net to retain (and gradually forget) information about previous time instants, taking advantage of the strong temporal dependency that exists between consecutive frames in the speech signals, especially for the f0 curve. The neural networks are implemented in Python using the Keras library.

2.1. Spectral conversion

Before training the LSTM network, it is necessary to align the source and target utterances. Due to the characteristics of oesophageal speech, there is a very important mismatch between the healthy and oesophageal signals, which makes it inadequate to apply a dynamic time warping (DTW) algorithm directly [31]. This is why both signals were labelled at phone level, and then the iterative alignment procedure described in [32] and [33] was applied to each pair of oesophageal and healthy phones.

With the alignment results, the inputs and outputs of the network are built. The corresponding first-order derivatives are appended to the 24 source aligned cepstra (c0 and its derivative excluded). All vectors from all sentences are appended one after the other, resulting in two matrices for the source-target speaker pair. Then, mean and variance normalization is applied to each dimension of the cepstrum vector and its derivative. In addition, a voiced/unvoiced decision vector in the form of 0 or 1, obtained from the logf0 values, is appended, so the dimension of the input is finally 49. The resulting matrix is divided into sequences of 50 frames before entering the net.

The net predicts the 48 MCEP coefficients of the target speaker and the voiced/unvoiced decision vector. The metric we seek to minimize is the MSE.

The number of cells of the LSTM layer is 100. As stated before, each batch sequence is composed of 50 frames, and the input dimension is equal to 49. The output is obtained from a fully connected layer and gives a sequence of 50 frames of dimension 49 for each input sequence.

We used the RMSProp optimizer (which divides the gradient by a running average of its recent magnitude) with a learning rate of 0.0001. To avoid overfitting, a dropout ratio of 0.5 was used. We trained the network for 60 epochs.

The performance of the network has been analysed by comparing the predicted result with the target data. As the order of the coefficients goes up, the similarity between the predicted MCEP and the target one goes down. This can be seen by calculating the MSE for each normalized cepstral coefficient between the predicted and target MCEPs (Figure 1).

Figure 1: MSE of each normalized MCEP coefficient with the 95% confidence interval for one fold - 10 sentences (test data).

2.1.1. Maximum Likelihood Parameter Generation (MLPG)

In order to solve the problem of the over-smoothing present in the conversion process, we apply the maximum likelihood parameter generation algorithm [12]. We consider the MCEPs obtained from the LSTM spectral conversion network as the mean vectors of a Gaussian distribution. As in [34], the covariance matrix is obtained from the squared error between the original features and the predicted converted vectors.

The global variance (GV) is calculated from the target training data and is used in the MLPG to do the spectral conversion along with the mean vectors and covariance matrices for each test sentence.
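As a concrete illustration of the spectral conversion network of Section 2.1, the following minimal Keras sketch uses the hyperparameters stated in the text (100 LSTM cells, 50-frame sequences of 49-dimensional vectors, RMSProp with learning rate 0.0001, dropout 0.5, MSE loss, 60 epochs). It is only one plausible rendering, not the authors' code; the exact placement of the dropout layer, the per-frame dense head and the Keras API version are assumptions.

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed
from keras.optimizers import RMSprop

SEQ_LEN, FEAT_DIM = 50, 49   # 24 MCEPs + 24 deltas + V/UV flag

model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(SEQ_LEN, FEAT_DIM)))
model.add(Dropout(0.5))                        # regularization mentioned in the text
model.add(TimeDistributed(Dense(FEAT_DIM)))    # 48 converted MCEPs + V/UV per frame
model.compile(optimizer=RMSprop(lr=1e-4), loss='mse')

# x_src and y_tgt would be arrays of shape (n_sequences, 50, 49) built from the
# aligned and normalized cepstra (illustrative names):
# model.fit(x_src, y_tgt, epochs=60)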
Figure 2: Architecture of the f0 estimation system for the training and conversion stages.

2.2. Fundamental frequency estimation

Two different networks have been used to perform the estimation of the fundamental frequency: one for the prediction of f0 and another one for the U/V decision estimation (see Figure 2). In both cases, the LSTM model is trained to map the relationship between the healthy target MCEPs and their corresponding f0 sequence. Mean and variance normalization is applied to each dimension of the cepstrum vector. Logf0 is linearly interpolated in unvoiced frames, and mean and variance normalization is also applied in this case. Development experiments showed that the use of 40 MCEP coefficients provided better results for the U/V estimation than using only 25 coefficients; therefore, 40 MCEP coefficients (no derivatives) are used for the intonation modelling.

For the U/V decision we use one bidirectional LSTM layer with 64 cells. A fully connected layer with sigmoid activation provides the final output. The network is optimized using the Adam algorithm with minibatches of size 10 and trained for 60 epochs. A dropout ratio of 0.2 was applied as regularization. We employ binary cross-entropy as the loss function.

The same configuration is used for the f0 prediction: one bidirectional LSTM layer with 64 cells followed by a fully connected layer, with linear activation in this case. The MSE is again the metric we seek to minimize, as for the cepstrum network.

In the conversion stage, 40 converted MCEP coefficients are needed. Therefore another network has been trained, similar to the one used for spectral conversion, to obtain the necessary number of coefficients. Finally, the logf0 is estimated from the converted MCEPs by the two conversion networks.

2.3. Training and testing data

Due to the nature of the data, the amount of data available for training and testing the conversion system is scarce. The parallel data used are the recordings of 100 phonetically balanced sentences selected from a bigger corpus [35], made by one healthy speaker and one oesophageal speaker. Out of the 100 utterances, 90 are chosen for training and 10 for testing the system, using 10-fold cross-validation.

3. Evaluation

3.1. Objective evaluation: Kaldi ASR

To have an objective measure (WER) of the effect of the conversion, we have prepared an automatic speech recognizer (ASR) for Spanish using the Kaldi toolkit [36]. It is implemented following the s5 recipe for the Wall Street Journal database. The acoustic features used are 13 Mel-Frequency Cepstral Coefficients (MFCCs), to which a process of mean and variance normalization (CMVN) is applied to mitigate the effects of the channel.

The training begins with a flat-start initialization of context-independent phonetic Hidden Markov Models (HMMs), and then a series of accumulative trainings are done. For the final step of the recognizer, a neural network is trained. The input features to the neural network consist of a series of 40-dimensional features. The network sees a window of these features, with 4 frames on each side of the central frame. The features are derived by processing the conventional 13-dimensional MFCCs. The necessary steps are described in [37] and basically consist in applying a series of transformations to the normalized cepstra: first linear discriminant analysis (LDA), then maximum likelihood linear transform (MLLT) and global feature-space maximum likelihood linear regression (fMLLR). At the recognition stage, the same transformations are applied to the test data, handling them as a block.

The audio material used to train the ASR is described in [38]. However, although the acoustic models have been maintained, the lexicon has been created from the 100-sentence corpus used in the experiment (701 words). This has been done because, using the original lexicon (with 37,632 entries), as much as 23% of the words were out-of-vocabulary (OOV) words. This is due to the fact that the sentences are phonetically balanced, and many sentences containing proper names and unusual words were chosen to maximize the variability of the phonetic content. Together with this reduced lexicon, a unigram language model with equally probable words has been used. Although the final WER numbers obtained this way are not comparable to a realistic ASR situation, the procedure serves our purpose of evaluating the ability of the conversion system to improve (or not) the performance of any ASR system.

Three different recognition experiments have been carried out. The first one performs the recognition of the 100 original sentences recorded by the laryngectomee speaker. The second one evaluates the audio signals resynthesized with Ahocoder using the original cepstra and the estimated f0 obtained from the network. The last one uses as input the 100 sentences resynthesized with the converted cepstrum and the estimated f0.

The results are shown in Table 1. The WER of the original oesophageal signal is very high (56.93% vs the 10.15% obtained for the target healthy speaker), showing the difficulty of the task. As expected, the system performs very similarly when only f0 is changed; the small differences are probably due to the resynthesis process and the recalculation of the parameters from the waveforms performed by Kaldi. However, when the recognized signals are those resynthesized using the converted cepstrum and f0, the WER drops by 15% in absolute terms, showing that some of the problems present in the oesophageal spectrum are corrected by the conversion system. The WER of the converted signals is, though, still far from the performance obtained for the healthy voice.
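The two f0-estimation networks of Section 2.2 share the same topology and differ only in their output activation and loss. The sketch below is a minimal, hypothetical Keras rendering of that description (bidirectional LSTM with 64 cells, dropout 0.2, Adam, minibatches of 10, 60 epochs); the dropout placement and the per-frame dense head are assumptions, and it is not the authors' code.

from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, Dropout, TimeDistributed

def f0_subnet(output_activation, loss):
    """Build one of the two intonation networks: U/V classifier or logF0 regressor."""
    model = Sequential()
    model.add(Bidirectional(LSTM(64, return_sequences=True),
                            input_shape=(None, 40)))        # 40 MCEPs per frame
    model.add(Dropout(0.2))
    model.add(TimeDistributed(Dense(1, activation=output_activation)))
    model.compile(optimizer='adam', loss=loss)
    return model

uv_net = f0_subnet('sigmoid', 'binary_crossentropy')   # voiced/unvoiced decision
lf0_net = f0_subnet('linear', 'mse')                    # interpolated logF0 prediction

# Both networks are trained for 60 epochs with minibatches of 10 sequences, e.g.:
# uv_net.fit(mcep_sequences, uv_targets, batch_size=10, epochs=60)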

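Since the objective evaluation reported in Table 1 is expressed as word error rate, a compact reference implementation may help fix ideas. The function below computes WER as the word-level edit distance between a reference and a hypothesis; it is an illustrative helper, not the Kaldi scoring script used in the experiments.

def wer(reference, hypothesis):
    """Word error rate (%) as edit distance between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("el gato come pescado", "el gato bebe pescado"))   # 25.0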
Table 1: WER results for the different experiments.

    Case                                   WER (%)
    original healthy                        10.15
    original oesophageal                    56.93
    original spectrum + estimated f0        57.91
    converted spectrum + estimated f0       41.48

Figure 3: Averaged preference results with confidence intervals (-2: I strongly prefer sentence 1, -1: I prefer sentence 1, 0: I don't have any preference, 1: I prefer sentence 2, 2: I strongly prefer sentence 2).

3.2. Subjective evaluation: perceptual test

A perceptual listening test has been designed in order to evaluate the degree of preference for each processing technique (including no processing at all) by human evaluators. In this test, a total of 18 random sentences are selected for each listener, and for each of the sentences two of the three cases (original oesophageal (ORIGINAL), original spectrum + estimated f0 (F0), and converted spectrum + estimated f0 (F0 & SPEC)) are presented to the listener. He or she is asked to rate his or her preference for one of the two cases by giving a score on a five-point scale.

A total of 31 native Spanish speakers took part in the test. Figure 3 shows the averaged preference results of comparing the three systems in pairs. As can be seen, the preferred system is the one with the original spectrum and modified intonation. The system that converts only f0 is clearly preferred over the system with converted spectrum and f0. Additionally, the original oesophageal signals are preferred over the signals with the converted spectrum; thus, the signals with a converted spectrum are the least preferred. This can be due to the loss of quality introduced by the conversion of the MCEP coefficients. Figure 4 shows the degree of preference for each pair of systems. The figure shows that the 'strong preference' option is the least chosen option in all three pairs. Also, it can be seen that the f0-only modification is the most preferred method, followed by no modification (original signals).

A few examples of the converted utterances can be found on the following website: http://aholab.ehu.eus/users/lserrano/ib18/demoib18.html.

Figure 4: Detailed results of the preference test (-2: I strongly prefer sentence 1, -1: I prefer sentence 1, 0: I don't have any preference, 1: I prefer sentence 2, 2: I strongly prefer sentence 2).

4. Conclusions

In this paper we have presented a conversion system from one oesophageal speaker to one healthy speaker using a deep learning approach. We have trained an LSTM network to convert the spectral features of the speaker, and two BLSTMs to estimate the intonation curve: one to learn the underlying relationship existing between the healthy MCEPs and their corresponding logf0, and another to predict the V/U decision from the same healthy cepstrum. With the system implemented, we have evaluated the effects of the conversion in comparison with the original pathological signal in terms of intelligibility and pleasantness. The effect of the conversion of the prosody has been evaluated separately from the complete conversion (spectral and f0 conversion together).

The intelligibility has been evaluated by means of ASR, using the WER. The results show that using a healthy target speaker the recognition rate can be greatly improved (as much as an absolute gain of 15% in our experiment). The WER is still far from that obtained for the healthy voices, but this demonstrates that the conversion of the spectral parameters helps the oesophageal signal to be more similar to the healthy voices used to train the ASR system.

As expected, the f0-only modification has little or no effect on the recognition rate. However, this modification was the most preferred by the listeners in the perceptual test, even over the original oesophageal voice with no modifications at all. This is an important result because it corroborates the relevance of having a restored source without modifying the speaker characteristics.

Once the deep learning conversion approach has been proved valid, in the future we will better adjust the LSTM parameters, for example the sequence size, to capture the prosody of a speaker in a more adequate way. We also plan to experiment with different network architectures.

5. Acknowledgements

This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (BerbaOla project, KK-2018/00014) and by the European Union's H2020 research and innovation programme under the Marie Curie European Training Network ENRICH (675324).

6. References

[1] B. Weinberg, "Acoustical properties of esophageal and tracheoesophageal speech," Laryngectomee Rehabilitation, pp. 113-127, 1986.
[2] T. Most, Y. Tobin, and R. C. Mimran, "Acoustic and perceptual characteristics of esophageal and tracheoesophageal speech production," Journal of Communication Disorders, vol. 33, no. 2, pp. 165-181, 2000.
[3] T. Drugman, M. Rijckaert, C. Janssens, and M. Remacle, "Tracheoesophageal speech: A dedicated objective acoustic assessment," Computer Speech & Language, vol. 30, no. 1, pp. 16-31, 2015.
[4] R. Ishaq and B. G. Zapirain, "Esophageal speech enhancement using modified voicing source," in Signal Processing and Information Technology (ISSPIT), 2013 IEEE International Symposium on. IEEE, 2013, pp. 000 210-000 214.
[5] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, "Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec," IEEE Transactions on Biomedical Engineering, vol. 57, no. 10, pp. 2448-2458, 2010.
[6] A. del Pozo and S. Young, "Continuous tracheoesophageal speech repair," in Eusipco, 2006, pp. 1-5.
[7] A. del Pozo and S. Young, "Repairing tracheoesophageal speech duration," in Speech Prosody, 2008, pp. 187-190.
[8] O. Schleusing, R. Vetter, P. Renevey, J.-M. Vesin, and V. Schweizer, "Prosodic speech restoration device: Glottal excitation restoration using a multi-resolution approach," in International Joint Conference on Biomedical Engineering Systems and Technologies. Springer, 2010, pp. 177-188.
[9] H. Doi, K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Statistical approach to enhancing esophageal speech based on Gaussian mixture models," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4250-4253.
[10] H. Doi, K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Esophageal speech enhancement based on statistical voice conversion with Gaussian mixture models," IEICE Transactions on Information and Systems, vol. 93, no. 9, pp. 2472-2482, 2010.
[11] H. Doi, T. Toda, K. Nakamura, H. Saruwatari, and K. Shikano, "Alaryngeal speech enhancement based on one-to-many eigenvoice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 172-183, 2014.
[12] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.
[13] D. Erro, E. Navas, and I. Hernaez, "Parametric voice conversion based on bilinear frequency warping plus amplitude scaling," IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 3, pp. 556-566, 2013.
[14] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, "Voice conversion using deep neural networks with layer-wise generative training," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 12, pp. 1859-1872, 2014.
[15] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65-82, 2017.
[16] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71-76, 1990.
[17] L. M. Arslan, "Speaker transformation algorithm using segmental codebooks (STASC)," Speech Communication, vol. 28, no. 3, pp. 211-226, 1999.
[18] A. Bonafonte, A. Kain, J. v. Santen, and H. Duxans, "Including dynamic and phonetic information in voice conversion systems," in Eighth International Conference on Spoken Language Processing, 2004.
[19] C.-H. Lee, C.-H. Wu, and J.-C. Guo, "Pronunciation variation generation for spontaneous speech synthesis using state-based voice transformation," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4826-4829.
[20] H. Zen, Y. Nankaku, and K. Tokuda, "Continuous stochastic feature mapping based on trajectory HMMs," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 417-430, 2011.
[21] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.
[22] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, "Voice conversion using partial least squares regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 912-921, 2010.
[23] H. Benisty and D. Malah, "Voice conversion using GMM with enhanced global variance," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[24] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, and Z. Yang, "Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data," Speech Communication, vol. 58, pp. 124-138, 2014.
[25] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of formants for voice conversion using artificial neural networks," Speech Communication, vol. 16, no. 2, pp. 207-216, 1995.
[26] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954-964, 2010.
[27] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4869-4873.
[28] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng, "Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35-52, 2015.
[29] D. Erro, I. Sainz, E. Navas, and I. Hernaez, "Harmonics plus noise model based vocoder for statistical parametric speech synthesis," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 184-194, 2014.
[30] P. Alku, "Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering," Speech Communication, vol. 11, no. 2-3, pp. 109-118, 1992.
[31] M. Kishimoto, T. Toda, H. Doi, S. Sakti, and S. Nakamura, "Model training using parallel data with mismatched pause positions in statistical esophageal speech enhancement," in Signal Processing (ICSP), 2012 IEEE 11th International Conference on, vol. 1. IEEE, 2012, pp. 590-594.
[32] D. Erro, E. Navas, and I. Hernáez, "Iterative MMSE estimation of vocal tract length normalization factors for voice transformation," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[33] D. Erro, A. Alonso, L. Serrano, D. Tavarez, I. Odriozola, X. Sarasola, E. del Blanco, J. Sánchez, I. Saratxaga, E. Navas et al., "ML parameter generation with a reformulated MGE training criterion - participation in the Voice Conversion Challenge 2016," in Interspeech, 2016.
[34] J. A. Gonzalez, L. A. Cheah, A. M. Gomez, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Direct speech reconstruction from articulatory sensor data by machine learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2362-2374, 2017.
[35] I. Sainz, D. Erro, E. Navas, I. Hernáez, J. Sanchez, I. Saratxaga, and I. Odriozola, "Versatile speech databases for high quality synthesis for Basque," in 8th International Conference on Language Resources and Evaluation (LREC), 2012, pp. 3308-3312. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2012/pdf/126_Paper.pdf
[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[37] S. P. Rath, D. Povey, K. Veselý, and J. Cernocký, "Improved feature processing for deep neural networks," in Interspeech, 2013, pp. 109-113.
[38] L. Serrano, D. Tavarez, I. Odriozola, I. Hernaez, and I. Saratxaga, "Aholab system for Albayzin 2016 search-on-speech evaluation," in IberSPEECH, 2016, pp. 33-42.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Sign Language Gesture Classification Using Neural Networks


Zuzanna Parcheta1, Carlos-D. Martínez-Hinarejos2
1
Sciling S.L., Carrer del Riu 321, Pinedo, 46012, Spain
2
Pattern Recognition and Human Language Technology Research Center,
Universitat Politècnica de València, Camino de Vera, s/n, 46022, Spain
zparcheta@sciling.com, cmartine@dsic.upv.es

Abstract

Recent studies have demonstrated the power of neural networks in different fields of artificial intelligence. In most fields, such as machine translation or speech recognition, neural networks outperform previously used methods (Hidden Markov Models with Gaussian Mixtures, Statistical Machine Translation, etc.). In this paper, the efficiency of the LeNet convolutional neural network for isolated-word sign language recognition is demonstrated. As a preprocessing step, we apply several techniques to obtain the same dimension for the input that contains the gesture information. The performance of these preprocessing techniques on a Spanish Sign Language dataset is evaluated. These approaches outperform previously obtained results based on Hidden Markov Models.

Index Terms: gesture recognition, human-computer interaction, gesture classification, sign language, Leap Motion, Convolutional Neural Networks

1. Introduction

Nowadays, technology has advanced to an unbelievable point in helping people with almost any kind of disability. People who are deaf or hard of hearing often experience limitations in their everyday life. However, assistive technology helps deaf people in their daily life. There are two main types of assistive devices for deafness: 1) devices to enhance listening, e.g. frequency modulated (FM) systems [1], infrared (IR) systems [2], cochlear implants [3], and hearing aids [4]; and 2) devices to convey information visually, e.g. alerting devices [5], captioning [6], and real-time transcription systems [7]. Apart from that, most deaf people use sign language as their preferred way to communicate. However, there is no tool to help them interact with non-users of sign language. Fortunately, contemporary technology allows for developing systems of automatic translation from sign language into oral language.

To solve this problem it is necessary to create a tool which allows deaf people to communicate with non-users of sign language in a comfortable and fluid way. The researchers' commitment to the deaf community is to improve the quality of the automatic systems that are used for this task by exploring new technologies and recognition patterns.

In this work, we explore Convolutional Neural Networks (CNNs) to perform gesture classification for isolated words coming from a sign language database generated using the Leap Motion1 sensor. Previous work [8, 9] demonstrated that it is possible to classify this dynamic gesture database using Hidden Markov Models with a 10.6% error rate. In our current work, this score is significantly improved to an error rate of 8.6%.

The article is organised as follows: Section 2 presents the related work on sign language recognition; Section 3 details the experimental setup (database, models, partitions, etc.); Section 4 details the experiment parameters and their corresponding results; finally, Section 5 offers conclusions and future work lines.

2. Related work

The first step in developing data-based gesture recognition systems is data acquisition. For data acquisition there are devices available which use accelerometers [10], electromyography (the physical principle of the Myo armband) [11], or infrared light (the physical principle of the Kinect and the Leap Motion) [12, 13]. Also, to get gesture data it is possible to use cameras worn on an arm [14] or data gloves [15]. Publications about gesture recognition can be divided into two main groups depending on whether signs are static or dynamic.

There are quite a few studies about static sign recognition. In [16], video capture and Multilayer Perceptron (MLP) neural networks are used to recognise 23 static signs from the Colombian Sign Language. In [17], Gaussian Mixture Models are applied to recognise 5 static signs obtained from image captures.

Some authors also consider dynamic gesture recognition. In [18], the Microsoft Kinect is used to obtain features to identify 8 different signs using weighted Dynamic Time Warping (DTW). The latter approach is mentioned in [8], where a combination of k-Nearest Neighbour classifiers and DTW allows the recognition of 91 signs extracted by using the Leap Motion sensor. This work was continued in [9], where different topologies of Continuous Density Hidden Markov Models were applied. [19] describes a system to recognise very simple dynamic gestures which uses Normalised Dynamic Time Warping. In [20], the authors introduce a new method for gesture recognition, called Principal Motion Components (PMC); however, they experiment with gestures not used in sign language.

Neural networks have gained a lot of interest in recent years. Consequently, deep architectures for gesture recognition have appeared. In [21], the authors explore convolutional and bidirectional neural network architectures; in their experiments, they use the Montalbano2 dataset, which contains 20 Italian sign gesture categories. In [22], the authors modified a Neural Machine Translation (NMT) system using the standard sequence-to-sequence framework to translate sign language from videos to text. Also in [23], a deep architecture is applied to tasks such as gesture recognition, activity recognition and continuous sign language recognition; the authors employ bidirectional recurrent neural networks, in the form of Long Short-Term Memory (LSTM) units.

1 https://www.leapmotion.com/
2 http://chalearnlap.cvc.uab.es/dataset/13/description/
10.21437/IberSPEECH.2018-27
In this work, Convolutional Neural Networks are used to classify gestures from the Spanish sign language. Some improvements in recognition accuracy are obtained with respect to the results from our previous studies [8, 9].

3. Experimental setup

3.1. Data preparation

The experimental data is the isolated-gestures subset of the "Spanish sign language db"3, which is described in [8]. All gestures in this database are dynamic. Data was acquired using the Leap Motion sensor. Gestures are described with sequences of 21 variables per hand. This gives a sequence of 42-dimensional feature vectors per gesture. The isolated-gesture subset is formed of samples corresponding to 92 words; for each word, 40 examples performed by 4 different people were acquired, giving a total number of 3,680 acquisitions. The dataset was divided into 4 balanced partitions with the purpose of conducting cross-validation experiments in the same fashion as those described in [9].

Data corresponding to each gesture is saved in a separate text file. Each row in the file contains the information (42 feature values) about the gesture in a given frame. Gestures have different numbers of frames; therefore, a preprocessing step is required, because fixed-length data is needed to train the neural network. Three different methods were used to fix the number of rows of each gesture to an equal matrix length max_length:

- Trim the data to a fixed length (keeping the last max_length vectors) or pad with 0 (at the beginning) to get the same number of rows.
- Our implementation of the trace segmentation technique [25].
- Linear interpolation of the data using the interp1d method from the scikit-learn library.

3.2. Model

In our experiments we use the LeNet network [26]. LeNet is a convolutional network that has several applications, such as handwriting recognition or speech recognition. The architecture of this network is characterised by ensuring invariance to some degree of shift, scale, and distortion.

In our case, the input plane receives the gesture data, and each unit in a layer receives inputs from a set of units located in a small neighbourhood in the previous layer. In Figure 1 the scheme of the LeNet architecture is shown. The first layer is a 2-dimensional convolution layer which learns convolution filters, where each filter is 20 x 20 units. After that, the ReLU activation function is applied, followed by a max-pooling with kernel size of 2. Once again, a convolution layer is applied, but this time with 20 convolution filters; the ReLU activation function and max-pooling are applied as before. A dropout layer with p = 0.5 is added and, finally, a fully-connected layer is used. The number of input features of this layer depends on the max_length parameter, and it is calculated following Equation (1):

    ((max_length - 12) / 4) * 140    (1)

The number of output features is fixed to 1,000. Finally, there is an output layer with 92 neurons corresponding to the number of classes.

Table 1: Results with trim/zero padding preprocessing. Best results marked with "*".

                              Max length
    Patience    40    50    60    70    80    90    97    100
        5      12.0  13.1  12.0  12.5  11.6  12.7  12.6  11.4
       10      11.0  11.0  10.1  11.6  10.6  11.8  11.4  11.2
       15      10.8  11.2   9.8  11.3   9.8  11.4  11.1  10.3
       20      10.5  10.5  10.5  10.9   9.4  11.1  10.8   9.9
       25      10.4  10.0   9.6  10.9   9.5  10.5  11.1  10.2
       30      11.1  10.6   9.4  10.6  *9.3  11.9  10.2  10.3
       35      10.9  10.7  10.0  10.5   9.4  11.0  10.3  10.2
       40      10.7  10.4   9.9  10.5  *9.3  10.8  10.3   9.9
       45      11.2  10.1  10.4  10.5   9.7  11.0  10.3  10.1
       50      11.2  10.1  10.3  10.5   9.6  10.5  10.3  10.2

4. Experiments and results

We conducted experiments using our LeNet network. In each experiment we varied the value of the patience p = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50} and max_length = {40, 50, 60, 70, 80, 90, 97, 100}. Patience, the parameter of early stopping, is a technique for controlling overfitting in neural networks by stopping training before the weights have converged: training stops when the performance has not improved for a given number of epochs. The original code and LeNet configuration is dedicated to speech recognition, where the maximum length was fixed to 97 by default; that is the reason why this value is included in the experiments. The experiments are compared against the baseline provided by [9], which used Hidden Markov Models (HMM) to obtain a classification error rate of 10.6±0.9. Confidence intervals were calculated by using the bootstrapping method with 10,000 repetitions [27], and they are all around the same value for all the experiments (±0.9). All the results shown in the following are classification error rates obtained through cross-validation using the same four partitions stated in [9].

For each experiment we used the three different data preprocessing techniques described above. Table 1 shows the gesture classification results using the trim/zero padding preprocessing technique. A colour scale was added to make it easier to see the classification performance depending on the patience and max_length values: dark colours correspond to higher error rates and light colours to lower error rates. We can appreciate that the best score is obtained with a patience of 30 or 40 epochs and a max_length of 80 rows (9.3% error rate). In general, the higher the patience, the lower the error, but only up to a certain value. With respect to max_length, the behaviour does not present a clear pattern.

In Table 2 we show the most frequently confused gestures in gesture classification using trim/zero padding as the preprocessing step. The confusions come from the global confusion matrix generated during the cross-validation experiments.

3 https://github.com/Sasanita/spanish-sign-language-db
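To make the architecture of Section 3.2 concrete, the following is one plausible Keras rendering of the classifier; it is a sketch, not the authors' code. The text fixes 20 filters in the second convolution, max-pooling of size 2, dropout p = 0.5, a 1,000-unit fully-connected layer and 92 output classes, but it does not fully specify the kernel sizes or the number of filters in the first convolution. Here 20 filters with 5 x 5 kernels are assumed for both convolutions, a choice that makes the flattened size come out as ((max_length - 12)/4) * (7 * 20), i.e. Equation (1).

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Dropout, Flatten, Dense

def lenet_gesture_classifier(max_length, n_features=42, n_classes=92):
    """LeNet-style CNN over (frames x features) gesture matrices."""
    model = Sequential()
    model.add(Conv2D(20, (5, 5), input_shape=(max_length, n_features, 1)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(20, (5, 5)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.5))
    model.add(Flatten())                  # ((max_length - 12) / 4) * 140 features
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Example: model for sequences padded/trimmed to 80 frames of 42 features.
# model = lenet_gesture_classifier(max_length=80)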
Figure 1: Schematic diagram of the LeNet architecture [24].

Table 2: Most frequent confused gestures in gesture classification using trim/zero padding preprocessing.

    Num. of confusions    Reference      Recognition
            9             sign           a lot
            9             thank you      nineteen
            8             thirteen       fourteen
            8             sign           regular
            7             you (male)     we (male)
            7             one            eleven
            7             red            eyes

Table 3: Results with trace-segmentation preprocessing. Best result marked with "*".

                              Max length
    Patience    40    50    60    70    80    90    97    100
        5      12.5  12.6  13.2  11.9  11.5  11.1  12.4  11.4
       10      11.4  10.8  11.5  10.5  12.6  11.7  10.7  11.1
       15      11.2  11.3  11.3  11.3  10.6  10.3  10.9  10.0
       20      11.1   9.8  10.7  10.5  10.8  10.2  10.0  11.0
       25      10.2   9.7  10.3  10.1  10.7  10.1   9.7   9.7
       30      10.7   9.7   9.9  10.4  11.1  10.2   9.7   9.8
       35      10.7   9.3   9.9  10.5  10.7  10.3   9.3   9.6
       40      10.1   9.7  10.0  10.1  10.0   9.5   9.6   9.7
       45      10.0   9.9  10.0  10.7  10.1  10.0   9.4  *9.2
       50      10.2   9.6  10.0  10.1  10.1  10.2   9.6   9.8

Table 4: Most frequent confused gestures in gesture classification using trace-segmentation preprocessing.

    Num. of confusions    Reference       Recognition
           10             one             no
           10             sign            a lot
           10             red             eyes
           10             gray            colour
            9             no              one
            9             thank you       study
            8             be born         good morning
            8             eighteen        nineteen
            7             eyes            red
            7             five            hello

In Table 3, the trace-segmentation preprocessing results are shown. In this case, the best score is obtained with a patience of 45 epochs and a max_length of 100 rows (9.2% error rate). This result is slightly better than the best score obtained with the trim/zero padding technique. However, according to the confidence intervals, neither trim/padding nor trace-segmentation preprocessing provides statistically significant improvements with respect to the best result obtained with HMMs. With this technique, the behaviour for patience is similar, but it seems that the higher the max_length, the lower the error.

In Table 4 we show the most frequently confused gestures in gesture classification using trace-segmentation as the preprocessing step. Some of the confused gestures match the confusions from Table 2.

The last experiments were conducted using the interpolation technique as the preprocessing step. This time the best result was obtained with a patience of 25 epochs and a max_length of 97. These are clearly the best results, since the lowest error rate is 8.6%, which is significantly better than the HMM baseline (which did not happen with the other preprocessing techniques). Therefore, the previous result from [9] is improved by about 2% in absolute terms. This result is considered to be very satisfactory.

In Table 6 we show the most frequently confused gestures in gesture classification using interpolation as the preprocessing step. We can appreciate that some of the confusions match across the confusion tables, e.g. "sign" confused with the "a lot" sign. For most of the confused signs there is some similarity in the shape of the hands or in the trajectory of the gesture; e.g. the signs "hello" and "five" have the same hand shape, except that the sign "hello" makes a swinging movement. We show images of both signs in Figure 2.4 Exactly the same thing happens with the signs "one" and "no", where the only difference between them is the swinging movement.

4 Images taken from the online sign language dictionary https://www.spreadthesign.com/es/.
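The confusion tables (Tables 2, 4 and 6) can be obtained directly from the cross-validation predictions. The snippet below is an illustrative way of extracting the most frequent (reference, recognition) pairs from predicted and true class indices; the variable names are hypothetical and this is not the authors' evaluation code.

import numpy as np

def top_confusions(y_true, y_pred, labels, k=10):
    """Return the k most frequent off-diagonal (reference, recognition) pairs."""
    n = len(labels)
    cm = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    np.fill_diagonal(cm, 0)          # keep only the classification errors
    pairs = [(cm[i, j], labels[i], labels[j])
             for i in range(n) for j in range(n) if cm[i, j] > 0]
    return sorted(pairs, reverse=True)[:k]

# Example (hypothetical arrays of class indices pooled over the four folds):
# for count, ref, rec in top_confusions(all_refs, all_hyps, gesture_names):
#     print(count, ref, rec)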
Table 5: Results with interpolation preprocessing. Best result marked with "*".

                              Max length
    Patience    40    50    60    70    80    90    97    100
        5      12.1  11.4  12.3  11.6  12.0  11.3  10.7  12.7
       10      11.1  11.4  11.2  10.6  10.9  10.1   9.9  10.6
       15      10.3   9.5  11.9  10.0  10.3  10.1  10.1   9.2
       20      10.5   9.3  10.3   9.4  11.1  10.0   9.5   9.0
       25       9.6   9.8   9.9   9.5   9.8   9.9  *8.6   9.6
       30      10.0  10.1  10.3   9.9   9.6   9.7   8.9   9.1
       35       9.7  10.0   9.8   9.6   9.8   9.2   8.7   9.0
       40      10.0   9.7  10.0   9.6   9.6   9.2   8.9   9.0
       45      10.0   9.7   9.9   9.4   9.8   9.5   9.2   9.6
       50      10.0   9.7   9.9   9.5   9.6   9.5   9.1   9.0

Table 6: Most frequent confused gestures in gesture classification using interpolation preprocessing.

    Num. of confusions    Reference        Recognition
           10             sign             a lot
           10             eighteen         nineteen
            9             red              eyes
            9             he               brother
            8             good morning     be born
            7             one              no
            7             do not know      no
            7             hello            five

Figure 2: Pair of confused signs: (a) "hello" sign; (b) "five" sign.

5. Conclusion and future work

In this work, we developed a classification system for the Spanish sign language using the LeNet convolutional network. As a preprocessing step we employed three different methods. The dataset used in this work is the "Spanish sign language db", which contains data from 92 different gestures captured with the Leap Motion sensor. The results obtained are very satisfactory, since we were able to improve our baseline by lowering the error rate by 2%, to a level of 8.6%. We also present tables containing the most frequently confused gestures for the different preprocessing methods.

As we obtained such a good result, we would like to continue working in this area. We now want to work on sign language sentence recognition, to be able to create a system that recognises a signed sentence composed of several signs. For that, we will use Recurrent Neural Networks (RNNs), commonly used in machine translation and speech recognition. We will also conduct experiments on isolated gesture classification to compare the performance of LeNet and RNNs.

6. Acknowledgements

Work partially supported by MINECO under grant DI-15-08169, by Sciling under its R+D programme, and by MINECO/FEDER under project CoMUN-HaT (TIN2015-70924-C2-1-R). The authors would like to thank NVIDIA for the donation of a Titan Xp GPU that allowed us to conduct this research.

7. References

[1] American Speech-Language-Hearing Association, "Guidelines for Fitting and Monitoring FM Systems," https://bit.ly/2udMuOs, 2002.
[2] M. Kaine-Krolak and M. E. Novak, "An Introduction to Infrared Technology: Applications in the Home, Classroom, Workplace, and Beyond ..." https://bit.ly/2udMuOs, 1995.
[3] National Institute on Deafness and Other Communication Disorders, "Cochlear Implants," https://bit.ly/27R6aWd, 2016.
[4] National Institute on Deafness and Other Communication Disorders, "Hearing Aids," https://bit.ly/1UwxUYN, 2013.
[5] S. Jones, "Alerting Devices," https://bit.ly/2Eri6Y8, 2018.
[6] National Association of the Deaf, "Captioning for Access," https://bit.ly/2NCpLVi.
[7] The Canadian Hearing Society, "Speech to text transcription (CART Communication Access Realtime Translation)," https://bit.ly/2N2QhFV.
[8] Z. Parcheta and C.-D. Martínez-Hinarejos, "Sign language gesture recognition using HMM," in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2017, pp. 419-426.
[9] C.-D. Martínez-Hinarejos and Z. Parcheta, "Spanish Sign Language Recognition with Different Topology Hidden Markov Models," in Proc. Interspeech 2017, 2017, pp. 3349-3353.
[10] V. E. Kosmidou, P. C. Petrantonakis, and L. J. Hadjileontiadis, "Enhanced sign language recognition using weighted intrinsic-mode entropy and signer's level of deafness," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 6, pp. 1531-1543, 2011.
[11] "MYO armband," https://www.myo.com/.
[12] Z. Zhang, "Microsoft Kinect sensor and its effect," IEEE MultiMedia, vol. 19, no. 2, pp. 4-10, 2012.
[13] L. E. Potter, J. Araullo, and L. Carter, "The Leap Motion controller: a view on sign language," in Proceedings of the 25th Australian Computer-Human Interaction Conference: Augmentation, Application, Innovation, Collaboration. ACM, 2013, pp. 175-178.
[14] D. Kim, O. Hilliges, S. Izadi, A. D. Butler, J. Chen, I. Oikonomidis, and P. Olivier, "Digits: freehand 3D interactions anywhere using a wrist-worn gloveless sensor," in Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology. ACM, 2012, pp. 167-176.
[15] T. Kuroda, Y. Tabata, A. Goto, H. Ikuta, M. Murakami et al., "Consumer price data-glove for sign language recognition," in Proc. of 5th Intl Conf. Disability, Virtual Reality Assoc. Tech., Oxford, UK, 2004, pp. 253-258.
[16] J. D. Guerrero-Balaguera and W. J. Pérez-Holguín, "FPGA-based translation system from Colombian sign language to text," Dyna, vol. 82, no. 189, pp. 172-181, 2015.
[17] F.-H. Chou and Y.-C. Su, "An encoding and identification approach for the static sign language recognition," in Advanced Intelligent Mechatronics (AIM), 2012 IEEE/ASME International Conference on. IEEE, 2012, pp. 885-889.
[18] S. Celebi, A. S. Aydin, T. T. Temiz, and T. Arici, "Gesture Recognition Using Skeleton Data with Weighted Dynamic Time Warping," in VISAPP, 2013, pp. 620-625.
[19] Y. Zou, J. Xiao, J. Han, K. Wu, Y. Li, and L. M. Ni, "GRfid: A device-free RFID-based gesture recognition system," IEEE Transactions on Mobile Computing, vol. 16, no. 2, pp. 381-393, 2017.
[20] H. J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, and J. Wan, "Principal motion components for one-shot gesture recognition," Pattern Analysis and Applications, vol. 20, no. 1, pp. 167-182, 2017.
130
[21] L. Pigou, A. Van Den Oord, S. Dieleman, M. Van Herreweghe,
and J. Dambre, “Beyond temporal pooling: Recurrence and tem-
poral convolutions for gesture recognition in video,” International
Journal of Computer Vision, vol. 126, no. 2-4, pp. 430–439, 2018.
[22] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bow-
den, “Neural sign language translation,” CVPR 2018 Proceedings,
2018.
[23] O. Koller, S. Zargaran, and H. Ney, “Re-sign: Re-aligned end-to-
end sequence modelling with deep recurrent cnn-hmms,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR),
2017.
[24] H. Dwyer, “Deep Learning With Dataiku Data Science Studio,”
https://bit.ly/2mfx8Wd.
[25] E. F. Cabral and G. D. Tattersall, “Trace-segmentation of isolated
utterances for speech recognition,” in 1995 International Confer-
ence on Acoustics, Speech, and Signal Processing, vol. 1, May
1995, pp. 365–368 vol.1.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proceedings of the
IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[27] M. Bisani and H. Ney, “Bootstrap estimates for confidence inter-
vals in ASR performance evaluation,” in Proc. of ICASSP, vol. 1,
2004, pp. 409–412.

131
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Influence of tense, modal and lax phonation on the three-dimensional finite


element synthesis of vowel [A]
Marc Freixes, Marc Arnela, Joan Claudi Socoró, Francesc Alías, Oriol Guasch

GTM – Grup de Recerca en Tecnologies Mèdia, La Salle - Universitat Ramon Llull


Quatre Camins, 30, 08022 Barcelona, Spain
{marc.freixes,marc.arnela,joanclaudi.socoro,francesc.alias,oriol.guasch}@salle.url.edu

Abstract acoustic model is used. As shown in [6], a straightened vocal


One-dimensional articulatory speech models have long tract based on circular cross-sections prevents the onset of such
been used to generate synthetic voice. These models assume modes due to radial symmetry, in contrast to what occurs for
plane wave propagation within the vocal tract, which holds for realistic vocal tract geometries based on MRI data. Other vocal
frequencies up to ∼5 kHz. However, higher order modes also tract geometries simplifications were studied in that work, all
propagate beyond this limit, which may be relevant to produce of them showing large variations in the HFE while keeping a
a more natural voice. Such modes could be especially impor- similar behaviour for low frequencies. One can then assert that
tant for phonation types with significant high frequency energy the vocal tract shape is determinant for the HFE content of the
(HFE) content. In this work, we study the influence of tense, generated sound. However, the vocal tract shape is not the only
modal and lax phonation on the synthesis of vowel [A] through factor affecting the HFE. The type of phonation can also modify
3D finite element modelling (FEM). The three phonation types it, as shown for instance in [11] for sustained vowels with loud
are reproduced with an LF (Liljencrants-Fant) model controlled and soft phonation.
by the Rd glottal shape parameter. The onset of the higher In this work we study the effect of tense, modal and lax
order modes essentially depends on the vocal tract geometry. phonation on the synthesis of vowel [A], paying special atten-
Two of them are considered, a realistic vocal tract obtained tion to the HFE content. These three phonation types are repro-
from MRI and a simplified straight duct with varying circular duced using an LF (Liljencrants-Fant) model [12]. Although
cross-sections. Long-term average spectra are computed from this model cannot consider the interaction between the vocal
the FEM synthesised [A] vowels, extracting the overall sound tract and the vocal folds [13, 14], it has proved to be useful to
pressure level and the HFE level in the 8 kHz octave band. Re- explore the phonatory tense-lax continuum [15] by controlling
sults indicate that higher order modes may be perceptually rel- the Rd glottal shape parameter [16]. Regarding the vocal tract,
evant for the tense and modal voice qualities, but not for the lax we consider an MRI-based realistic geometry, and its simplified
phonation. counterpart considering circular cross-sections in a straightened
Index Terms: voice production, higher order modes, high fre- midline [6]. This allows us to somewhat "activate" and "deac-
quency energy, glottal source modelling, LF model, numerical tivate" the higher order modes. Different versions of vowel [A]
simulation, finite element method are generated by convolving the LF glottal source signals with
the vocal tract impulses responses obtained using a 3D acous-
tic model based on the Finite Element Method (FEM) [17]. In
1. Introduction order to analyse the relevance of the higher order propagation
For many years, works on articulatory speech synthesis have modes for the lax, modal and tense phonation, the long-term
considered a simplified one-dimensional (1D) representation of average spectra (LTAS) and the HFE levels of the synthesised
the vocal tract. This is built from the so-called vocal tract vowels are computed and compared.
area functions, which describe the cross-sectional area varia- The paper is structured as follows. The methodology used
tions along the vocal tract center midline (see e.g., [1]). Voice to study the production of vowel [A] with the three phonation
is then synthesised by simulating the propagation of acous- types and the two vocal tract geometries is explained in Sec-
tic waves within this 1D representation of the vocal tract (see tion 2. Next, the obtained results are discussed in Section 3.
e.g., [2, 3, 4]). However, 1D approaches assume plane wave Finally, conclusions and future work are presented in Section 4.
propagation, so they can only correctly approximate the acous-
tics of the vocal tract in the frequency range below 4-5 kHz.
Beyond this limit, not only planar modes get excited but also
2. Methodology
higher order propagation modes appear, which strongly change Figure 1 depicts the process followed to synthesise six ver-
the high frequency energy (HFE) content of the spectrum [5, 6] sions of vowel [A]. These were obtained by convolving three
compared to that from a 1D model. Although the high fre- glottal source signals with the FEM impulse responses of two
quency range has not received much attention in the literature, vocal tract geometries that produce this vowel. In particular,
some recent studies point out that the HFE may be relevant for and as mentioned before, we used the realistic vocal tract and
voice quality, speech localisation, speaker recognition and in- the simplified straightened simplification with circular cross-
telligibility (see [7] and references therein). sections from [6] (see Section 2.1), and computed their impulse
On the other hand, three-dimensional (3D) models do not responses h(t) using the FEM (see Section 2.2). The glottal
need to assume plane wave propagation, since they can di- source signals ug (t) were generated by means of an Rd con-
rectly deal with 3D vocal tract geometries to emulate the com- trolled LF model. The values Rd = 0.3, 1 and 2.7 were selected
plex acoustic field generated during voice production [8, 9, 10]. from the Rd range [0.3, 2.7] (see [16]) to reproduce a tense, a
However, higher order modes do not always appear even if a 3D modal, and a lax phonation, respectively (see Section 2.3).

132 10.21437/IberSPEECH.2018-28
120

110

100

H ( f ) [dB]
90

80
realistic
simplified
70

60

50
0 2 4 6 8 10 12
Frequency [kHz]

Figure 2: Vocal tract transfer function H(f ) for the vowel [A]
with the realistic and simplified vocal tract geometries.

then linearly interpolated. The two configurations are hereafter


referred as the realistic and the simplified vocal tracts.
Figure 1: Synthesis of vowel [A] with a realistic vocal tract ge- 2.2. Vocal Tract impulse response
ometry (above) and its simplified counterpart of circular cross-
sections in a straightened midline (below). Three phonation The impulse response of each vocal tract geometry was com-
types are considered to reproduce a tense (dashed red line), a puted using a custom finite element code that numerically
modal (solid black line) and a lax (dotted green line) voice pro- solves the acoustic wave equation,
duction. The output pressure signal p(t) is computed as the 2
convolution of the glottal source ug (t) with the vocal tract im- ∂tt p − c20 ∇2 p = 0, (1)
pulse response h(t) obtained from 3D FEM simulations.
combined with a Perfectly Matched Layer (PML) to account
for free-field propagation [17]. In Eq. (1) p(x, t) is the acoustic
2
pressure, ∂tt stands for the second order time derivative, and c0
For each vowel, the LTAS was computed as the Welch’s is the speed of sound which is set to the usual value of 350 m/s.
power spectral density estimate, with a 15 ms hamming win- A Gaussian pulse was introduced on the glottal cross-sectional
dow, 50% overlap and a 2048-point FFT. The overall energy area as an input volume velocity ug (t). This pulse is of the type
levels and the HFE levels in the 8 kHz octave band were also
extracted as in [11]. The 16 kHz octave band was not consid- 2
ug (t) = e−[(t−Tgp )/0.29Tgp ] [m3 /s], (2)
ered, since HFE changes in this frequency range were found
almost perceptually irrelevant in [11]. with Tgp = 0.646/fc and fc = 10 kHz. Wall losses were
considered by imposing a boundary admittance coefficient of
2.1. Vocal Tract Geometries µ = 0.005 on the vocal tract walls. A 20 ms simulation
was then performed capturing the acoustic pressure p0 (t) at a
Two vocal tract geometry simplifications of vowel [A] have
node located outside of the vocal tract, 4 cm away from the
been employed in this work, namely, the realistic configuration
mouth aperture center. The sampling frequency was set to
and the simplified straight vocal tract with circular shape (see
fs = 8000 kHz, which ensures a restrictive stability condition
Fig. 1). These geometries were obtained in [6] by simplifying
of the Courant-Friedrich-Levy type required by explicit numer-
the MRI-based vocal tract geometries in [18]. In a nutshell,
ical schemes (see [17] for details on the numerical scheme).
the procedure consisted in the following. First, the subglot-
A vocal tract transfer function H(f ) was computed from
tal tube, the face and the lips were removed from the original
each simulation to compensate for the slight energy decay in
geometry (see [10] for a detailed analysis of the lips influence
frequency of the Gaussian pulse. This is defined as
on simulations). Moreover, side branches such as the piriform
fossae and valleculae were occluded (see e.g. [9, 19] for their Po (f )
acoustic effects). Cross-sections were next extracted as typi- H(f ) = , (3)
Ug (f )
cally done to generate 1D area functions, but preserving their
shapes and locations in the vocal tract midline. The realistic with Po (f ) and Ug (f ) being the Fourier Transform of po (t) and
configuration was generated by linearly interpolating the result- ug (t), respectively. H(f ) was computed up to 12 kHz, to al-
ing cross-sections. As shown in [6], this simplification provides low the calculation of HFE level in the 8 kHz octave band [11].
very similar results to the original MRI-based vocal tract geom- The vocal tract transfer functions H(f ) for the realistic and the
etry without branches. simplified geometries of vowel [A] are shown in Fig. 2 (also
In the simplified straight vocal tract configuration, the reported in [6], but only up to 10 kHz). As can be observed,
cross-sectional shapes were modified to be that of a circle, pre- planar modes are mainly produced below 5 kHz giving place to
serving the same area. These circular cross-sections were lo- the first vowel formants. Beyond this value, higher order modes
cated in a straightened version of the vocal tract midline and can also propagate, resulting in the more complex spectrum of

133
Tense (R d =0.3)
Modal (R d =1)
Lax (R d =2.7)

0
0 5 10 15 20 25 30 35 40
Time [ms]
(a) Waveform

Figure 3: Glottal flow ug (t) and its time derivative u0g (t) ac- 0
cording to the LF model [12]. Tense (R d =0.3)
-20 Modal (R d =1)
Lax (R d =2.7)
-40
the realistic geometry. Note, however, that these modes do not

LTAS [dB]
appear in the spectrum of the simplified configuration. The ra- -60
dial symmetry of this geometry prevents their onset [5, 6].
-80
Finally, the inverse Discrete Fourier Transform was applied
to the vocal tract transfer functions H(f ) to obtain the vocal -100
tract impulse responses h(t) of the two geometries (see Fig. 1).
-120
2.3. Voice Source Signal
-140
An LF model [12] was used to produce the voice source sig- 0 2 4 6 8 10 12
nal. This model approximates the glottal flow ug (t) and its time Frequency [kHz]
derivative u0g (t) in terms of four parameters (Tp , Te , Ta , Ee ) (b) Long-term average spectrum
that describe its time-domain properties (see Fig. 3). The con-
trol of this model can be simplified with the single glottal shape Figure 4: Glottal source for a tense (Rd = 0.3), a modal
parameter Rd [16]. This is defined as (Rd = 1), and a lax (Rd = 2.7) phonation.

Td 1 U0 F0
Rd = = , (4)
T0 110 Ee 110 from the 3D FEM simulations of the realistic and simplified vo-
cal tract geometries. The six synthesised vowels are normalised
where Td is the declination time, T0 the period, and F0 the fun- with the same scaling factor to obtain reasonable sound pressure
damental frequency. The declination time Td corresponds to levels. This factor has been selected so as to produce 70 dBSPL
the quotient between the glottal flow peak U0 and the negative in the realistic geometry with a modal phonation (Rd = 1). The
amplitude of the differentiated glottal flow Ee . LTAS have then been computed for each audio.
In this work, we used the Kawahara’s implementation of
Figure 5 shows the obtained LTAS for the six generated
the LF model [20], which generates a free-aliasing excitation
vowels. As also appreciated in the vocal tract transfer functions
source signal. We adapted this model to our purposes, modify-
(see Fig. 2), small differences between geometries are produced
ing the sampling frequency from its original value of 44100 Hz
for frequencies below 5 kHz, whereas beyond this range higher
to 24 kHz. Moreover, we introduced the Rd glottal shape pa-
order modes propagate in the realistic case, thus inducing larger
rameter. This allows one to easily control the voice source with
deviations. This behaviour can be observed for the three phona-
a single parameter, which runs from Rd = 0.3 for a very ad-
tion types. Essentially the glottal source modifies the overall
ducted phonation, to Rd = 2.7 for a very abducted phonation
energy level and also introduces an energy decay in frequency
(see [16]). From the Rd range [0.3, 2.7] two extreme values
(compare Fig. 2 with Fig. 5). This decay, known as the spec-
plus a middle one were chosen. We used Rd = 0.3 to gen-
tral tilt, strongly depends on the phonation type. The laxer
erate a tense phonation, Rd = 2.7 for a lax production, and
the phonation the larger the spectral tilt [16]. Furthermore, the
Rd = 1 for a normal (modal) voice quality. With regard to F0,
voice source also affects the energy balance of the first harmon-
a pitch curve was obtained from a real sustained vowel lasting
ics (below ∼ 500 Hz). For instance, the lax phonation has the
4.4 seconds. This pitch contour was placed around 120 Hz to
lowest overall energy values among all phonation types. How-
generate all the source signals. Figure 4a shows four periods
ever, one can see that the first harmonic (close to 120 Hz) has
of the three simulated voice source waveforms. Moreover, the
larger amplitude levels than the rest of the spectrum, in contrast
LTAS of the glottal source signals are represented in Fig. 4b.
to what occurs for the other phonations.
As observed, the phonation type obviously changes the glottal
pulse shape, thus modifying the spectral energy distribution of HFE levels have been computed by integrating the power
the source signal. spectral density in the 8 kHz octave band, as in [11]. In ad-
dition, the overall energy levels have been calculated follow-
ing the same procedure but for the whole examined frequency
3. Results range.
Six versions of vowel [A] (see Fig. 1) have been generated using The obtained results are listed in Table 1. Note first that in
the three glottal source signals corresponding to a tense, a modal the realistic case with a modal phonation (Rd = 1) the over-
and a lax phonation, and the two impulse responses obtained all level is 70 dBSPL . Remember that this value was fixed to

134
80
vocal tract
Tense (R d =0.3) realistic
60 simplified

Modal (R d =1)
40
LTAS [dB]

20

0 Lax (R d =2.7)

-20

-40

1 2 3 4 5 6 7 8 9 10 11 12
Frequency [kHz]

Figure 5: Long-term average spectra (LTAS) of the FEM synthesised vowel [A] using the realistic and simplified vocal tract geometries
with a tense (Rd = 0.3), a modal (Rd = 1), and a lax (Rd = 2.7) phonation.

Table 1: Overall and High-Frequency Energy (HFE) levels 4. Conclusions


(in dB) obtained in the realistic and simplified vocal tract con-
figurations of vowel [A] with a tense (Rd = 0.3), a modal In this work we have studied the influence of tense, modal and
(Rd = 1), and a lax (Rd = 2.7) phonation. ∆ denotes the lax phonation on the 3D finite element synthesis of vowel [A],
difference between the two vocal tract geometries. considering a realistic and a simplified vocal tract geometry.
The 3D simulations behave very similarly for both geometries
Rd Geometry Overall ∆Overall HFE ∆HFE below 5 kHz, but significant differences appear beyond this fre-
realistic 82.2 41.4 quency because of the rising of higher order propagation modes.
0.3 1.2 5.8 It is worth mentioning that these modes only appear when using
simplified 83.4 47.3
realistic 70.0 14.9 the realistic vocal tract. They induce a reduction of the HFE lev-
1 1.3 5.9 els at the 8 kHz octave band from 5.6 to 5.9 dB, depending on
simplified 71.3 20.8
the phonation type. These differences may be perceptually rele-
realistic 63.5 1.0
2.7 1.4 5.6 vant, according to previous works in the literature. Specifically,
simplified 64.9 6.6
a realistic 3D vocal tract geometry would be required for an
accurate synthesis of vowel [A] through 3D FEM, when trying
to simulate a modal and a tense voice production. Conversely,
when a lax phonation is considered, the influence of higher or-
der propagation may be imperceptible, since the HFE levels are
compute the scaling factor used to normalise the audio files. very small. Therefore, a simpler 1D simulation would suffice in
The overall level variations for the other configurations will thus this case.
correspond to modifications either introduced by the vocal tract Future work will consider other Rd values and geometry
geometry or by the glottal source. As expected, the larger the simplifications as well as other vowels to complete the study.
Rd value (laxer phonation) the smaller the overall levels. Finally, we also plan to include aspiration noise in the LF model
to evaluate its impact on the HFE content of the numerical sim-
Far more interesting is to compare the results between ge- ulations.
ometries. The HFE levels decay between 5.6 dB and 5.9 dB
for the realistic vocal tract depending on the phonation type, 5. Acknowledgements
which only manifests as an overall level difference of 1.2 dB
The authors are grateful to Saeed Dabbaghchian for the de-
and 1.4 dB. The higher order modes tend to reduce the levels in
sign of the vocal tract geometry simplifications. This re-
the HFE content. According to [11], minimum difference limen
search has been supported by the Agencia Estatal de Investi-
scores of about 1 dB are given for normal-hearing listeners in
gación (AEI) and FEDER, EU, through project GENIOVOX
the 8 kHz octave band, so one may hypothesise that the higher
TEC2016-81107-P. The fourth author acknowledges the sup-
order modes may be perceptually relevant. However, depend-
port from the Obra Social “La Caixa" under grant ref. 2018-
ing on the phonation type the HFE could be too small to notice
URL-IR1rQ-021.
any difference. This seems to be the case of the lax phonation
(Rd = 2.7), which gives HFE levels of 1.0 dB and 6.6 dB, de-
pending on the geometry. We may then conjecture, that for this 6. References
phonation type no differences in the outputs from the two ge- [1] B. H. Story, I. R. Titze, and E. A. Hoffman, “Vocal tract area
ometries will be perceived. In other words, we would not notice functions from magnetic resonance imaging,” The Journal of the
the influence of higher order modes. Acoustical Society of America, vol. 100, no. 1, pp. 537–554, 1996.

135
[2] B. H. Story, “Phrase-level speech simulation with an airway mod- [19] H. Takemoto, S. Adachi, P. Mokhtari, and T. Kitamura, “Acoustic
ulation model of speech production,” Comput. Speech Lang., interaction between the right and left piriform fossae in generating
vol. 27, no. 4, pp. 989–1010, 2013. spectral dips,” The Journal of the Acoustical Society of America,
[3] P. Birkholz, “Modeling consonant-vowel coarticulation for artic- vol. 134, no. 4, pp. 2955–2964, 2013.
ulatory speech synthesis,” PLoS ONE, vol. 8, no. 4, p. e60603, [20] H. Kawahara, K.-I. Sakakibara, H. Banno, M. Morise, T. Toda,
2013. and T. Irino, “A new cosine series antialiasing function and its
[4] S. Stone, M. Marxen, and P. Birkholz, “Construction and eval- application to aliasing-free glottal source models for speech and
uation of a parametric one-dimensional vocal tract model,” singing synthesis,” in INTERSPEECH, 2017, pp. 1358–1362.
IEEE/ACM Transactions on Audio Speech and Language Process-
ing, vol. 26, no. 8, pp. 1381–1392, 2018.
[5] R. Blandin, M. Arnela, R. Laboissière, X. Pelorson, O. Guasch,
A. V. Hirtum, and X. Laval, “Effects of higher order propagation
modes in vocal tract like geometries,” The Journal of the Acousti-
cal Society of America, vol. 137, no. 2, pp. 832–8, 2015.
[6] M. Arnela, S. Dabbaghchian, R. Blandin, O. Guasch, O. Engwall,
A. Van Hirtum, and X. Pelorson, “Influence of vocal tract geome-
try simplifications on the numerical simulation of vowel sounds,”
The Journal of the Acoustical Society of America, vol. 140, no. 3,
pp. 1707–1718, 2016.
[7] B. B. Monson, A. J. Lotto, and B. H. Story, “Gender and vocal
production mode discrimination using the high frequencies for
speech and singing,” Frontiers in Psychology, vol. 5, p. 1239,
2014.
[8] T. Vampola, J. Horáček, and J. G. Švec, “FE modeling of human
vocal tract acoustics. Part I: Production of Czech vowels,” Acta
Acust. united with Acustica, vol. 94, no. 5, pp. 433–447, 2008.
[9] H. Takemoto, P. Mokhtari, and T. Kitamura, “Acoustic analysis of
the vocal tract during vowel production by finite-difference time-
domain method,” The Journal of the Acoustical Society of Amer-
ica, vol. 128, no. 6, pp. 3724–3738, 2010.
[10] M. Arnela, R. Blandin, S. Dabbaghchian, O. Guasch, F. Alías,
X. Pelorson, A. Van Hirtum, and O. Engwall, “Influence of lips on
the production of vowels based on finite element simulations and
experiments,” The Journal of the Acoustical Society of America,
vol. 139, no. 5, pp. 2852–2859, 2016.
[11] B. B. Monson, A. J. Lotto, and S. Ternström, “Detection of
high-frequency energy changes in sustained vowels produced by
singers,” The Journal of the Acoustical Society of America, vol.
129, no. 4, pp. 2263–2268, 2011.
[12] G. Fant, J. Liljencrants, and Q. Lin, “A four-parameter model
of glottal flow,” Speech Transmission Laboratory Quarterly
Progress and Status Report (STL-QPSR), vol. 26, no. 4, pp. 1–
13, 1985.
[13] T. Murtola, P. Alku, J. Malinen, and A. Geneid, “Parameteriza-
tion of a computational physical model for glottal flow using in-
verse filtering and high-speed videoendoscopy,” Speech Commu-
nication, vol. 96, pp. 67–80, 2018.
[14] B. D. Erath, M. Zañartu, K. C. Stewart, M. W. Plesniak, D. E.
Sommer, and S. D. Peterson, “A review of lumped-element mod-
els of voiced speech,” Speech Communication, pp. 667–690,
2013.
[15] A. Murphy, I. Yanushevskaya, A. N. Chasaide, and C. Gobl,
“Rd as a Control Parameter to Explore Affective Correlates of
the Tense-Lax Continuum,” in INTERSPEECH, 2017, pp. 3916–
3920.
[16] G. Fant, “The LF-model revisited. Transformations and frequency
domain analysis,” Speech Transmission Laboratory Quarterly
Progress and Status Report (STL-QPSR), vol. 36, no. 2-3, pp.
119–156, 1995.
[17] M. Arnela and O. Guasch, “Finite element computation of ellip-
tical vocal tract impedances using the two-microphone transfer
function method,” The Journal of the Acoustical Society of Amer-
ica, vol. 133, no. 6, pp. 4197–4209, 2013.
[18] D. Aalto, O. Aaltonen, R.-P. Happonen, P. Jääsaari, A. Kivelä,
J. Kuortti, J.-M. Luukinen, J. Malinen, T. Murtola, R. Parkkola,
J. Saunavaara, T. Soukka, and M. Vainio, “Large scale data ac-
quisition of simultaneous MRI and speech,” Applied Acoustics,
vol. 83, pp. 64–75, 2014.

136
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Exploring Advances in Real-time MRI for studies of European Portuguese


Conceição Cunha1 , Samuel Silva2 , António Teixeira2 , Catarina Oliveira3 , Paula Martins4 ,
Arun Joseph5 , Jens Frahm5

IPS, Ludwig-Maximilians-Universität, Germany


1

DETI/IEETA, Universidade de Aveiro, Portugal


2
3
ESSUA/IEETA, Universidade de Aveiro, Portugal
4
ESSUA/IBIMED/IEETA, Universidade de Aveiro, Portugal
5
Max Plank Institute for Biophysical Chemistry, Germany
cunha@phonetik.uni-muenchen.de, sss@ua.pt, ajst@ua.pt, coliveira@ua.pt, pmartins@ua.pt,
arun-antony.joseph@mpibpc.mpg.de, jfrahm@gwdg.de

Abstract this application was already successful in providing for the first
time information about the entire tongue contour instead of
Recent advances in real-time magnetic resonance imaging some points, as in EMA and it updated significantly the infor-
(RT-MRI) for speech studies, providing a considerable increase mation about the velum. The increase of temporal resolution,
in time resolution, potentially improve our ability to study the updates the quality of the analysis and allows the description
dynamic aspects of speech production. To take advantage of the of a wider set of sounds, including faster sounds as thrills and
sheer amount of the resulting data, automated methods can be more complex sounds as diphthongs and nasal vowels. The
used to select, process, and analyze the data, and previous work significant increase of the number of participants is crucial to
could tackle these challenges for an European Portuguese (EP) take apart individual from languages specific characteristics.
corpus acquired at 14 frames per second (fps). The state-of-the-art as summarized in [2] includes currently (1)
Aiming to further explore RT-MRI in the study of the dy- frame rates that already surpass 100 Hz; (2) databases for sev-
namic characteristics of EP sounds, e.g., nasal vowels and diph- eral languages were recorded at high frame rates (English [4],
thongs, we present a novel 50 fps RT-MRI corpus and assess German [5], French [5]); (3) use of a wide variety of RT-MRI
the applicability, in this new context, of our previous propos- analysis techniques, that can be classified in four classes [2]:
als for processing and analyzing these data to extract relevant basis decomposition or matrix factorization techniques at the
articulatory information. Importantly, at this stage, we were level of the raw or processed images, pixel or region-of-interest
interested in assessing if and to what extent the new data and (ROI) based, grid-based, and contour-based; (4) initial studies
the proposed methods are able to support and corroborate the regarding the dynamics of articulators and gestural timing rela-
articulatory analysis obtained from the previous corpus. Over- tionships.
all, and although this new corpus poses novel challenges, it was
Despite all these evolutions, particularly in extraction of in-
possible to process and analyze the 50 fps data. A comparison
formation from the images (e.g., [5, 6]), the number of studies
of automated analysis performed for the same sounds, for both
going beyond the extraction of contour or articulators is very
corpora (i.e., 14 fps and 50 fps), yields similar results, corrobo-
scarce. A representative example of working in the facilitation
rating previous results and demonstrating the envisaged replica-
of high level analysis is [7]. As frame rates rise, more work is
bility. Moreover, we updated the processing in order to be able
needed in these frameworks for quantitative systematic analy-
to analyze dynamic information and provide first insights on the
sis, that will be essential to make possible exploring the highly
temporal organization of complex sounds such as nasal vowels
increased amount of images.
and diphthongs.
Index Terms: speech production, dynamic, real-time magnetic Several studies used MRI for studying EP, as summarized
resonance, European Portuguese, processing and analysis. in Section 2. Compared to the state-of-the-art [2], these stud-
ies used a low frame rate, possibly providing not enough in-
formation to the adequate characterization of the investigated
1. Introduction sounds. Further limitations are: (1) the scarce amount of pub-
A major challenge in modern linguistics is to understand how lications and missing analysis of some sounds, namely rhotics
continuous and dynamic speech movements are related to per- and diphthongs; (2) preliminary character of the approaches to
ceptual categories like consonants and vowels. Physiological other sounds such as nasal vowels, due to the limited frame rate
studies have shown that phonological segments can be defined used and the amount of participants.
based on the synchronization of primary articulators in speech The main objectives of this paper are the following: (1)
(the lips, tongue, soft-palate, jaw, and vocal folds) with each adding Portuguese to the small set of languages studied using
other in time [1] using methods such as Eletromagnetic Articu- the last evolutions in RT-MRI; (2) contribute to a better under-
lography (EMA) and Magnetic Resonance Imaging (MRI). standing of the dynamical aspects in the production of speech
Many advances in real-time magnetic resonance imaging sounds; (3) assess previous results obtained with lower tempo-
(RT-MRI) resolutions (spatial and temporal) have been driven ral resolution; (4) profit from the improved temporal resolution
by the need to investigate phonetic and phonological phenom- to start covering sounds characterized by their dynamic nature
ena [2], such as vowel nasalization in Portuguese [3]. which could not be contemplated in previous studies, such as
In the beginning of RT-MRI application to the study of the diphthongs.
speech production the temporal resolution was quite low, but Even if essential for these objectives, the acquisition of

137 10.21437/IberSPEECH.2018-29
the novel data for new classes of sounds and more complex gestures in Portuguese [22, 21].
phonetic contexts with higher temporal resolution poses sev- Despite all these relevant contributions, the sounds of Por-
eral challenges on how to process and analyze the resulting tuguese are not yet well described: diphthongs and trills still
database. Our team has previously addressed this challenge, missing, nasal vowels need an improved characterization due to
for different data [8, 9] by proposing methods to process [6] the lower frame rate and the small amount of participants.
and analyze [7, 10] the image sequences to extract articulatory From the perspective of processing and quantitative infor-
data. However, the nature of the novel database raises several mation extraction a lot has been achieved recently for EP ex-
new questions, opportunities and challenges, including the ap- ploring, essentially, automatic techniques to deal with the quan-
plicability of previous analyses. tity and diversity of the information obtained, both from 3D
The remainder of this article is organized as follows: Sec- static [18] and real-time MRI [20, 6, 7, 10], However, the
tion 2 provides a short summary of related work in speech pro- problem of information extraction for speech production stud-
duction studies using RT-MRI and other techniques, focusing ies and development of relevant articulation models is far from
in studies addressing EP; Section 3 presents information re- solved. The consideration of these imaging technologies poses
garding a novel RT-MRI database acquired for EP, from corpus challenges to extract articulatory-relevant information profiting
description to the methods considered for acquisition, process- from the full range of available data.
ing and analysis of the image sequences; in Section 4, illustra-
tive results are presented, including novel information regarding
diphthongs production; finally, the article concludes with a brief 3. Methods
summary of the contributions and ideas for future work. Gathering novel insights on the dynamic nature of complex
sounds, for EP, taking advantage of recent advances in real-time
2. Related Work MRI, involves several steps from corpus definition to articula-
tory analysis, as described in what follows.
Nowadays, speech production studies can be supported by
a wide range of technologies including imaging modalities
3.1. Corpus
(e.g., Ultrasound, MRI) and other instrumental techniques (e.g.,
EMA [11, 12]). In this context, real-time magnetic resonance The corpus consists of minimal pairs containing all stressed oral
imaging has been received particular relevance due to its non- [i, e, E, a, O, o, u] and nasal vowels [5̃, ẽ, ĩ, õ, ũ] in one and two
invasiveness, non-use of ionizing radiation, and the remarkable syllable words. Nasal diphthongs /5̃w, ẽj, 5̃j/ and the oral coun-
improvements in spatio-temporal resolution achieved in the last terparts /aw, aj, oj/ as well as /ej, ew, iw, ow, uj/ in mono-
few years [13, 2]. syllabic words were also included. Additional materials were
From the first attempts to acquire real-time imaging with recorded for further modeling of variability in the production of
MRI to date, much has been evolved in the area, achieving sam- nasality.
pling rates close to those obtained with EMA and Ultrasonogra- All words were randomized and repeated in two prosodic
phy. This has been possible by exploiting several technological conditions embedded in one of three carrier sentences alternat-
advancements that involve the use of high field strengths, more ing the verb as follows (Diga ’Say’—ouvi ’I heard’— leio ’i
powerful gradients, dedicated coils, non-Cartesian K-space tra- read’) as in ‘Diga pote, diga pote baixinho’ (’Say pot, Say pot
jectories, high degree of undersampled data and more efficient gently’). So far, this corpus has been recorded from twelve na-
image reconstruction algorithms allowing for high sampling tive speakers (8m, 4f) of EP. The tokens were presented from a
rates and improved image quality [5, 14]. timed slide presentation with blocks of 13 stimuli each. The sin-
For European Portuguese (EP), several of these techniques gle stimulus could be seen for 3 seconds and there was a pause
(e.g., EMA and MRI) have been used. These efforts include of about 60 seconds after each block of 13. The first three par-
not only data acquisition but also data analysis. Early stud- ticipants read 7 blocks in a total of 91 stimuli and the remaining
ies addressed dynamic aspects of nasal vowel production using nine participants had 9 blocks of 13 stimuli (total of 117 tokens).
EMA [15, 16, 17] and MRI (e.g. [9]
Internationally, several research groups have been using 3.2. RT-MRI Acquisition
MRI to gather information for different languages using differ-
ent approaches. Comprehensive reviews of these studies are RT-MRI recordings, as exemplified in Fig. 1, were conducted at
summarized in [13, 2, 7]. the Max Planck Institute for biophysical Chemistry, Göttingen,
The first MRI study for EP included 2D and 3D data re- Germany, using a 3 Tesla Siemens Magnetom Prisma Fit
garding static configuration of all EP vowels and consonants MRI System equipped with high performance gradients (Max
for one speaker [9]. A deeper study on EP laterals (3D) was ampl=80 mT/m; slew rate = 200 T/m/s).
conducted later with data from 7 participants [8, 18]. A study A standard 64–channel head coil was used with a mirror
of co-articulation resistance in EP was presented in [19]. First mounted on top of the coil. The speaker was lying down,
results of RT-MRI for EP were presented in 2012 [3]. At this in a comfortable position, and was instructed to read the re-
stage, a RT-MRI dataset which included nasal vowels, laterals, quired sentences. Real-time MRI measurements were based on
taps and trills was acquired with a frame rate of 14 fps. This a recently developed method, where highly under-sampled ra-
represented an important first step towards a better characteriza- dial FLASH acquisitions are combined with nonlinear inverse
tion of the dynamic aspects involved in the production of these reconstruction (NLINV) providing images at high spatial and
sounds of Portuguese [3]. The configuration of the vocal tract temporal resolutions [23]. Acquisitions were made at 50 fps,
during the production of nasal vs. oral vowels was investigated resulting in images as the ones presented in Fig. 1. Speech
using RT-MRI in [20]. More recently, RT-MRI was used for was synchronously recorded using an optical microphone (Dual
studying the temporal coordination of oral articulators and the Channel-FOMRI, Optoacoustics, Or Yehuda, Israel), fixed on
velum during the production of nasal vowels [21], providing ad- the head coil, with the protective pop-screen placed directly
ditional support for the delayed coordination of oral and nasal against the speaker’s mouth.

138
d i g 5 u

Figure 1: Selected frames for multiple sounds occurring as in ”Diga vu” (Say ’vu’).

Before their enrollment on the study, all volunteers pro-


vided informed written consent and filled an MRI screening
form in agreement with institutional rules. None of the par-
ticipants had any known language, speech or hearing problems
and were compensated for their participation.

3.3. Speech annotation


The speech recordings were preprocessed to filter MRI acqui-
sition noise and the target segments manually delimited by the
first author using Praat [24, 25]. The produced annotation in-
formation was used in Matlab for extraction of relevant images 14 fps corpus 50 fps corpus
and synchronized speech signal.
Figure 2: Results obtained for the comparison among vowels
3.4. Processing and Analysis [a] and [i] for two different RT-MRI corpora: a previous 14 fps
acquisition and the current 50 fps corpus.
In line with the semi-automated methods described in Silva et
al. [7], twenty eight images were manually annotated to train
two active appearance models [26] for oral and nasal configu-
rations of the vocal tract. For the work presented in this arti-
cle, the RT-MRI sequences for one of the speakers (male, 39
yo) were processed to extract sequences of vocal tract contours.
Figure 1 illustrates the overall outcome of the processing stage
by presenting a few selected image frames from the sentence
’Diga vu’ (say vu) depicting the identified vocal tract contours,
with the different articulators shown in different colors and line
types, and the corresponding audio annotation.
The static and dynamic analysis and comparison among vo-
cal tract configurations was performed by adopting the frame-
work proposed in Silva et al. [10] providing objective normal-
ized analysis and visualization. Accordingly, vocal tract con-
figurations are compared for seven different regions: velum Figure 3: Changes to the different articulators along the pro-
(VEL), tongue dorsum (TD), tongue back (TB), tongue tip (TT), duction of [ẽ].
lip protrusion (LP), lip aperture (LA) and pharynx (Ph). For
each, the comparison yields a score, from 1 (no difference) to 0
(strong difference), which is represented over the unitary circle from the new 50 fps corpus. The visual representation, in Fig. 2,
(Fig. 2) or graph (Fig. 3) for static and dynamic analysis, respec- should be interpreted as the answer to the question: ”What hap-
tively. For the sake of brevity, the reader is forwarded to [10] pens when I move from [a] to [i]?”. As can be observed, the dif-
for additional details regarding the adopted analysis framework. ference polygon, inscribed in the unitary circle, has grossly the
same shape, for both corpora, and depicts the same notable dif-
4. Results ferences, namely (dots in yellow and red circular coronas): the
tongue dorsum moving up, and an advance of the tongue tip and
Considering our initial goals, to start exploring the new corpus tongue back. Overall, the similarity between the diagrams (re-
and assess the applicability of the previously proposed process- sults) obtained for both corpora corroborates the reproducibility
ing and analysis methods to this new context, we processed the of the analysis methods among datasets.
data for one of the male speakers. Our aim was to explore two For the dynamic analysis of nasal vowels, Fig. 3 shows, as
important aspects: confirm overall previous findings, as a proof an example, the evolution of the vocal tract configuration along
of replicability, particularly for oral and nasal vowels, and ob- the production of [ẽ]: it is essentially characterized by velum
tain a first insight over EP diphthongs. movement (noticeable by a trajectory entering in the yellow
As an example, Fig. 2 shows the comparison between [a] zone of the graph at around 30%). Of note is, also, the vari-
and [i] performed for the previous 14 fps corpus and using data ation for the tongue dorsum and tongue tip, observed for the

139
end of the vowel and a possible consequence of coarticulation
effects. For instance, the influence of the carrier sentence’s sec-
ond ’diga’ (say), following the token, as in ’Diga vem, diga . . . ’
(Say come, say . . . ).

4.1. EP Diphthongs
Given the exploratory nature of this first effort regarding diph-
thongs, to have a first grasp of what is happening, we started by
oral diphthongs. As an example, Fig. 4 shows results for [aw]

Figure 6: Comparison, over time, of ’mão’ ([m5̃w̃], ’hand’) with


’pão’ ([p5̃w̃], ’bread’).

differently, in the initial part, since it is open, from the begin-


ning, in ’mão’.

5. Conclusion
The major contribution of this paper is the presentation of a
novel RT-MRI database for EP recorded with a frame rate of
Figure 4: Variation, over time, of the articulators during the 50 Hz, contributing to augment the very reduced amount of
production of the EP oral diphthong [aw], as in ’pau’ (stick). languages with such a valuable resource for speech production
studies. This database, after its completion and pre-processing
will be partially made available for other researchers. Addition-
as in ’pau’ (stick). The diphthong is covered by 10 images (10
ally, we demonstrate the applicability of previously proposed
points in the graph), including the initial [p] and the diphthong
methods for segmentation and analysis, by illustrating previous
(around 200 ms). Note that these representations only present
findings for oral and nasal vowels and performing a first explo-
lines, in the graph, for those regions where, along production,
ration of EP diphthongs. The application of this methodology
changes fall into the yellow and red stripes. It is clear an abrupt
to more data and speakers will enable a detailed description of
change in lip aperture (LA) at 20-30 % of diphthong duration,
nasal sounds in European Portuguese and a better understanding
from [p] to [a], and a reduction, close to the end, for the [w].
of their implementation in production.
The nasal counterpart of this diphthong, [5̃w̃], as in ’pão’ The work presented here can still profit from several im-
(bread), is analyzed in Fig. 5, showing a gradual variation of lip provements and provides the grounds for exploring new routes
aperture (LA), similar to the one observed in paw. Additionally, of speech production studies in EP. Even though the image qual-
there is movement of the tongue back, as opposed to the oral ity is better than our previous 14 fps corpus, the different na-
counterpart, probably due to the need of adjustment in the nasal ture of the corpus, with a large number of dental, alveolar and
passage, which is hinted by the changes also noted at the velum. palatal contacts (in the support words and sentences, e.g., [t]
before [5̃w̃] as in ’sotão’ – attic) poses new challenges to vocal
tract segmentation, with a few segmentations still requiring a
final manual revision, an aspect to improve as new speakers are
included, by training better models [6].
Finally, now that the grounds for work have been estab-
lished, future developments should be propelled by address-
ing concrete hypotheses regarding EP nasals, such as the one
of delayed coordination of oral and nasal gestures in Por-
tuguese [27, 21, 22].

6. Acknowledgements
This work is partially funded by the project ’Synchrone
Variabilität und Lautwandel im Europäischen Portugiesisch’,
with funds from the German Federal Ministry of Education
Figure 5: Variation, over time, of the articulators during the and Research, by IEETA Research Unit funding (UID/CEC-
production of the EP oral diphthong [5̃w̃], as in ’pão’ (bread). /00127/2013), by Portugal 2020 under the Competitiveness
and Internationalization Operational Program, and the Eu-
ropean Regional Development Fund through project SOCA
Our corpus and methods also enable investigating the com-
– Smart Open Campus (CENTRO-01-0145-FEDER-000010)
plex context of a diphthong after a nasal consonant, and, for
and project MEMNON (POCI-01-0145-FEDER-028976). We
instance, in comparison with other diphthongs. In Fig. 6, we
thank Philip Hoole for the scripts for noise supression and all
compare the production of ’mão’ ([m5̃w̃], ’hand’) with ’pão’
the participants of the experiment for their time and voice.
([p5̃w̃], ’bread’), showing that, as expected, the velum behaves

140
7. References [19] A. Teixeira, P. Martins, A. Silva, and C. Oliveira., “An MRI study
of consonantal coarticulation resistance in portuguese,” in Inter-
[1] L. Goldstein and M. Pouplier, “The temporal organization of national Seminar on Speech Production (ISSP’11), Montrreal,
speech,” The Oxford handbook of language production, p. 210, Canada, Jun. 2011.
2014.
[20] S. Silva, A. Teixeira, C. Oliveira, and P. Martins, “Segmenta-
[2] V. Ramanarayanan, S. Tilsen, M. Proctor, J. Töger, L. Goldstein, tion and analysis of vocal tract from midsagittal real-time mri,”
K. S. Nayak, and S. Narayanan, “Analysis of speech production in International Conference Image Analysis and Recognition.
real-time MRI,” Computer Speech & Language, 2018. Springer, 2013, pp. 459–466.
[3] A. Teixeira, P. Martins, C. Oliveira, C. Ferreira, A. Silva, and [21] A. R. Meireles, L. Goldstein, R. Blaylock, and S. Narayanan,
R. Shosted, “Real-time MRI for portuguese,” in International “Gestural coordination of brazilian portugese nasal vowels in CV
Conference on Computational Processing of the Portuguese Lan- syllables: A real-time MRI study.” in ICPhS, 2015.
guage. Springer, 2012, pp. 306–317.
[22] C. Oliveira, “Do grafema ao gesto. contributos linguı́sticos para
[4] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert,
um sistema de sı́ntese de base articulatória,” Ph.D. dissertation,
J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al.,
Universidade de Aveiro, 2009.
“Real-time magnetic resonance imaging and electromagnetic ar-
ticulography database for speech production research (tc),” The [23] M. Uecker, S. Zhang, D. Voit, A. Karaus, K.-D. Merboldt, and
Journal of the Acoustical Society of America, vol. 136, no. 3, pp. J. Frahm, “Real-time mri at a resolution of 20 ms,” NMR in
1307–1311, 2014. Biomedicine, vol. 23, no. 8, pp. 986–994, 2010.
[5] M. Labrunie, P. Badin, D. Voit, A. A. Joseph, J. Frahm, [24] P. Boersma, “Praat, a system for doing phonetics by computer,”
L. Lamalle, C. Vilain, and L.-J. Boë, “Automatic segmentation Glot international, vol. 5, no. 9/10, pp. 341–347, 2001.
of speech articulators from real-time midsagittal MRI based on [25] P. Boersma and D. Weenink, “Praat: doing phonetics by com-
supervised learning,” Speech Communication, vol. 99, pp. 27 – puter [computer program], version 6.0.40,” 2018, retrieved 11
46, 2018. May 2018 from http://www.praat.org/.
[6] S. Silva and A. Teixeira, “Unsupervised segmentation of the vo- [26] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Ac-
cal tract from real-time MRI sequences,” Computer Speech and tive shape models-their training and application,” Computer vision
Language, vol. 33, no. 1, pp. 25–46, Sep. 2015. and image understanding, vol. 61, no. 1, pp. 38–59, 1995.
[7] ——, “Quantitative systematic analysis of vocal tract data,” Com- [27] C. Cunha, S. Silva, A. Teixeira, C. Oliveira, P. Martins, ,
puter Speech & Language, vol. 36, pp. 307 – 329, 2016. A. Joseph, and J. Frahm, “Analysis of nasal vowels and diph-
[8] A. Teixeira, P. Martins, C. Oliveira, and A. Silva, “Production and thongs in european portuguese,” in Workshop New Developments
modeling of the european portuguese palatal lateral.” in Compu- in Speech Sensing and Imaging (Labphon satellite event), Lisbon,
tational Processing of the Portuguese Language, PROPOR 2012, Jun. 2018.
Lecture Notes in Computer Science/LNAI, Vol. 7243, 2012.
[9] P. Martins, I. Carbone, A. Pinto, A. Silva, and A. Teixeira, “Eu-
ropean Portuguese MRI based speech production studies,” Speech
Communication, vol. 50, no. 11, pp. 925–952, 2008.
[10] S. Silva and A. J. S. Teixeira, “Critical articulators identification
from RT-MRI of the vocal tract,” in Proc. Interspeech, Stockholm,
Sweden, 2017, pp. 626–630.
[11] J. S. Perkell, M. H. Cohen, M. A. Svirsky, M. L. Matthies, I. Gara-
bieta, and M. T. Jackson, “Electromagnetic midsagittal articu-
lometer systems for transducing speech articulatory movements,”
The Journal of the Acoustical Society of America, vol. 92, no. 6,
pp. 3078–3096, 1992.
[12] P. Hoole and N. Nguyen, “Electromagnetic articulography,”
Coarticulation–Theory, Data and Techniques, Cambridge Studies
in Speech Science and Communication, pp. 260–269, 1999.
[13] A. D. Scott, M. Wylezinska, M. J. Birch, and M. E. Miquel,
“Speech MRI: Morphology and function,” Physica Medica,
vol. 30, no. 6, pp. 604 – 618, 2014.
[14] J. Frahm, S. Schätz, M. Untenberger, S. Zhang, D. Voit, K. D.
Merboldt, J. M. Sohns, J. Lotz, and M. Uecker, “On the temporal
fidelity of nonlinear inverse reconstructions for real-time MRI–the
motion challenge.” The Open Medical Imaging Journal, vol. 8, pp.
1–7, 2014.
[15] A. Teixeira and F. Vaz, “European Portuguese Nasal Vowels: An
EMMA study,” in 7th European Conference on Speech Communi-
cation and Technology, EuroSpeech - Scandinavia, vol. 2. Aal-
borg, Dinamarca: CPK/ISCA, Sep. 2001, pp. 1843–1846.
[16] C. Oliveira and A. Teixeira, “On gestures timing in european por-
tuguese nasals,” in ICPhS, 2007, pp. p. 405 – 408.
[17] S. Rossato, A. Teixeira, and L. Ferreira, “Les nasales du Portugais
et du Français : une étude comparative sur les données EMMA,”
in JEP’2006, Rennes, França, 2006.
[18] P. Martins, C. Oliveira, C. Ferreira, A. Silva, and A. Teixeira, “3D
MRI and semi-automatic segmentation techniques applied to the
study of european portuguese lateral sound,” in International Sem-
inar on Speech Production (ISSP’11), Montreal, Jun. 2011.

141
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

A postfiltering approach for dual-microphone smartphones

Juan M. Martı́n-Doñas1 , Iván López-Espejo2 , Angel M. Gomez1 , Antonio M. Peinado1


1
Dept. de Teorı́a de la Señal, Telemática y Comunicaciones, Universidad de Granada, Spain
2
VeriDas | das-Nano, Spain
{mdjuamart,amgg,amp}@ugr.es, ilopez@das-nano.com

Abstract took into account the noise reduction provided by the beam-
former to obtain more accurate statistics. That work also ex-
Although beamforming is a powerful tool for microphone plores non-linear postfilters based on minimum mean square
array speech enhancement, its performance with small arrays, error (MMSE) estimation of the speech amplitude. Gannot et
such as the case of a dual-microphone smartphone, is quite lim- al. [11] modified the beamformer using a generalized sidelobe
ited. The goal of this paper is to study different postfiltering ap- canceler (GSC) structure and an optimal modified log-spectral
proaches that allow for further noise reduction. These postfilters amplitude (OMLSA) estimator [12] as a postfilter. Apart from
are applied to our previously proposed extended Kalman filter the above, a statistical analysis of dual-channel postfilters in
framework for relative transfer function estimation in the con- isotropic noise fields is presented in [13].
text of minimum variance distortionless response beamforming. We study two different postfilters based on Wiener filtering and non-linear estimation of the speech amplitude. We also propose several estimators of the clean speech power spectral density which exploit the speaker position with respect to the device. The proposals are evaluated when applying speech enhancement on a dual-microphone smartphone in different noisy acoustic environments, in terms of both perceptual quality and speech intelligibility. Experimental results show that our proposals achieve further noise reduction in comparison with other related approaches from the literature.
Index Terms: Postfiltering, Extended Kalman filter, Minimum variance distortionless response, Dual-microphone speech, Smartphone

1. Introduction

Multi-channel speech processing is widely employed in devices with several microphones to improve the noise reduction performance, yielding better speech quality and/or intelligibility. The most common techniques for multi-channel processing are the beamforming algorithms, which apply a spatial filtering to the existing sound field [1, 2, 3]. Nevertheless, the performance of beamforming can be insufficient if a small number of microphones is considered, as in the case of dual-microphone smartphones [4, 5]. Thus, to improve the noise reduction performance, the enhanced signal at the output of the beamformer might be further processed by using postfiltering methods [6].

Several postfilters have been proposed in recent years, mainly based on multi-channel Wiener filtering, which can be decomposed into minimum variance distortionless response (MVDR) beamforming plus single-channel Wiener filtering [3]. For multi-channel Wiener filtering, Zelinski [7] assumed a spatially-white noise to estimate the speech and noise statistics. Marro et al. [8] further improved the postfilter architecture and also considered acoustic echo and reverberation. The previous spatially-white noise assumption was substituted in [9] by assuming a diffuse noise field. The postfilter presented in [10]

In the case of dual-microphone smartphones, other works have exploited the information of the secondary microphone to enhance the speech signal from the reference microphone. For example, Jeub et al. [14] proposed an estimator of the noise power spectral density (PSD) along with a modified single-channel Wiener filter that explicitly exploits the power level difference (PLD) of the speech signal between the microphones in close-talk (CT) conditions (when the loudspeaker of the smartphone is placed at the ear of the user). Nelke et al. [15] developed an alternative noise PSD estimator in far-talk (FT) conditions (when the user holds the device at a distance from her/his face). This method combines a single-channel speech presence probability estimator and the coherence properties of the dual-channel target signal and background noise. The noise PSD is employed to estimate the gain function to be applied on the reference channel. Such an algorithm was extended for multi-microphone devices in [16]. Nevertheless, all of these techniques make assumptions about the noise field properties that may not be accurate in practice, thereby leading to a limited performance.

Recently, we proposed an estimator of the relative transfer function (RTF) between microphones based on an extended Kalman filter (eKF) framework [17]. Our method is capable of tracking the RTF evolution using prior information on the channel and noise statistics without making any assumption on the clean speech signal statistics. In that work we evaluated the performance of our estimator over MVDR beamforming applied on a dual-microphone smartphone configuration in CT and FT conditions, showing improvements in estimation accuracy with respect to other relevant methods from the literature. Despite this, the speech enhancement performance is still limited as a result of using beamforming with only two microphones.

In this paper we evaluate the use of postfiltering techniques to overcome shortcomings of our eKF-based MVDR approach for dual-microphone smartphones. We compare different algorithms and modify them to adapt them to our eKF-based method. Also, we propose different clean speech PSD estimators and make use of the available information about the RTF and noise to obtain the needed statistics for postfiltering. Our proposals are evaluated on a dual-microphone smartphone under several noisy acoustic environments in CT and FT

This work has been supported by the Spanish MINECO/FEDER Project TEC2016-80141-P and the Spanish Ministry of Education through the National Program FPU (grant reference FPU15/04161).

conditions, achieving improvements in terms of noise reduction performance in comparison with other state-of-the-art approaches.

The remainder of this paper is organized as follows. In Section 2 we briefly revisit the eKF-based RTF estimation and its application to MVDR beamforming. Section 3 describes the proposed postfiltering approaches and clean speech PSD estimators for CT and FT conditions. Then, in Section 4 the experimental framework is presented along with our perceptual quality and speech intelligibility results. Finally, conclusions are summarized in Section 5.

2. Beamforming for dual-microphone smartphones

Before presenting the postfiltering approaches, it is worthwhile to review the RTF estimation and beamforming for dual-microphone smartphones that we proposed in [17]. First, we introduce the formulation proposed for the eKF-based RTF estimation. Next, we describe the MVDR beamforming approach for processing the dual-channel noisy speech signals using the noise statistics and RTF estimations.

2.1. Extended Kalman filter-based RTF estimation

Let us consider the following additive distortion model for the noisy speech signal in the short-time Fourier transform (STFT) domain,

    Y_m(f,t) = X_m(f,t) + N_m(f,t),    (1)

where Y_m(f,t), X_m(f,t) and N_m(f,t) represent, respectively, noisy speech, clean speech and noise STFT coefficients at the m-th microphone (m = 1, 2), f is the frequency bin and t the frame index. Using the relative transfer function (RTF) A_21(f,t) = X_2(f,t)/X_1(f,t) between both microphones, we can write the speech distortion model for the secondary microphone (m = 2) in terms of the reference microphone (m = 1) as

    Y_2(f,t) = A_21(f,t) (Y_1(f,t) - N_1(f,t)) + N_2(f,t).    (2)

We can also rewrite the previous complex variables as vectors stacking their real and imaginary parts, yielding y_m^(t), a_21^(t) and n_m^(t) (index f is omitted for clarity). For example, we define the noisy speech vector for the m-th microphone as

    y_m^(t) = [Re(Y_m(t)), Im(Y_m(t))]^T,    (3)

where [.]^T indicates transpose. Then, we set a dynamic model for a_21^(t) as follows,

    a_21^(t) = a_21^(t-1) + w^(t),    (4)

where w^(t) models the variability of the RTF between consecutive frames. Also, we redefine (2), using the previous vector notation, as

    y_2^(t) = h(a_21^(t), n_1^(t); y_1^(t)) + n_2^(t)
            = [C (y_1^(t) - n_1^(t)), D (y_1^(t) - n_1^(t))] a_21^(t) + n_2^(t),    (5)

where C = [[1, 0], [0, 1]] and D = [[0, -1], [1, 0]].

Assuming multivariate Gaussian variables and using the models in (4) and (5), in [17] we proposed an MMSE estimator of a_21^(t) using an extended Kalman filter (eKF) framework, which tracks the RTF using the observable noisy speech, estimated noise statistics and a priori information on the RTF statistics [17].

2.2. MVDR beamforming

Once the RTF is estimated, both noisy signals are combined using MVDR beamforming, whose weights can be expressed (omitting indices f and t for clarity) as [3]

    F = Φ_nn^{-1} d / (d^H Φ_nn^{-1} d),    (6)

where (.)^H indicates Hermitian transpose, d = [1, A_21]^T is the steering vector and Φ_nn is a noise spatial correlation matrix (i.e., obtained from N = [N_1, N_2]^T). We can also express the PSD of the noise at the output of the beamformer as [10]

    φ_v = (d^H Φ_nn^{-1} d)^{-1}.    (7)

Finally, the enhanced signal for the reference microphone is estimated as X̂_1,mvdr = F^H Y, with Y = [Y_1, Y_2]^T.
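For illustration, the per-bin computation of (6) and (7) can be sketched as follows in NumPy. This is not the authors' implementation; the function and variable names are ours, and the RTF estimate a21 is assumed to come from some external tracker such as the eKF outlined above.

    import numpy as np

    def mvdr_weights(a21, phi_nn):
        """MVDR weights (6) and output noise PSD (7) for one (f, t) bin.

        a21    : complex RTF estimate X2/X1 (e.g., from an eKF tracker)
        phi_nn : 2x2 complex noise spatial correlation matrix
        """
        d = np.array([1.0, a21], dtype=complex)       # steering vector
        phi_inv_d = np.linalg.solve(phi_nn, d)        # Phi_nn^{-1} d
        denom = np.vdot(d, phi_inv_d).real            # d^H Phi_nn^{-1} d
        f = phi_inv_d / denom                         # Eq. (6)
        phi_v = 1.0 / denom                           # Eq. (7)
        return f, phi_v

    def apply_mvdr(f, y):
        """Enhanced reference-channel estimate F^H Y for one bin (y = [Y1, Y2])."""
        return np.vdot(f, y)                          # conjugate dot product = F^H Y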
3. Postfiltering for dual-microphone smartphones

The performance of MVDR beamforming when applied on smartphones with only two microphones is quite limited due to the reduced spatial information and the particular placement of the microphones [4]. Therefore, we analyze the use of postfiltering for enhancing the signal at the output of the beamformer and further improve the noise reduction performance. We propose two different postfilters based on Wiener filtering and the optimal modified log-spectral amplitude (OMLSA) estimator [12], and also address the estimation of the clean speech PSD at the reference microphone needed by the postfilters. In addition, the postfiltering gains are further processed using the musical noise reduction algorithm proposed in [18]. This post-processing is applied to frequencies above 1 kHz. For ease of notation, we drop the indices f and t henceforth.

3.1. Wiener filtering

The multi-channel Wiener filter can be decomposed into an MVDR beamformer followed by a single-channel Wiener filter defined as

    G_wf = ξ / (1 + ξ),    (8)

where ξ = φ_x1 / φ_v is the a priori signal-to-noise ratio (SNR) and φ_x1 is the clean speech PSD at the reference microphone. This Wiener filter is partially obtained from the enhanced signal at the output of the beamformer. Better performance can be achieved if the Wiener filter is fully calculated from the reference noisy signal when an overestimated noise is considered. Thus, we propose the following improved Wiener filter

    G_iwf = φ̂_x1 / (φ̂_x1 + µ φ̂_v),    (9)

where φ̂_x1 is an estimate of the clean speech PSD (discussed in Subsection 3.3), φ̂_v is an estimate of the noise PSD, taken as φ_n1 (first element of the diagonal of Φ_nn) in order to use an overestimated version of the noise, and µ is a factor which provides an increased overestimation. As a result, the clean speech signal is estimated as X̂_1,iwf = G_iwf X̂_1,mvdr.
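A minimal sketch of the improved Wiener gain (9), again not the authors' code and with a numerical floor added as our own safeguard:

    import numpy as np

    def improved_wiener_gain(phi_x1_hat, phi_n1, mu=4.0, floor=1e-12):
        """Improved Wiener postfilter gain of Eq. (9).

        phi_x1_hat : estimated clean speech PSD at the reference microphone
        phi_n1     : noise PSD at the reference microphone (first diagonal
                     element of Phi_nn), used as an overestimated noise PSD
        mu         : extra overestimation factor (the paper reports mu = 4)
        """
        return phi_x1_hat / np.maximum(phi_x1_hat + mu * phi_n1, floor)

    # The enhanced signal is then the MVDR output scaled by this gain:
    # x1_iwf = improved_wiener_gain(phi_x1_hat, phi_n1) * x1_mvdr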

3.2. Optimal modified log-spectral amplitude estimator

The OMLSA estimator proposed in [12] computes the postfilter gains as

    G_omlsa = G_H1^{p_SPP} G_H0^{1 - p_SPP},    (10)

where p_SPP is the speech presence probability, G_H0 is a constant gain when speech is absent and G_H1 is the gain when speech is present, computed as

    G_H1 = G_wf exp( (1/2) ∫_{γξ/(1+ξ)}^{∞} (e^{-t} / t) dt ),    (11)

where γ = |X̂_1,mvdr|^2 / φ_v is the a posteriori SNR and G_wf was defined in (8). We modify this gain by substituting G_wf by G_iwf, defined in (9), which yields the improved OMLSA gain G_iomlsa. Finally, the clean speech signal is estimated as X̂_1,iomlsa = G_iomlsa X̂_1,mvdr.
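As an illustrative sketch (not the authors' implementation), the integral in (11) is the exponential integral E1, available in SciPy, so the modified OMLSA gain can be written as:

    import numpy as np
    from scipy.special import exp1   # E1(v) = integral from v to inf of e^{-t}/t dt

    def omlsa_gain(g_iwf, gamma, xi, p_spp, g_h0=0.05, v_floor=1e-10):
        """Improved OMLSA gain: Eqs. (10)-(11) with G_wf replaced by G_iwf.

        g_iwf : improved Wiener gain of Eq. (9)
        gamma : a posteriori SNR |X1_mvdr|^2 / phi_v
        xi    : a priori SNR phi_x1 / phi_v
        p_spp : speech presence probability
        g_h0  : constant gain under speech absence (the paper uses 0.05)
        """
        v = np.maximum(gamma * xi / (1.0 + xi), v_floor)
        g_h1 = g_iwf * np.exp(0.5 * exp1(v))               # Eq. (11), modified
        return (g_h1 ** p_spp) * (g_h0 ** (1.0 - p_spp))   # Eq. (10)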
3.3. Clean speech PSD estimators

The previous postfilters require an estimation of the clean speech PSD, φ_x1. We propose two different estimators for CT and FT conditions, respectively, based on the noisy speech and noise statistics and the estimated RTF between microphones. Therefore, these estimators take advantage of the more accurate RTFs obtained by our eKF-based approach.

For close-talk (CT) conditions, the estimator is based on the PLD between microphones [14], which can be computed as

    ∆φ̂_PLD = max(φ_y1 - φ_y2, 0),    (12)

where φ_y1 and φ_y2 are the noisy speech PSDs at the reference and secondary microphones, respectively. This estimator takes advantage of the more attenuated clean speech component at the secondary microphone. Assuming that the noise PSD is similar at both microphones so that its difference can be neglected compared to ∆φ̂_PLD, it can be easily shown that the clean speech PSD can be approximated as [14]

    φ̂_x1^(CT) = ∆φ̂_PLD / (1 - |A_21|^2).    (13)

Unlike CT conditions, in far-talk (FT) conditions speech power is similar at both microphones and the previous assumptions are inaccurate (i.e., the noise PSD difference cannot be neglected compared to the PLD between microphones) [14]. Therefore, a better estimator is obtained by considering the distortionless properties of MVDR beamforming, which imply that the clean speech PSD at the reference microphone is the same as the one at the beamformer output. Thus, we estimate the clean speech PSD at both the beamformer output and the reference microphone as

    φ̂_x1^(FT) = F^H (Φ_yy - Φ_nn) F,    (14)

where Φ_yy is a noisy speech spatial correlation matrix (i.e., calculated from Y), whose diagonal elements are the noisy speech PSDs φ_y1 and φ_y2. Although the clean speech estimate could be obtained from a direct subtraction of the first diagonal elements of matrices Φ_yy and Φ_nn, the estimator defined in (14) has the advantage of using all the channels in the estimation.
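The two estimators, (12)-(13) for CT and (14) for FT, can be sketched as below. This is an illustrative rendering rather than the authors' code; the guards against division by zero and negative PSD estimates are our own additions.

    import numpy as np

    def clean_psd_ct(phi_y1, phi_y2, a21):
        """Close-talk estimator, Eqs. (12)-(13), based on the power level difference."""
        delta_pld = np.maximum(phi_y1 - phi_y2, 0.0)          # Eq. (12)
        return delta_pld / max(1.0 - abs(a21) ** 2, 1e-6)     # Eq. (13)

    def clean_psd_ft(f, phi_yy, phi_nn):
        """Far-talk estimator, Eq. (14): speech PSD at the beamformer output."""
        psd = np.vdot(f, (phi_yy - phi_nn) @ f).real          # F^H (Phi_yy - Phi_nn) F
        return max(psd, 0.0)                                   # clamp negative estimates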
4. Experimental evaluation

4.1. Experimental framework

To evaluate the proposed techniques, we simulated dual-channel noisy speech recordings on a Motorola Moto G smartphone. We consider two different modes of use: close-talk (CT) and far-talk (FT). These modes can be easily identified using the proximity sensor included in the smartphone. Clean speech signals were obtained from 18 speakers of the VCTK database [19] downsampled to 16 kHz. We simulated recordings at eight different noisy environments with different reverberations: car, street, babble, mall, bus, cafe, pedestrian street and bus station. The noise signals were added at six different SNR levels from -5 dB to 20 dB. Further details about this database can be found in [17].

For STFT computation, we choose a 25 ms square-root Hann window with 75% overlap. The noisy speech spatial correlation matrix Φ_yy is estimated by a first-order recursive averaging with an averaging constant of 0.9. The noise spatial correlation matrix Φ_nn is estimated by recursive averaging during time-frequency bins where speech is absent. Thus, we compute the speech presence probability p_SPP at each bin by using the Multi-Channel Speech Presence Probability (MC-SPP) noise tracking algorithm proposed in [20]. Finally, we use an overestimation factor µ = 4 and a speech absent gain G_H0 = 0.05 for postfiltering implementation.

4.2. Results

The two proposed postfilters, Wiener filtering (eKF-WF) and OMLSA estimator (eKF-OMLSA), are evaluated in combination with MVDR beamforming, both using the eKF-based RTF estimator outlined in Subsection 2.1. The obtained results are compared with those achieved by the noisy speech at the reference microphone, and MVDR beamforming with eKF and no postfiltering (eKF-MVDR) [17]. Also, we evaluate two other state-of-the-art enhancement algorithms for dual-microphone smartphones, that is, the PLD-based Wiener filtering for close-talk conditions of [14] and the speech presence probability and coherence-based (SPPC) Wiener filtering for far-talk position of [15]. The musical noise suppressor of [18] is also applied to PLD and SPPC gains.

The resulting enhanced signals are assessed in terms of perceptual quality and speech intelligibility by means of the perceptual evaluation of the speech quality (PESQ) [21] and short-time objective intelligibility (STOI) [22] metrics, respectively. Clean speech at the reference microphone is taken as a reference for these performance metrics. The results for close-talk (CT) and far-talk (FT) conditions are shown in Tables 1 and 2, respectively.

In close-talk conditions, it is shown that the proposed postfilters outperform the other methods in terms of perceptual quality, with both Wiener filtering and OMLSA approaches obtaining similar performance on average. While the Wiener filtering approach achieves better PESQ results at low and medium SNRs, the OMLSA approach yields higher PESQ scores at high SNRs. It can also be seen that PLD is a better choice than eKF-MVDR, but the addition of postfiltering after beamforming leads to better PESQ results. On the other hand, intelligibility scores among the different evaluated techniques are similar, but MVDR beamforming without postfiltering obtains slightly better ones. That means that the superior perceptual quality achieved by postfiltering involves some speech distortion that slightly reduces intelligibility. In general, PLD and Wiener

postfiltering have a similar performance, while OMLSA shows slightly worse results, especially at low SNRs. Thus, eKF-WF seems the preferred strategy for close-talk conditions on average.

Table 1: PESQ and STOI scores obtained for noisy and enhanced speech when using different dual-microphone enhancement techniques in close-talk (CT) conditions.

Metric  Method      SNR (dB)
                    -5      0       5       10      15      20      Avg.
PESQ    Noisy       1.10    1.12    1.18    1.31    1.52    1.80    1.34
        PLD         1.15    1.26    1.44    1.67    1.94    2.24    1.62
        eKF-MVDR    1.11    1.16    1.25    1.40    1.62    1.92    1.41
        eKF-WF      1.19    1.30    1.49    1.73    2.02    2.34    1.68
        eKF-OMLSA   1.19    1.29    1.46    1.72    2.03    2.36    1.68
STOI    Noisy       0.53    0.63    0.72    0.79    0.84    0.88    0.73
        PLD         0.54    0.64    0.73    0.80    0.85    0.89    0.74
        eKF-MVDR    0.56    0.65    0.74    0.80    0.85    0.89    0.75
        eKF-WF      0.53    0.63    0.72    0.80    0.85    0.89    0.74
        eKF-OMLSA   0.50    0.59    0.70    0.79    0.85    0.89    0.72

Table 2: PESQ and STOI scores obtained for noisy and enhanced speech when using different dual-microphone enhancement techniques in far-talk (FT) conditions.

Metric  Method      SNR (dB)
                    -5      0       5       10      15      20      Avg.
PESQ    Noisy       1.12    1.13    1.21    1.37    1.61    1.94    1.40
        SPPC        1.18    1.20    1.34    1.55    1.80    2.04    1.52
        eKF-MVDR    1.12    1.17    1.28    1.47    1.74    2.09    1.48
        eKF-WF      1.25    1.33    1.51    1.80    2.17    2.56    1.77
        eKF-OMLSA   1.24    1.34    1.53    1.80    2.14    2.49    1.76
STOI    Noisy       0.52    0.62    0.73    0.81    0.87    0.91    0.74
        SPPC        0.41    0.52    0.63    0.72    0.78    0.82    0.65
        eKF-MVDR    0.53    0.63    0.74    0.82    0.88    0.92    0.75
        eKF-WF      0.44    0.56    0.70    0.81    0.88    0.91    0.72
        eKF-OMLSA   0.49    0.59    0.71    0.81    0.87    0.91    0.73

Regarding far-talk conditions, likewise, the proposed postfilters obtain the best results in terms of perceptual quality, with Wiener filtering being the best strategy for noise reduction, especially at high SNRs. The SPPC method outperforms MVDR beamforming with no postfiltering, but it does not achieve any improvements compared to eKF-WF and eKF-OMLSA. Moreover, SPPC introduces more speech distortion, yielding a poor performance in terms of speech intelligibility. MVDR beamforming with no postfiltering achieves the best STOI scores, as in CT conditions. On the other hand, the postfiltering approaches obtain similar results on average, although their performance is worse at low SNRs. The comparison of both postfilters indicates that eKF-OMLSA achieves better intelligibility on average and, especially, at low SNRs. To sum up, both eKF-WF and eKF-OMLSA perform similarly, with Wiener filtering achieving best perceptual quality and OMLSA better speech intelligibility in FT conditions.

5. Conclusions

In this paper we have proposed a postfiltering approach to our RTF extended Kalman filter framework for dual-microphone smartphones. Our proposals make use of the more accurate estimated RTFs and noise statistics in order to obtain the gain function for noise reduction. We evaluated two different postfilters based on Wiener filtering and the OMLSA estimator, and also proposed different clean speech PSD estimators for CT and FT conditions in order to compute the needed statistics. The proposed approaches were evaluated in terms of perceptual quality and speech intelligibility when they are used for enhancing noisy speech signals from a dual-microphone smartphone in adverse acoustic environments. Our results show improvements in terms of both perceptual quality and noise reduction of the enhanced signal while low speech distortion is introduced in comparison to a standalone MVDR beamformer. As future work, we will extend this study on postfiltering to general multi-microphone devices through our extended Kalman filter approach.

6. References

[1] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Springer, 2008, vol. 1.
[2] K. Kumatani, J. McDonough, and B. Raj, "Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127-140, 2012.
[3] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 4, pp. 692-730, 2017.
[4] I. Tashev, S. Mihov, T. Gleghorn, and A. Acero, "Sound capture system and spatial filter for small devices," in Proc. Interspeech, 2008, pp. 435-438.
[5] I. López-Espejo, A. M. Peinado, A. M. Gomez, and J. A. González, "Dual-channel spectral weighting for robust speech recognition in mobile devices," Digital Signal Processing, vol. 75, pp. 13-24, 2018.
[6] M. Parchami, W. P. Zhu, B. Champagne, and E. Plourde, "Recent developments in speech enhancement in the short-time Fourier transform domain," IEEE Circuits and Systems Magazine, vol. 16, no. 3, pp. 45-77, 2016.
[7] R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms," in Proc. ICASSP, 1988, pp. 2578-2581.
[8] C. Marro, Y. Mahieux, and K. U. Simmer, "Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 3, pp. 240-259, 1998.
[9] I. A. McCowan and H. Bourlard, "Microphone array post-filter based on noise field coherence," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 709-716, 2003.
[10] S. Lefkimmiatis and P. Maragos, "A generalized estimation approach for linear and nonlinear microphone array post-filters," Speech Communication, vol. 49, no. 7-8, pp. 657-666, 2007.
[11] S. Gannot and I. Cohen, "Speech enhancement based on the general transfer function GSC and postfiltering," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 6, pp. 561-571, 2004.
[12] I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments," Signal Processing, vol. 81, no. 11, pp. 2403-2418, 2001.
[13] C. Zheng, H. Liu, R. Peng, and X. Li, "A statistical analysis of two-channel post-filter estimators in isotropic noise fields," IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 2, pp. 336-342, 2013.
[14] M. Jeub, C. Herglotz, C. Nelke, C. Beaugeant, and P. Vary, "Noise reduction for dual-microphone mobile phones exploiting power level differences," in Proc. ICASSP, 2012, pp. 1693-1696.

[15] C. M. Nelke, C. Beaugeant, and P. Vary, “Dual microphone noise
PSD estimation for mobile phones in hands-free position ex-
ploiting the coherence and speech presence probability,” in Proc.
ICASSP, 2013, pp. 7279–7283.
[16] W. Jin, M. J. Taghizadeh, K. Chen, and W. Xiao, “Multi-channel
noise reduction for hands-free voice communication on mobile
phones,” in Proc. ICASSP, 2017, pp. 506–510.
[17] J. M. Martín-Doñas, I. López-Espejo, A. M. Gomez, and A. M.
Peinado, “An extended Kalman filter for RTF estimation in dual-
microphone smartphones,” in Proc. Eusipco, 2018, pp. 2488–
2492.
[18] T. Esch and P. Vary, “Efficient musical noise suppression for
speech enhancement systems,” in Proc. ICASSP, 2009, pp. 4409–
4412.
[19] J. Yamagishi. (2012) English multi-speaker corpus
for CSTR voice cloning toolkit. [Online]. Available:
http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
[20] M. Souden, J. Benesty, S. Affes, and J. Chen, “An integrated solu-
tion for online multichannel noise tracking and reduction,” IEEE
Transactions on Audio, Speech and Language Processing, vol. 19,
no. 7, pp. 2159–2169, 2011.
[21] “P.862.2: Wideband extension to recommendation P.862 for the
assessment of wideband telephone networks and speech codec,”
ITU-T Std. P.862.2, 2007.
[22] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An al-
gorithm for intelligibility prediction of time-frequency weighted
noisy speech,” IEEE Transactions on Audio, Speech and Lan-
guage Processing, vol. 19, no. 7, pp. 2125–2136, 2011.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Speech and monophonic singing segmentation using pitch parameters


Xabier Sarasola, Eva Navas, David Tavarez, Luis Serrano, Ibon Saratxaga

AHOLAB University of the Basque Country (UPV/EHU), Bilbao, Spain


{xsarasola,eva,david,lserrano,ibon}@aholab.ehu.eus

Abstract

In this paper we present a novel method for automatic segmentation of speech and monophonic singing voice based only on two parameters derived from pitch: proportion of voiced segments and percentage of pitch labelled as a musical note. First, voice is located in audio files using a GMM-HMM based VAD and pitch is calculated. Using the pitch curve, automatic musical note labelling is made applying stable value sequence search. Then pitch features extracted from each voice island are classified with Support Vector Machines. Our corpus consists of recordings of live sung poetry sessions where audio files contain both singing and speech voices. The proposed system has been compared with other speech/singing discrimination systems with good results.
Index Terms: audio segmentation, voice discrimination, singing voice, pitch

1. Introduction

There have been many studies dealing with the analysis and development of technologies for speech. Lately there is also an interest in studying singing speech. Both areas are intimately related, but it is not possible to apply directly the same techniques to both problems. For instance, phone alignment systems with good performance for speech do not get good results for singing voice [1]. Also, automatic recognition systems degrade their performance when dealing with singing speech [2]. Despite this different behaviour with speech technologies, both types of voice share the same nature and therefore are very close. In fact, singing voice is considered an extension of common speech and sometimes it is not easy to distinguish them. Even for humans it is sometimes difficult to correctly discriminate speech from singing voices [3], [4]. Nevertheless, there is an agreement about some characteristics that behave differently between speech and singing [5], [6]:

- Voiced/unvoiced ratio is higher in singing voice, as vowels tend to be longer to accommodate to the duration of notes.
- Dynamic ranges of energy and F0 in singing voice are higher than in speech.
- Vibrato areas can appear in long sustained notes in singing voice. This phenomenon never occurs in speech.
- The pitch continuity is different in speech and singing voice. Pitch in singing voice is more similar to a series of discrete values, whereas in speech it is a continuously varying curve.

Two types of automatic speech/singing classification systems have been proposed. The first scheme consists of using short-time features of the signal, like for instance spectral information in the form of MFCCs, to classify it at frame level [7]. In [8], a wide range of short-time features are analysed for frame level classification of polyphonic music, singing voice and speech. The second approach uses long-term features for classification. In previous works, distribution of pitch values [9], pitch parameters [10] and note grammars [11], [12] are used. These long-term features can be calculated for the whole file or for a window that slides through the audio file. The former case corresponds to speech/singing classification and the latter to segmentation. For the segmentation problem, the results improve using long windows up to 1000 ms [13].

Bertsolaritza is a live improvisation poetry art from the Basque Country. In these live sessions a host introduces the singer and proposes the topic for the improvised verses that will be sung a capella. The singers are professional verse creators, but not professional singers. Many live sessions of bertsolaritza are recorded and saved together with the corresponding transcriptions by Bertso associations. All these recordings are very useful for analysis of the singing style and as training database for bertso synthesis. Each recording includes speech from the host, sung verses and overlapped applauses. Thus, a good segmentation system is required to isolate the segments of interest.

Our goal is to create a tool to automatically segment the bertsolaritza audio files to get the singing voice.

The rest of the paper is organized as follows. Section 2 explains the proposed segmentation system using only two parameters derived from pitch. Section 3 details the algorithm developed to assign a musical note label to each audio frame. Section 4 describes the results of the segmentation system and compares it to other speech/singing voice classification systems. Finally, Section 5 presents the conclusions of the work.

2. Proposed segmentation system

The main scheme of the proposed system can be seen in Figure 1. First, voice is detected in the audio files using a voice activity detection (VAD) algorithm based on Hidden Markov Models and Gaussian Mixture Models (GMM-HMM). Pitch contour is calculated and for each voiced segment pitch parameters are extracted. These parameters are classified as speech or singing by a previously trained Support Vector Machine (SVM).

2.1. GMM-HMM based VAD

The first stage of the proposed system divides the audio in three classes: voice, noise and silence. For this purpose we have used a GMM frame-level classifier smoothed with an HMM as used in [14]. We have considered 13 MFCC values and ∆ and ∆² values with 25 ms window and 10 ms frame period. GMMs are trained using Expectation Maximization (EM) for speech, noise and silence. Frame-level classification usually creates noisy segmentation because in the transition segments fast label changes are produced. To avoid this, we have used an ergodic HMM of three states, one per class. This HMM has predefined transition probabilities with preference for remaining in the current state. In our case, the transition probability of the HMM

outside the diagonal is 0.0001. This way, fast transitions and small discontinuities are removed. To classify the segments, the likelihood of observation provided by each model is calculated using expression (1).

    P(o | s_i) = Σ_{j=1}^{M} w_ij N(o | µ_ij, Σ_ij)    (1)

where o is the MFCC vector, w_ij, µ_ij, Σ_ij are the weight, mean and diagonal covariance of the component j of the state s_i and M is the number of Gaussian components.

[Figure 1: block diagram omitted - GMM-HMM VAD segmentation -> VOICE -> Pitch extraction & smoothing -> SMOOTHED PITCH CONTOUR -> Note detection -> NOTE LABELS -> Calculation of pitch parameters (PV & PN) -> Singing/speech SVM classifier -> SINGING / SPEECH]
Figure 1: Structure of the proposed speech/singing voice segmentation system
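A minimal sketch of this stage using scikit-learn is shown below. The paper does not specify how the HMM smoothing is decoded, so the Viterbi pass here is our assumption of one common way to realize it; class names and the feature pipeline are placeholders.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_class_gmms(feats_per_class, n_components=32):
        """One diagonal-covariance GMM per class (voice, noise, silence), trained with EM."""
        return {c: GaussianMixture(n_components, covariance_type="diag").fit(x)
                for c, x in feats_per_class.items()}

    def viterbi_smooth(loglik, p_switch=1e-4):
        """Smooth frame-level GMM scores (Eq. (1)) with an ergodic 3-state HMM.

        loglik : (T, 3) array of per-frame log-likelihoods, one column per class
        """
        T, S = loglik.shape
        trans = np.full((S, S), p_switch)
        np.fill_diagonal(trans, 1.0 - (S - 1) * p_switch)   # strong self-transitions
        log_a = np.log(trans)
        delta = loglik[0].copy()
        psi = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_a                 # (from_state, to_state)
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + loglik[t]
        path = np.empty(T, dtype=int)
        path[-1] = delta.argmax()
        for t in range(T - 1, 0, -1):
            path[t - 1] = psi[t, path[t]]
        return path

    # Usage sketch: loglik = np.stack([gmms[c].score_samples(mfcc) for c in
    # ("voice", "noise", "silence")], axis=1); labels = viterbi_smooth(loglik)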
2.2. Singing/speech classification

We have considered each audio segment labelled as voice by the GMM-HMM VAD as corresponding to a unique class: either speech or singing. We do not consider the option of having speech and singing in the same voice island because of the characteristics of our database. To evaluate a system that includes segmentation inside voice islands we would have to create artificial audio files mixing our data. Therefore the problem is reduced to a binary classification of each voice island. We propose the use of two parameters derived from pitch to do the classification: proportion of voiced segments (PV) and percentage of pitch labelled as a musical note (PN). The pitch curve has been calculated using the PRAAT autocorrelation method [15] with a frame period of 10 ms.

Voiced/unvoiced segments are obtained directly from the pitch curve and stable musical note segments are found using our algorithm explained in Section 3. The features for classification are calculated according to expressions (2) and (3).

    PV = N_VF / N_T    (2)

    PN = N_NF / N_VF    (3)

where N_VF is the total number of voiced frames, N_NF is the total number of frames labelled as a musical note and N_T is the total number of frames, all of them calculated within the segment to be classified. The vector containing these two parameters is classified using a SVM.
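For illustration only (not the authors' code), the two features and the SVM stage can be sketched as follows; the SVM hyperparameters are left at scikit-learn defaults since the paper does not report them.

    import numpy as np
    from sklearn.svm import SVC

    def segment_features(f0, note_labels):
        """PV (Eq. 2) and PN (Eq. 3) for one voice island.

        f0          : per-frame pitch values, 0 for unvoiced frames
        note_labels : per-frame booleans, True where a musical note was detected
        """
        f0 = np.asarray(f0)
        voiced = f0 > 0
        n_t = len(f0)                                       # total frames
        n_vf = int(voiced.sum())                            # voiced frames
        n_nf = int(np.asarray(note_labels)[voiced].sum())   # voiced frames in a note
        pv = n_vf / n_t if n_t else 0.0
        pn = n_nf / n_vf if n_vf else 0.0
        return np.array([pv, pn])

    # Training/classification sketch (hypothetical variable names):
    # X_train = np.vstack([segment_features(f0, nl) for f0, nl in train_islands])
    # clf = SVC().fit(X_train, y_train)
    # y_pred = clf.predict(np.vstack([segment_features(f0, nl) for f0, nl in test_islands]))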
3. Note detection algorithm

Our algorithm to label the pitch curve with musical notes discretises the F0 curve in semitones expressed in cents and then searches for sequences of semitones that fulfil two conditions: to have enough duration and less variation range than a threshold. This algorithm is simpler than state-of-the-art algorithms [16] but our lack of labelled data made us create a method with minimum supervision. First we map the F0 value to the cent scale with an offset to make all the possible values of f0c positive according to expression (4).

    f0c = 1200 log2(f0 / f_ref) + 5800    (4)

where f_ref is 440 Hz, the frequency of the A4 note.

To avoid possible instability due to vibrato, we apply a smoothing to the F0 curve. The smoothing consists of calculating the local maxima and minima envelopes and taking the average curve. The obtained smoothed pitch curve is rounded to the closest semitone value to discretise the sequence, as shown in Figure 2.

[Figure 2: plot omitted - F0 in cent scale and its semitone discretisation vs. time in seconds]
Figure 2: Discretisation of F0 curve

[Figure 3: plot omitted - a detected note (5500 cents) and the new split sequences SL and SR on either side, cents vs. time in seconds]
Figure 3: Detected musical note and new split sequences

Using the semitone discretised pitch curve we search for notes using subsequence search techniques [17], [18].

We search the non-overlapping group of subsequences that fulfil the predefined conditions of minimum length and maximum range in amplitude expressed in equations 5 and 6.

    Len(s) ≥ L    (5)

    max(s) - min(s) ≤ R    (6)

where s is the semitone subsequence, R is the maximum amplitude range and L is the minimum length.

The algorithm is defined in the next steps:

- Consider the whole pitch curve as a collection of K voiced sequences of contiguous semitones expressed in cents, each one with its own length, S = {S1, S2, ..., SK}.
- Define R as the maximum allowed variation range and L as the minimum length.
- Search the longest subsequence in each Si (1 ≤ i ≤ K) that fulfils the conditions.
- Label the longest subsequence found as a musical note. Between the possible semitones in the sequence, the most frequent one is selected as label.
- Split the remaining parts of the original sequence Si in two new sequences: the subsequence to the left of the note found (SLi) and the subsequence to the right of the note found (SRi), as shown in Figure 3.
- If any of the new generated subsequences fulfil the duration condition, Si in S is substituted by them and the process begins again.
- When all sequences from S have been analysed the process finishes.

In this work we have established a maximum range of 100 cents and a minimum length of 150 ms, a standard minimum value in Western music [19].
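The sketch below is our iterative rendering of the procedure above, under the stated settings (100 cents, 150 ms at a 10 ms frame period, i.e., 15 frames); it assumes the input is one voiced sequence already converted to cents with (4), smoothed, and rounded to the semitone grid (e.g., semis = 100 * np.round(cents / 100)). The envelope-based vibrato smoothing is omitted for brevity.

    import numpy as np

    def longest_stable_run(semis, max_range=100.0, min_len=15):
        """Longest contiguous run whose max-min <= max_range (two-pointer scan)."""
        best, start = None, 0
        for end in range(len(semis)):
            while semis[start:end + 1].max() - semis[start:end + 1].min() > max_range:
                start += 1
            if end - start + 1 >= min_len and (best is None or end - start > best[1] - best[0]):
                best = (start, end)
        return best

    def label_notes(semis, max_range=100.0, min_len=15):
        """Find the longest stable run, label it with its most frequent semitone,
        then split into left/right remainders and repeat."""
        notes, stack = [], [(0, len(semis))]
        while stack:
            lo, hi = stack.pop()
            if hi - lo < min_len:
                continue
            run = longest_stable_run(semis[lo:hi], max_range, min_len)
            if run is None:
                continue
            s, e = run[0] + lo, run[1] + lo
            vals, counts = np.unique(semis[s:e + 1], return_counts=True)
            notes.append((s, e, vals[counts.argmax()]))   # most frequent semitone as label
            stack.append((lo, s))                         # left remainder (SL)
            stack.append((e + 1, hi))                     # right remainder (SR)
        return sorted(notes)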
4. Experiments and results

4.1. Datasets

As few publicly available databases exist with speech and monophonic singing, we have used an excerpt of our Bertsolaritza database [20] to train the algorithms and the NUS Sung and Spoken Lyrics Corpus [21] to test them. In the Bertso database we manually labelled 20 audio files from 19 singers with a total duration of 60 minutes and 40 seconds. These audio files contain 32.8 minutes of singing voice and 2.87 minutes of speech. The 20 files were selected to cover the variability of the original Bertsolaritza database, considering recordings from different decades and gender. In the NUS database, each singer has recorded a singing and a spoken version of 4 songs. The total number of different songs is 20 and 12 singers have made the recordings, 6 males and 6 females. As each recording contains either speech or singing voice, we used the VAD to obtain the voice segments and labelled them with the type of the recording.

Table 1 shows the distribution of singers and hosts by gender in the Bertso database. In most cases the speakers either sing or act as host, but some hosts give the topic for the improvised verses singing as well. These hosts appear in the recordings both singing and speaking.

Table 1: Number of speakers and singers in Bertso database

         Singer   Host   Singer and host   Total
Female   7        6      2                 15
Male     12       6      2                 20
Total    19       12     4

In the Bertso database the audio files originally were in mp3. Both databases had 44100 Hz samplerate and have been downsampled to 16000 Hz and converted to Windows PCM files [1].

We analysed the Bertso database and the singing voice and speech segments never appear consecutively, i.e., there is always silence or noise between segments to classify. Therefore, considering the structure of both databases, each segment belongs only to a class. Additionally, we have studied the duration of the segments produced by speakers and singers: singing voice has longer durations than speech (mean duration of 3.69 and 1.51 seconds respectively in the Bertso database, and 3.87 and 1.92 seconds in the NUS database).

The distribution of the proposed classifying features PV and PN in the databases can be seen in Figures 4 and 5. Speech is more scattered than singing voice, but taking into account both parameters a good discrimination of both classes can be achieved.

[Figure 4: scatter plot omitted - PN vs. PV for the Speech and Singing Voice classes]
Figure 4: Distribution of the classes in the Bertso database

[Figure 5: scatter plot omitted - PN vs. PV for the Speech and Singing Voice classes]
Figure 5: Distribution of the classes in NUS database

For the experiments, we split the Bertso database in 10 subsets for cross-validation tests. All the partitions considered include different singers in the train and test subsets. The NUS database is classified using the algorithms trained with the Bertso database.

[1] Examples of signals contained in the Bertso database can be accessed at http://bdb.bertsozale.eus/en/web/bertsoa/view/1323t4.
mp3. Both databases had 44100 Hz samplerate and have been cessed at http://bdb.bertsozale.eus/en/web/bertsoa/view/1323t4.

149
4.2. Comparison with other methods

To compare our algorithm with other methods we tested them on our Bertso database and on the NUS database. We have selected methods that are suitable to work with segments of different duration, as is the case of the Bertso database. On the one hand, we have trained GMM classifiers with the parameters suggested in [12] (∆F0) and [13] (DFT of F0 distribution). On the other hand, we have also built a GMM classifier based on MFCC parameters. These methods are explained in more detail in the following subsections.

4.2.1. DFT of F0 distribution

As commented before, F0 provides very useful information to discriminate between speech and singing. The histogram of F0 gives information about the range and the distribution of F0 values, which for singing voice will be concentrated around the frequencies corresponding to musical notes. In [13] GMMs are used to model the DFT of the F0 distribution and detect the deviation of the instantaneous pitch value from its mean. F0 values in the Bertso database range from 75 to 500 Hz. A 100-bin histogram is calculated where each segment corresponds to approximately 0.027 octaves. The histogram is normalized to have unit area and then modelled using 8-component GMMs for singing voice and speech.

4.2.2. Delta F0

The dynamics of F0 are another feature to consider in speech and singing voice discrimination. In [12] the ∆F0 distribution of voiced segments is modelled with GMMs to discriminate speech and singing voice. We calculate the ∆F0 using a Savitzky-Golay filter [22] with a window of 50 ms. A histogram of 100 bins is made from -50 to 50 Hz. The distribution of ∆F0 in each voice segment is normalized to have unit area and then modelled with 16-component GMMs.
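A sketch of this baseline feature extraction is given below, assuming a 10 ms frame period (so a 50 ms window spans 5 frames). The polynomial order of the Savitzky-Golay filter and the per-frame derivative units are our assumptions, since the paper does not state them.

    import numpy as np
    from scipy.signal import savgol_filter
    from sklearn.mixture import GaussianMixture

    def delta_f0_histogram(f0, frame_period=0.01, win_s=0.05, n_bins=100):
        """Unit-area histogram of Delta-F0 for one voiced segment."""
        win = max(int(round(win_s / frame_period)) | 1, 3)   # odd window length in frames
        df0 = savgol_filter(f0, window_length=win, polyorder=2, deriv=1)
        hist, _ = np.histogram(df0, bins=n_bins, range=(-50, 50))
        hist = hist.astype(float)
        return hist / max(hist.sum(), 1.0)

    # One 16-component GMM per class can then be fitted on these histograms, e.g.:
    # gmm_sing = GaussianMixture(16, covariance_type="diag").fit(singing_histograms)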
4.2.3. MFCC GMM

In [12] short-term spectrum features are used, motivated by the presence of an additional resonance characteristic of singing speech addressed in [23]. We calculated 13 MFCC coefficients and their ∆ with a frame period of 10 ms and a window of 25 ms, applying a CMVN file-wise normalization. MFCC frames of speech and singing voice are modelled using 32-component GMMs. Each voice segment is assigned the class that gets the higher sum of log-likelihood over all the frames of the segment. We chose GMMs to model MFCC parameters because of the high dimensionality of the parameters.

4.3. Results of GMM-HMM VAD

The metric used to assess the VAD is the voice detection F-score defined as indicated in equation 7.

    F = 2TP / (2TP + FP + FN)    (7)

where TP is the duration of speech classified as speech, FP is the duration of non-speech classified as speech and FN is the duration of speech classified as non-speech.

Table 2 shows the results for different numbers of Gaussian components. All of them get good results, over 0.96, and the number of components does not affect the performance significantly. We have selected the VAD with 32 components for the classification experiments.

Table 2: Results of the GMM-HMM VAD for different number of Gaussian components

Gaussians   F-Score
2           0.965 +/- 0.005
4           0.967 +/- 0.006
8           0.969 +/- 0.007
16          0.972 +/- 0.008
32          0.973 +/- 0.008
64          0.974 +/- 0.008

4.4. Results of speech/singing classification

The metric in each test will be the unweighted F-score, defined as the average of the F-score for each of the classes. This metric is suitable when the classes are unbalanced, as is the case of the Bertso database. The results of the cross-validation classification experiments are shown in Table 3. The proposed method gets the best results and can compete with spectrum based methods. The methods that use pitch derived parameters such as ∆F0 and DFT of F0 distribution get poorer results due to the short duration of the segments to classify. Although in other works they have proved useful, they are not suitable for the characteristics of the Bertso database. The experiment in the NUS database shows similar results, proving the validity of our method with professional singers and a different style. The GMM method results got worse for the NUS database, probably because the audio files used to generate the models contain both speech and singing and the files in the NUS database belong to one class. Therefore the normalization process affects both databases differently.

Table 3: Results of speech/singing classification

Method     Precision        Recall           F-score
           Bert.   NUS      Bert.   NUS      Bert.   NUS
∆F0        0.78    0.76     0.83    0.74     0.80    0.74
DFT-F0     0.73    0.77     0.77    0.76     0.75    0.77
GMM        0.95    0.75     0.85    0.66     0.89    0.64
Proposed   0.91    0.89     0.93    0.89     0.92    0.89

5. Conclusions

An efficient singing/speech discrimination system that is suitable for classifying short segments has been developed. It uses only two parameters extracted from pitch to take the decision and classify each segment. As a by-product, a flexible algorithm to label musical notes has also been produced. Currently we are testing LSTMs to segment singing and speech, taking advantage of their capacity to classify sequences of different lengths.

6. Acknowledgements

This work has been partially supported by UPV/EHU (Ayudas para la Formación de Personal Investigador), the Spanish Ministry of Economy and Competitiveness with FEDER support (MINECO/FEDER, UE) (RESTORE project, TEC2015-67163-C2-1-R) and by the Basque Government under grant KK-2018/00014 (BerbaOla).

7. References

[1] A. Loscos, P. Cano, and J. Bonada, "Low-delay singing voice alignment to text," in ICMC, 1999.
[2] A. Mesaros and T. Virtanen, "Automatic recognition of lyrics in singing," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, pp. 1-11, 2010.
[3] D. Deutsch, R. Lapidis, and T. Henthorn, "The speech-to-song illusion," Acoustical Society of America Journal, vol. 124, pp. 2471-2471, 2008.
[4] S. Falk and T. Rathcke, “On the speech-to-song illusion: Evidence
from german,” in Speech Prosody 2010-Fifth International Con-
ference, 2010.
[5] S. R. Livingstone, K. Peck, and F. A. Russo, “Acoustic differences
in the speaking and singing voice,” Proceedings of Meetings on
Acoustics, vol. 19, no. 35080, 2013.
[6] J. Merrill and P. Larrouy-Maestri, “Vocal features of song and
speech: Insights from Schoenberg’s Pierrot lunaire,” Frontiers in
Psychology, vol. 8:1108, 2017.
[7] W. Chou and L. Gu, “Robust singing detection in speech/music
discriminator design,” in Acoustics, Speech, and Signal Process-
ing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International
Conference on, vol. 2. IEEE, 2001, pp. 865–868.
[8] B. Schuller, B. J. B. Schmitt, D. Arsić, S. Reiter, M. Lang, and
G. Rigoll, “Feature selection and stacking for robust discrimi-
nation of speech, monophonic singing, and polyphonic music,”
IEEE International Conference on Multimedia and Expo, ICME
2005, vol. 2005, pp. 840–843, 2005.
[9] D. Gerhard, “Pitch-based acoustic feature analysis for the discrim-
ination of speech and monophonic singing,” Canadian Acoustics,
vol. 30, no. 3, pp. 152–153, 2002.
[10] B. Schuller, G. Rigoll, and M. Lang, “Discrimination of
speech and monophonic singing in continuous audio streams
applying multi-layer support vector machines,” 2004 IEEE
International Conference on Multimedia and Expo (ICME)
(IEEE Cat. No.04TH8763), pp. 1655–1658, 2004. [Online].
Available: http://ieeexplore.ieee.org/document/1394569/
[11] W. H. Tsai and C. H. Ma, “Speech and singing discrimination
for audio data indexing,” Proceedings - 2014 IEEE International
Congress on Big Data, BigData Congress 2014, pp. 276–280,
2014.
[12] Y. Ohishi, M. Goto, K. Itou, and K. Takeda, “Discrimination
between Singing and Speaking Voices,” Interspeech, pp. 1141–
1144, 2005.
[13] B. Thompson, “Discrimination between singing and speech in
real-world audio,” 2014 IEEE Workshop on Spoken Language
Technology, SLT 2014 - Proceedings, pp. 407–412, 2014.
[14] T. Hain and P. C. Woodland, “Segmentation and classification of
broadcast news audio,” in Fifth International Conference on Spo-
ken Language Processing, 1998.
[15] P. Boersma, “Accurate short-term analysis of the fundamental fre-
quency and the harmonics-to-noise ratio of a sampled sound,”
in Proceedings of the institute of phonetic sciences, vol. 17, no.
1193. Amsterdam, 1993, pp. 97–110.
[16] M. Mauch, C. Cannam, R. Bittner, G. Fazekas, J. Salamon, J. Dai,
J. Bello, and S. Dixon, “Computer-aided melody note transcrip-
tion using the tony software: Accuracy and efficiency,” 2015.
[17] Y. S. Moon and J. Kem, “Fast normalization-transformed subse-
quence matching in time-series databases,” vol. E90-D, no. 12,
2007, pp. 2007–2018.
[18] O. K. Kostakis and A. G. Gionis, “Subsequence Search in Event-
Interval Sequences,” 2015, pp. 851–854.
[19] A. S. Bregman, Auditory scene analysis: The perceptual organi-
zation of sound. MIT press, 1994.
[20] X. Sarasola, E. Navas, D. Tavarez, D. Erro, I. Saratxaga, and I. Hernaez, "A singing voice database in Basque for statistical singing synthesis of bertsolaritza," in LREC, 2016.
[21] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Oct 2013, pp. 1-9.
[22] A. Savitzky and M. J. E. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Analytical Chemistry, vol. 36, no. 8, pp. 1627-1639, 1964.
[23] J. Sundberg, "The acoustics of the singing voice," Scientific American, vol. 236, no. 3, pp. 82-91, 1977.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Self-Attention Linguistic-Acoustic Decoder


Santiago Pascual 1, Antonio Bonafonte 1, Joan Serrà 2
1 Universitat Politècnica de Catalunya, Barcelona, Spain
2 Telefónica Research, Barcelona, Spain
santi.pascual@upc.edu

Abstract

The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, we try to overcome the limitations of recursive structure by using a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder network is competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU inference time. On average, it increases Mel cepstral distortion between 0.1 and 0.3 dB, but it is over an order of magnitude faster on average. Fast inference is important for the deployment of speech synthesis systems on devices with restricted resources, like mobile phones or embedded systems, where speaking virtual assistants are gaining importance.
Index Terms: self-attention, deep learning, acoustic model, speech synthesis, text-to-speech.

1. Introduction

Speech synthesis makes machines generate speech signals, and text-to-speech (TTS) conditions speech generation on input linguistic contents. Current TTS systems use statistical models like deep neural networks to map linguistic/prosodic features extracted from text to an acoustic representation. This acoustic representation typically comes from a vocoding process of the speech waveforms, and it is decoded into waveforms again at the output of the statistical model [1].

To build the linguistic to acoustic mapping, deep network TTS models make use of a two-stage structure [2]. The first stage predicts the number of frames (duration) of a phoneme to be synthesized with a duration model, whose inputs are linguistic and prosodic features extracted from text. In the second stage, the acoustic parameters of every frame are estimated by the so-called acoustic model. Here, linguistic input features are added to the phoneme duration predicted in the first stage. Different works use this design, outperforming previously existing statistical parametric speech synthesis systems with different variants in prosodic and linguistic features, as well as perceptual losses of different kinds in the acoustic mapping [3, 4, 5, 6, 7]. Since speech synthesis is a sequence generation problem, recurrent neural networks (RNNs) are a natural fit to this task. They have thus been used as deep architectures that effectively predict either prosodic features [8, 9] or duration and acoustic features [10, 11, 12, 13, 14]. Some of these works also investigate possible performance differences using different RNN cell types, like long short-term memory (LSTM) [15] or gated recurrent unit [16] modules.

In this work, we propose a new acoustic model, based on part of the Transformer network [17]. The original Transformer was designed as a sequence-to-sequence model for machine translation. Typically, in sequence-to-sequence problems, RNNs of some sort were applied to deal with the conversion between the two sequences [18, 19]. The Transformer substitutes these recurrent components by attention models and positioning codes that act like time stamps. In [17], they specifically introduce the self-attention mechanism, which can relate elements within a single sequence without an ordered processing (like RNNs) by using a compatibility function, and then the order is imposed by the positioning code. The main part we import from that work is the encoder, as we are dealing with a mapping between two sequences that have the same time resolution. We however call this part the decoder in our case, given that we are decoding linguistic contents into their acoustic codes. We empirically find that this Transformer network is as competitive as a recurrent architecture, but with faster inference/training times.

This paper is structured as follows. In section 2, we describe the self-attention linguistic-acoustic decoder (SALAD) we propose. Then, in section 3, we describe the followed experimental setup, specifying the data, the features, and the hyper-parameters chosen for the overall architecture. Finally, results and conclusions are shown and discussed in sections 4 and 5, respectively. The code for the proposed model and the baselines can be found in our public repository (1: https://github.com/santi-pdp/musa_tts).

2. Self-Attention Linguistic-Acoustic Decoder

To study the introduction of a Transformer network into a TTS system, we employ our previous multiple speaker adaptation (MUSA) framework [14, 20, 21]. This is a two-stage RNN model influenced by the work of Zen and Sak [13], in the sense that it uses unidirectional LSTMs to build the duration model and the acoustic model without the need of predicting dynamic acoustic features. A key difference between our works and [13] is the capacity to model many speakers and adapt the acoustic mapping among them with different output branches, as well as interpolating new voices out of their common representation. Nonetheless, for the current work, we did not use this multiple speaker capability and focused on just one speaker for the new architecture design on improving the acoustic model.

The design differences between the RNN and the transformer approaches are depicted in figure 1. In the MUSA framework with RNNs, we have a pre-projection fully-connected layer with a ReLU activation that reduces the sparsity of linguistic and prosodic features. This embeds the mixture xt of different input types into a common representation ht in the

form of one vector per time step t. Hence, the transformation R^L -> R^H is applied independently at each time step t as

    h_t = max(0, W x_t + b),

where W ∈ R^{H×L}, b ∈ R^H, x_t ∈ R^L, and h_t ∈ R^H. After this projection, we have the recurrent core formed by an LSTM layer of size H and an additional LSTM output layer. The MUSA-RNN output is recurrent, as this prompted better results than using dynamic features to smooth cepstral trajectories in time [13].

Based on the Transformer architecture [17], we propose a pseudo-sequential processing network that can leverage distant element interactions within the input linguistic sequence to predict acoustic features. This is similar to what an RNN does, but discarding any recurrent connection. This will allow us to process all input elements in parallel at inference, hence substantially accelerating the acoustic predictions. In our setup, we do not face a sequence-to-sequence problem as stated previously, so we only use a structure like the Transformer encoder, which we call a linguistic-acoustic decoder.

The proposed SALAD architecture begins with the same embedding of linguistic and prosodic features, followed by a positioning encoding system. As we have no recurrent structure, and hence no processing order, this positioning encoding system will allow the upper parts of the network to locate their operating point in time, such that the network will know where it is inside the input sequence [17]. This positioning code c ∈ R^H is a combination of harmonic signals of varying frequency:

    c_{t,2i}   = sin(t / 10000^{2i/H})
    c_{t,2i+1} = cos(t / 10000^{2i/H})

where i represents each dimension within H. At each time-step t, we have a unique combination of signals that serves as a time stamp, and we can expect this to generalize better to long sequences than having an incremental counter that marks the position relative to the beginning. Each time stamp c_t is summed to each embedding h_t, and this is input to the decoder core.
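The positioning code above can be generated, for instance, as in the following PyTorch sketch (this is not the repository's implementation; H is assumed to be even, and the t0 offset reflects our reading of how the position index can be carried across batches in a stateful arrangement):

    import torch

    def positional_code(t_steps, h, t0=0):
        """Sinusoidal positioning codes c_t for t = t0, ..., t0 + t_steps - 1."""
        t = torch.arange(t0, t0 + t_steps, dtype=torch.float32).unsqueeze(1)   # (T, 1)
        i = torch.arange(0, h, 2, dtype=torch.float32)                          # the 2i values
        angles = t / torch.pow(10000.0, i / h)                                  # (T, H/2)
        code = torch.zeros(t_steps, h)
        code[:, 0::2] = torch.sin(angles)
        code[:, 1::2] = torch.cos(angles)
        return code   # summed to the embeddings h_t before the decoder core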
The decoder core is built with a stack of N blocks, depicted within the dashed blue rectangle in figure 1. These blocks are the same as the ones proposed in the decoder of [17], but we only have self-attention modules to the input, so it looks more like the Transformer encoder. The most salient part of this type of block is the multi-head attention (MHA) layer. This applies h parallel self-attention layers, which can have a more versatile feature extraction than a single attention layer with the possibility of smoothing intra-sequential interactions. After the MHA comes the feed-forward network (FFN), composed of two fully-connected layers. The first layer expands the attended features into a higher dimension dff, and this gets projected again to the embedding dimensionality H. Finally, the output layer is a fully-connected dimension adapter such that it can convert the hidden dimensions H to the desired amount of acoustic outputs, which in our case is 43 as discussed in section 3.2. As stated earlier, we may slightly degrade the quality of predictions with this output topology, as recurrence helps the output layer capture better the dynamics of acoustic features. Nonetheless, this can suffice our objective of having a highly parallelizable and competitive system.

3. Experimental Setup

3.1. Dataset

For the experiments we use utterances of speakers from the TCSTAR project dataset [22]. This corpora includes sentences and paragraphs taken from transcribed parliamentary speech and transcribed broadcast news. The purpose of these text sources is twofold: enrich the vocabulary and facilitate the selection of the sentences to achieve good prosodic and phonetic coverage. For this work, we choose the same male (M1) and female (F1) speakers as in our previous works. These two speakers have the most amount of data among the available ones. Their amount of data is balanced with approximately the following durations per split for both: 100 minutes for training, 15 minutes for validation, and 15 minutes for test.

3.2. Linguistic and Acoustic Features

The decoder maps linguistic and prosodic features into acoustic ones. This means that we first extract hand-crafted features out of the input textual query. These are extracted in the label format, following our previous work in [20]. We thus have a combination of sparse identifiers in the form of one-hot vectors, binary values, and real values. These include the identity of phonemes within a window of context, part of speech tags, distance from syllables to end of sentence, etc. For more detail we refer to [20] and references therein.

For a textual query of N words, we will obtain M label vectors, M ≥ N, each with 362 dimensions. In order to inject these into the acoustic decoder, we need an extra step though. As mentioned, the MUSA testbed follows the two-stage structure: (1) duration prediction and (2) acoustic prediction with the amount of frames specified in the first stage. Here we are only working with the acoustic mapping, so we enforce the duration with labeled data. For this reason, and similarly to what we did in previous works [14, 21], we replicate the linguistic label vector of each phoneme as many times as dictated by the ground-truth annotated duration, appending two extra dimensions to the 362 existing ones. These two extra dimensions correspond to (1) absolute duration normalized between 0 and 1, given the training data, and (2) relative position of the current phoneme inside the absolute duration, also normalized between 0 and 1.

We parameterize the speech with a vocoded representation using Ahocoder [23]. Ahocoder is a harmonic-plus-noise high quality vocoder, which converts each windowed waveform frame into three types of features: (1) mel-frequency cepstral coefficients (MFCCs), (2) log-F0 contour, and (3) voicing frequency (VF). Note that F0 contours have two states: either they follow a continuous envelope for voiced sections of speech, or they are 0, for which the logarithm is undefined. Because of that, Ahocoder encodes this value with -10^9, to avoid numerical undefined values. This would be a cumbersome output distribution to be predicted by a neural net using a quadratic regression loss. Therefore, to smooth the values out and normalize the log-F0 distribution, we linearly interpolate these contours and create an extra acoustic feature, the unvoiced-voiced flag (UV), which is the binary flag indicating the voiced or unvoiced state of the current frame. We will then have an acoustic vector with 40 MFCCs, 1 log-F0, 1 VF, and 1 UV. This equals a total number of 43 features per frame, where each frame window has a stride of 80 samples over the waveform. Real-numbered linguistic features are Z-normalized by computing statistics on the training data. In the acoustic feature outputs, all of them are normalized to fall within [0, 1].

Figure 1: Transition from RNN/LSTM acoustic model to SALAD. The embedding projections are the same. Positioning encoding
introduces sequential information. The decoder block is stacked N times to form the whole structure replacing the recurrent core.
FFN: Feed-forward Network. MHA: Multi-Head Attention.

Table 1: Different layer sizes of the different models. Emb: linear embedding layer, and hidden size H for SALAD models in all layers but the FFN ones. HidRNN: hidden LSTM layer size. dff: dimension of the feed-forward hidden layer inside the FFN.

Model         Emb   HidRNN   dff
Small RNN     128   450      -
Small SALAD   128   -        1024
Big RNN       512   1300     -
Big SALAD     512   -        2048

3.3. Model Details and Training Setup

We have two main structures: the baseline MUSA-RNN and SALAD. The RNN takes the form of an LSTM network for its known advantages in avoiding typical vanilla RNN pitfalls in terms of vanishing memory and bad gradient flows. Each of the two different models has two configurations, small (Small RNN/Small SALAD) and big (Big RNN/Big SALAD). This intends to show the performance difference with regard to speed and distortion between the proposed model and the baseline, but also their variability with respect to their capacity (RNN and SALAD models of the same capacity have an equivalent number of parameters although they have different connection topologies). Figure 1 depicts both models' structure, where only the size of their layers (LSTM, embedding, MHA, and FFN) changes with the mentioned magnitude. Table 1 summarizes the different layer sizes for both types of models and magnitudes.

Both models have dropout [24] in certain parts of their structure. The RNN models have it after the hidden LSTM layer, whereas the SALAD model has many dropouts in different parts of its submodules, replicating the ones proposed in the original Transformer encoder [17]. The RNN dropout is 0.5, and SALAD has a dropout of 0.1 in its attention components and 0.5 in the FFN and after the positioning codes.

Concerning the training setup, all models are trained with batches of 32 sequences of 120 symbols. The training is in a so-called stateful arrangement, such that we carry the sequential state between batches over time (that is, the memory state in the RNN and the position code index in SALAD). To achieve this, we concatenate all the sequences into a very long one and chop it into 32 long pieces. We then use a non-overlapping sliding window of size 120, so that each batch contains a piece per sequence, continuous with the previous batch. This makes the models learn how to deal with sequences longer than 120 outside of training, since they learn to use a conditioning state different from zero during training. Both models are trained for a maximum of 300 epochs, but they trigger a break by early stopping on the validation data. The validation criterion on which they stop is the mel cepstral distortion (MCD; discussed in section 4) with a patience of 20 epochs.

Regarding the optimizers, we use Adam [25] for the RNN models, with the default parameters in PyTorch (lr = 0.001, β1 = 0.9, β2 = 0.999, and ε = 10^-8). For SALAD we use a variant of Adam with an adaptive learning rate, already proposed in the Transformer work, called Noam [17]. This optimizer is based on Adam with β1 = 0.9, β2 = 0.98, ε = 10^-9 and a learning rate scheduled with

lr = H^(-0.5) · min(s^(-0.5), s · w^(-1.5))

where we have an increasing learning rate for w warm-up training batches, and it decreases afterwards, proportionally to the inverse square root of the step number s (number of batches). We use w = 4000 in all experiments. The parameter H is the inner embedding size of SALAD, which is 128 or 512 depending on whether it is the small or big model, as noted in table 1. We also tested Adam on the big version of SALAD, but we did not observe any improvement in the results, so we stick to Noam following the original Transformer setup.
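For clarity, the Noam schedule above can be written as a small helper function (an illustrative sketch consistent with the published formula, not the authors' code; names are ours):

    def noam_lr(step, H, warmup=4000):
        # lr = H**-0.5 * min(step**-0.5, step * warmup**-1.5):
        # linear warm-up for `warmup` batches, then decay proportional to the
        # inverse square root of the batch counter `step`.
        step = max(step, 1)  # guard against step = 0
        return (H ** -0.5) * min(step ** -0.5, step * (warmup ** -1.5))

    # Example: noam_lr(4000, H=512) returns the peak learning rate for the big SALAD model.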

Table 2: Male (top) and female (bottom) objective results. A: voiced/unvoiced accuracy.

Model         #Params   MCD [dB]   F0 RMSE [Hz]   A [%]
Male speaker
Small RNN     1.17 M    5.18       13.64          94.9
Small SALAD   1.04 M    5.92       16.33          93.8
Big RNN       9.85 M    5.15       13.58          94.9
Big SALAD     9.66 M    5.43       14.56          94.5
Female speaker
Small RNN     1.17 M    4.63       15.11          96.8
Small SALAD   1.04 M    5.25       20.15          96.4
Big RNN       9.85 M    4.73       15.44          96.9
Big SALAD     9.66 M    4.84       19.36          96.6

Figure 2: F0 contour histograms of ground-truth speech, big RNN and big SALAD for the male speaker.

4. Results
In order to assess the distortion introduced by both models, we
took three different objective evaluation metrics. First, we have
the MCD measured in decibels, which tells us the amount of
distortion in the prediction of the spectral envelope. Then we
have the root mean squared error (RMSE) of the F0 prediction
in Hertz. And finally, as we introduced the binary flag that spec-
ifies which frames are voiced or unvoiced, we measure the accu-
racy (number of correct hits over total outcomes) of this binary
classification prediction, where classes are balanced by nature.
These metrics follow the same formulations as in our previous
works [14, 20, 21].
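As an illustration, the three measures can be computed per utterance roughly as follows (our own sketch using a common MCD formulation; the exact formulas of [14, 20, 21] may differ in constants or frame selection):

    import numpy as np

    def mcd_db(mfcc_ref, mfcc_pred):
        # Mel cepstral distortion in dB between reference and predicted MFCC frames.
        diff = mfcc_ref - mfcc_pred                      # shape (frames, coeffs)
        return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

    def f0_rmse_hz(f0_ref, f0_pred, voiced):
        # RMSE of F0 in Hz, here restricted to frames voiced in the reference.
        err = f0_ref[voiced] - f0_pred[voiced]
        return float(np.sqrt(np.mean(err ** 2)))

    def uv_accuracy(uv_ref, uv_pred):
        # Fraction of frames whose binary voiced/unvoiced flag is correctly predicted.
        return float(np.mean(uv_ref == uv_pred))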
Table 2 shows the objective results for the systems detailed
in section 3.3 over the two mentioned speakers, M and F. For
both speakers, RNN models perform better than the SALAD
ones in terms of accuracy and error. However, the smallest gap, which occurs with the biggest SALAD model, is only 0.3 dB in the case of the male speaker and 0.1 dB in the case of the female speaker, showing the competitive performance of these non-recurrent structures. On the other hand, Figure 3 depicts the inference speed on CPU for the 4 different models synthesizing different utterance lengths. Each dot in the plot indicates a test file synthesis. After we collected the dots, we used the RANSAC [26] algorithm (Scikit-learn implementation) to fit a linear regression robust to outliers. Each model line shows the trend of latency growth with the generated utterance length, and the RNN models have a much higher slope than the SALAD models. In fact, SALAD models remain quite flat even for files of up to 35 s, having a maximum latency in their linear fit of 5.45 s for the biggest SALAD, whereas even the small RNN is over 60 s. We have to note that these measurements are taken with PyTorch [27] implementations of LSTM and other layers running on a CPU. If we run them on a GPU, we notice that both systems can work in real time. SALAD is still faster even on GPU; however, the big gap happens on CPUs, which motivates the use of SALAD when we have more limited resources.

We can also check the pitch prediction deviation, as it is the metric most affected by the model change. We show the test pitch histograms for ground truth, big RNN and big SALAD in figure 2. There we can see that SALAD's failure consists in focusing on the mean and ignoring the variance of the real distribution more than the RNN does. It could be interesting to try some sort of short-memory non-recurrent modules close to the output to alleviate this peaky behavior that makes pitch flatter (and thus less expressive), checking whether it is directly related to the removal of the recurrent connection in the output layer. Audio samples are available online as qualitative results at http://veu.talp.cat/saladtts .

Figure 3: Inference time for the four different models with respect to generated waveform length. Both axes are in seconds.

Table 3: Maximum inference latency with RANSAC fit.

Model         Max. latency [s]
Small RNN     63.74
Small SALAD   4.715
Big RNN       64.84
Big SALAD     5.455

5. Conclusions

In this work we present a competitive and fast acoustic model replacement for our MUSA-RNN TTS baseline. The proposal, SALAD, is based on the Transformer network, where self-attention modules build a global reasoning within the sequence of linguistic tokens to come up with the acoustic outcomes. Furthermore, positioning codes ensure ordered processing in substitution of the ordered injection of features that the RNN has intrinsic to its topology. With SALAD, we get on average over an order of magnitude of inference acceleration against the RNN baseline on CPU, so this is a potential fit for applying text-to-speech on embedded devices like mobile handsets. Further work could be devoted to pushing the boundaries of this system to alleviate the observed flatter pitch behavior.

6. Acknowledgements

This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).
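The robust latency fit described above can be reproduced with the Scikit-learn RANSAC implementation along these lines (an illustrative sketch with placeholder measurements, not the authors' script):

    import numpy as np
    from sklearn.linear_model import RANSACRegressor

    # One (generated length, synthesis latency) pair per test file, both in seconds.
    lengths = np.array([[1.0], [5.0], [10.0], [20.0], [35.0]])   # placeholder values
    latencies = np.array([0.2, 0.8, 1.6, 3.1, 5.4])              # placeholder values

    ransac = RANSACRegressor()                  # robust linear fit that down-weights outlier dots
    ransac.fit(lengths, latencies)
    slope = ransac.estimator_.coef_[0]          # latency growth per second of generated audio
    max_latency = ransac.predict([[35.0]])[0]   # fitted latency at the longest utterances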

7. References

[1] H. Zen, "Acoustic modeling in statistical parametric speech synthesis–from HMM to LSTM-RNN," 2015.
[2] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 7962–7966.
[3] H. Lu, S. King, and O. Watts, "Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis," Proc. ISCA SSW8, pp. 281–285, 2013.
[4] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, "On the training aspects of deep neural network (DNN) for parametric TTS synthesis," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3829–3833.
[5] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, "Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning," in Proc. INTERSPEECH, 2015, pp. 854–858.
[6] Q. Hu, Y. Stylianou, R. Maia, K. Richmond, J. Yamagishi, and J. Latorre, "An investigation of the application of dynamic sinusoidal models to statistical parametric speech synthesis," in Proc. INTERSPEECH, 2014, pp. 780–784.
[7] S. Kang, X. Qian, and H. Meng, "Multi-distribution deep belief network for speech synthesis," in Proc. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8012–8016.
[8] S. Pascual and A. Bonafonte, "Prosodic break prediction with RNNs," in Proc. International Conference on Advances in Speech and Language Technologies for Iberian Languages. Springer, 2016, pp. 64–72.
[9] S.-H. Chen, S.-H. Hwang, and Y.-R. Wang, "An RNN-based prosodic information synthesizer for Mandarin text-to-speech," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 3, pp. 226–239, 1998.
[10] S. Achanta, T. Godambe, and S. V. Gangashetty, "An investigation of recurrent neural network architectures for statistical parametric speech synthesis," in Proc. INTERSPEECH, 2015.
[11] Z. Wu and S. King, "Investigating gated recurrent networks for speech synthesis," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5140–5144.
[12] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory, "Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks," in Proc. INTERSPEECH, 2014, pp. 2268–2272.
[13] H. Zen and H. Sak, "Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4470–4474.
[14] S. Pascual and A. Bonafonte, "Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation," in Proc. 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 2325–2329.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 6000–6010.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3104–3112.
[19] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[20] S. Pascual, "Deep learning applied to speech synthesis," Master's thesis, Universitat Politècnica de Catalunya, 2016.
[21] S. Pascual and A. Bonafonte Cávez, "Multi-output RNN-LSTM for multiple speaker speech synthesis with a-interpolation model," in Proc. ISCA SSW9. IEEE, 2016, pp. 112–117.
[22] A. Bonafonte, H. Höge, I. Kiss, A. Moreno, U. Ziegenhain, H. van den Heuvel, H.-U. Hain, X. S. Wang, and M.-N. Garcia, "TC-STAR: Specifications of language resources and evaluation for speech synthesis," in Proc. LREC Conf, 2006, pp. 311–314.
[23] D. Erro, I. Sainz, E. Navas, and I. Hernáez, "Improved HNM-based vocoder for statistical synthesizers," in Proc. INTERSPEECH, 2011, pp. 1809–1812.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[26] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS Workshop on The Future of Gradient-based Machine Learning Software & Techniques (NIPS-Autodiff), 2017.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Japañol: a mobile application to help improving Spanish pronunciation by


Japanese native speakers
Cristian Tejedor-García, Valentín Cardeñoso-Payo, David Escudero-Mancebo

Grupo de Investigación ECA-SIMM. Universidad de Valladolid


valentin.cardenoso@uva.es

Abstract
In this document, we describe the mobile application Japañol1 ,
a learning tool which helps pronunciation training of Spanish
as a foreign language (L2) at a segmental level. The tool has
been specifically designed to be used by native Japanese peo-
ple, and is a branch of a previous gamified CAPT tool,
TipTopTalk!. In this case, a predefined cycle of actions related
to exposure, discrimination and production is presented to the
user, always under the minimal-pairs approach to pronunciation
training. It incorporates freely available ASR and TTS and pro-
vides feedback to the user by means of short video tutorials, to
reinforce learning progression.
Index Terms: computer-assisted pronunciation training, speech recognition, human-computer interaction, computational para-linguistics.

Figure 1: Components of the CAPT system.
1. Introduction

The way we teach and learn foreign languages is adapting to the technologies. The development of new modes to engage people in learning, such as Computer-Assisted Pronunciation Training (CAPT), Computer-Assisted Language Learning (CALL) and Mobile-Assisted Language Learning (MALL), allows improving linguistic skills anytime and anywhere [1]. Also, these systems can give useful information on how learners perform and improve their pronunciation [2]. While there are many software tools in the field of Computer-Assisted Pronunciation Training (CAPT) that rely on speech technologies to provide users with L2 pronunciation training, Japañol [3] distinctively incorporates a well designed cycle of all the relevant activities related to pronunciation training: exposure, discrimination, production and mixed mode.

This application represents an evolution of previous serious games [4, 5, 6] designed for pronunciation training of an L2 by non-native speakers. All of them rely on the minimal-pairs methodology [7] and are within the context of research projects related to the development and testing of software tools and games for foreign language learning (TIN2014-59852-R and VA050G18).

Previous versions were based on the free selection by users of exposure, discrimination and production tasks of, mainly, English or Spanish minimal pairs, in order to get achievements and increase points in leader-boards. With that approach, we were able to assess users' pronunciation level in an L2 [8]. We have also analyzed how the introduction of corrective feedback [9, 10] increased pronunciation improvement among users after the first stages of use.

While the freedom of movement on game-oriented tools leads users to maximize their score by repeating those tasks they found easy, the continued use of the tool seemed to generate stagnation. Complementarily, learning-oriented tools should focus on users' difficulties, offering guided and corrective feedback and achieving better effectiveness and efficiency in pedagogical results [11]. This motivated this alternate version of the tool, which provides a fixed cycle of well-known and balanced learning activities, and which has been applied initially to an experiment whose results are presented in this same conference [12].

1 This work has been partially supported by the Ministerio de Economía y Empresa (MINECO) and the European Regional Development Fund FEDER under project (TIN2014-59852-R) and by Consejería de Educación of Junta de Castilla y León under project (VA050G18).

2. System's description

Figure 1 shows the architecture diagram of components in Japañol. It is a native Android application built from scratch that can be run using low-cost resources available at language laboratories such as computers, tablets, speakers and microphones. It uses Google speech technology, namely the offline Text-To-Speech tool and the Automatic Speech Recognition web service for Android, which offers an N-best list of probable results for each utterance. Japañol keeps a record of the chronological events and results of the users with the system in log files. Both audio and log files are stored on a server, through a set of web services, in order to be later analyzed to extract results and conclusions. Lists of minimal-pair words are available in a database accessible to Japañol.

3. Using the tool

Most current CAPT systems offer isolated pronunciation or discrimination activities as part of the training exercises. Very few combine these different modes as we do in our learning application. In Japañol we follow a learning methodology based on the Theory, Exposure, Discrimination, Pronunciation and Mixed modes.
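Japañol relies on the N-best output of the Google ASR service described in Section 2: a production is accepted only when the top hypothesis matches the target word, as detailed below. A minimal illustrative sketch of that check (plain Python with hypothetical names, not the actual Android implementation):

    def is_accepted(n_best, target_word):
        # A production counts as correct only when the most likely ASR
        # hypothesis equals the word the learner was asked to pronounce.
        if not n_best:
            return False
        return n_best[0].strip().lower() == target_word.strip().lower()

    # Example: is_accepted(["pero", "perro", "peso"], "perro") returns False.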

Figure 2: Standard flow to complete a lesson in Japañol.

The activities are organized as a sequence of lessons, each devoted to a specific segmental pronunciation difficulty associated with a minimal pair. In each lesson, a brief and clear explanation about the problem and valid pronunciations is provided, in the form of audiovisual material. Then, an exposure mode is entered, in which the user can listen to reference realizations of each valid utterance. After exposure, a discrimination mode is faced, in which a sequence of 10 pairs of distinct words (part of a minimal pair) is presented while the user is required to select which one corresponds best with the listened utterance (generated using the TTS). A minimum of 6 successful hits has to be obtained in order to proceed to the next step. If not, the user is suggested to return to exposure mode again before facing a new discrimination challenge for the same lesson. Once the minimum required number of right answers has been given, or after 5 tries to avoid discouraging the user, a pronunciation mode is entered. In this mode, the user has to say, in sequence, 10 different words selected from the list associated with the minimal pair in the lesson. The recorded speech is submitted to Google Speech ASR and is accepted as valid only when the first item of the N-best list provided by the ASR matches the target word. A minimum of 6 correct pronunciations is required to pass. If the attempt fails, the user is recommended to return to exposure mode before attempting again.

Finally, a mixed mode activity is required for each lesson. In this mode, a sequence of 10 random discrimination and production tasks is presented, and a 60% success rate is again required to proceed. This mode resembles the added difficulty found when switching from discrimination to production in a normal conversation. A diagram showing the sequence of activities to complete a lesson is shown in Figure 2.

4. Activities in the demonstration

The demonstration will consist of an interactive session showing all the different modes in the client application (see Section 3). People will be able to ask for help during the presentation. At the beginning, all attendees can download the application from a given URL or by taking a photo of a QR code. Once downloaded, the demonstration begins by logging into the application before entering the menu of lessons.

5. Acknowledgements

Special thanks to the people of the Language Center of Universidad de Valladolid for their support for the evaluation campaign with Japanese students. We also thank members of the research team for their active support: Enrique Cámara, César González Ferreras and Mario Corrales Astorgano from Universidad de Valladolid, and Antonio Ríos Mestre and María Machuca Ayuso from Universitat Autònoma de Barcelona. Special thanks also to Takuya Kimura, from Seisen University in Tokyo, who provided Japanese versions of the screens and conducted the experiment with Japanese students at his university.

6. References

[1] R. I. Thomson and T. M. Derwing, "The effectiveness of L2 pronunciation instruction: A narrative review," Applied Linguistics, vol. 36, no. 3, p. 326, 2014.
[2] W. Li and D. Mollá-Aliod, "Computer processing of oriental languages. Language technology for the knowledge-based economy," Lecture Notes in Computer Science, vol. 5459, 2009.
[3] Japañol Project. (2017) ECA-SIMM Japañol project. [Online]. Available: https://eca-simm.uva.es/es/proyectos/capt/japanol/
[4] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas, C. González-Ferreras, and D. Escudero-Mancebo, "Measuring pronunciation improvement in users of CAPT tool TipTopTalk!" Interspeech, pp. 1178–1179, September 2016.
[5] C. Tejedor-García, D. Escudero-Mancebo, E. Cámara-Arenas, C. González-Ferreras, and V. Cardeñoso-Payo, "Improving L2 production with a gamified computer-assisted pronunciation training tool, TipTopTalk!" IberSPEECH 2016: IX Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop events, pp. 177–186, 2016.
[6] ——, "TipTopTalk! mobile application for speech training using minimal pairs and gamification," IberSPEECH 2016: IX Jornadas en Tecnologías del Habla and IV Iberian SLTech Workshop events, pp. 425–432, 2016.
[7] C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas, C. González-Ferreras, and D. Escudero-Mancebo, "Playing around minimal pairs to improve pronunciation training," IFCASL 2015, 2015.
[8] C. Tejedor-García and D. Escudero-Mancebo, "Uso de pares mínimos en herramientas para la práctica de la pronunciación del español como lengua extranjera," Revista de la Asociación Europea de Profesores de Español. El español por el mundo, no. 1, pp. 355–363, 2018.
[9] D. Escudero-Mancebo, E. Cámara-Arenas, C. Tejedor-García, C. González-Ferreras, and V. Cardeñoso-Payo, "Implementation and test of a serious game based on minimal pairs for pronunciation training," SLaTE, pp. 125–130, 2015.
[10] A. Rauber, C. Tejedor-García, V. Cardeñoso-Payo, E. Cámara-Arenas, C. González-Ferreras, D. Escudero-Mancebo, and A. Rato, "TipTopTalk!: A game to improve the perception and production of L2 sounds," Abstracts of New Sounds Aarhus, 8th International Conference on Second Language Speech, p. 160, 2016.
[11] C. Tejedor-García, D. Escudero-Mancebo, C. González-Ferreras, E. Cámara-Arenas, and V. Cardeñoso-Payo, "Evaluating the Efficiency of Synthetic Voice for Providing Corrective Feedback in a Pronunciation Training Tool Based on Minimal Pairs," SLaTE, 2017.
[12] C. Tejedor-García, D. Escudero-Mancebo, E. Cámara-Arenas, C. González-Ferreras, and V. Cardeñoso-Payo, "Improving pronunciation of Spanish as a foreign language for L1 Japanese speakers with Japañol CAPT tool," IberSPEECH 2018: X Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop, pp. iii–lll, 2018.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Towards the Application of Global Quality-of-Service Metrics in Biometric


Systems
J. M. Espín1, R. Font1, J. F. Inglés-Romero1 and C. Vicente-Chicote2

1 Biometric Vox S.L., Spain
2 Universidad de Extremadura, Quercus Software Engineering Group, EPCC, Spain
{jm.espin,roberto.font,juanfran.ingles}@biometricvox.com, cristinav@unex.es

Abstract

Performance metrics, such as Equal Error Rate or Detection Cost Function, have been widely used to evaluate and compare biometric systems. However, they seem insufficient when dealing with real-world applications. First, these systems tend to include an increasing number of subsystems, e.g. aimed at spoofing detection or information management. As a result, the aggregation of new capabilities (and their interactions) makes the evaluation of the overall performance more complex. Second, performance metrics only offer a partial view of the system quality, in which non-functional properties, such as user experience, efficiency or reliability, are generally ignored. In this paper, we introduce RoQME, an Integrated Technical Project funded by the EU H2020 RobMoSys project. RoQME aims at providing software engineers with methods and tools to deal with system-level non-functional properties, enabling the specification of global Quality-of-Service (QoS) metrics. Although the project is in the context of robotics software, the paper presents potential applications of RoQME to enrich the way in which performance is evaluated in biometric systems, focusing specifically on Automatic Speaker Verification (ASV) systems as a first step.
Index Terms: speaker recognition, automatic speaker verification, non-functional properties

1. Introduction

Assessing the performance of biometric systems usually involves: (1) selecting or creating an evaluation corpus, (2) defining a testing protocol, (3) applying the defined corpus and protocol to the system under consideration, and (4) computing a number of performance metrics. These metrics are generally aimed at capturing the ability of the system to produce discriminative and well-calibrated scores. The development of common corpora, protocols and metrics has been one of the major driving forces of the steady progress experienced in biometric authentication over the last decades. This is the case, in the field of speaker recognition, of initiatives such as the ongoing Speaker Recognition Evaluation series [1, 2], organized by the National Institute of Standards and Technology, or the Speakers in the Wild [3] and the RedDots Challenge [4] initiatives.

The increasing complexity of biometric systems makes conventional performance metrics noticeably insufficient. Among other reasons, this is due to the interdependencies of the numerous modules that current applications involve (e.g., spoofing detection, information management, etc.). The joint operation of these modules makes it difficult to assess to what extent they contribute to the overall performance of the system. This fact has not been acknowledged until very recently [5] and there is still a lot of ground to cover in this line. In addition, and more importantly, real-world systems normally have other quality aspects that are often disregarded. For example, in an Automatic Speaker Verification (ASV) system, the number of failed access attempts may not only depend on the False Rejection Rate, but also on usability considerations, such as users misunderstanding the instructions and trying to authenticate with an incorrect passphrase on a text-dependent system. In the same way, the security of the system as a whole does not rest exclusively on the False Acceptance Rate, but also on some design decisions, e.g. the number of failed attempts before blocking a particular user.

Properties such as usability or security are generally referred to as non-functional properties. They can be considered as particular quality aspects of a system and represent an essential part of any software solution. Unfortunately, non-functional properties often get missed when designing most software systems. In this vein, the RoQME project [6], which is currently in progress, aims at enabling the modeling (at design time) and estimation (at runtime) of non-functional properties, particularly in the field of robotics. This paper introduces potential applications of the RoQME project to biometric systems; in particular, the paper focuses on ASV systems as a first attempt to address the problem. To the best of our knowledge, the evaluation of biometric systems in terms of global Quality-of-Service (QoS) metrics, defined on non-functional properties, has not been addressed before.

2. The RoQME project

The RoQME Integrated Technical Project (ITP), funded by the EU H2020 RobMoSys project [7], aims at providing software engineers with a model-driven tool-chain allowing them to: (1) model system-level non-functional properties in terms of contextual information; and (2) generate ready-to-use code to evaluate QoS metrics, defined on the properties previously specified. Regarding the former, RoQME will provide engineers with a high-level language to model context variables (e.g., battery level) and, from them, relevant context patterns (e.g., "the battery level drops more than 1% per minute"). The detection of a context pattern will be considered an observation in a Bayesian Network, which is a probabilistic model used in RoQME to express the dynamics of a non-functional property (e.g., power consumption).

Once the model containing the relevant properties, contexts and observations is specified, RoQME automatically generates the code that will be responsible for estimating the degree of fulfillment of each non-functional property in terms of QoS metrics. Basically, the implementation will consist of: (1) a context monitor that receives raw contextual data and produces context events (e.g., changes in the battery level); (2) an event processor that searches for the event patterns specified in the RoQME
model and, when found, produces observations (e.g., battery is draining too fast); and, finally, (3) a probabilistic reasoner that computes (based on Bayesian inference) a numeric estimation for each metric (e.g., a value of 0.89 can be understood as the probability of being optimal in terms of power consumption). Although the RoQME project is focused on the robotics domain, the modeling tools being developed are designed to be extensible and application-domain agnostic. Further details about the RoQME project can be found in [8, 9].

3. Applications to biometric systems

This section describes some potential applications of RoQME to biometric systems, in particular to ASV systems. As stated before, although the RoQME project is focused on the robotics domain, we believe that it explores an issue of great relevance for many other software systems.

3.1. Benchmarking

Using performance metrics, such as Equal Error Rate or Detection Cost Function, to quantify the goodness of the results (either scores or binary decisions) is an essential instrument to evaluate and contrast different algorithms and technologies. However, when it comes to the integral quality of a real-world system, these metrics fail to capture some important aspects, including:
• The interaction of all the different subsystems and their combined effect on False Acceptance and False Rejection ratios.
• The effect of usability and accessibility considerations on Failure to Acquire, Failure to Enroll, and False Rejection ratios. For example, if the instructions on the screen are not clear enough, the user might be unable to provide a valid voice sample. In this case, the performance of the system would be affected before any sample reaches the verification process.
• The impact of certain system configurations and decisions at design/deployment time, such as the thresholds of the different subsystems or how many authentication attempts are granted to verify a user. Regarding the latter, it is obvious that a system with no limits on the number of failed authentication attempts would be less secure than an identical system with appropriate limits.

We propose to integrate non-functional properties (i.e., security, usability, reliability, etc.) into the system quality evaluation. In this sense, RoQME would allow software engineers to benchmark complex biometric systems, in terms of their overall performance, given a number of QoS metrics. Basically, engineers would start by specifying the required non-functional properties using the high-level modeling language provided by RoQME. After that, the resulting models will automatically generate the code of a software component. At runtime, this component will continuously update the QoS metrics associated with the specified properties. Finally, this information would be considered together with other performance metrics (e.g., Equal Error Rate) to compare different systems, configurations, etc.

3.2. Dynamic assessment

Monitoring QoS metrics over time would allow engineers to detect any anomaly or deviation from the expected behavior and foresee future problems. In this sense, RoQME could be useful for the early identification of malfunctioning or undesired behaviors. At runtime, RoQME can provide system administrators with real-time QoS indicators about the degree of fulfillment in usability, resource utilization or stability, to name just a few examples. This information can then be used by the engineers to improve the system or adjust its configuration.

3.3. Self-adaptation

Speaker verification systems could use the QoS metrics provided by RoQME to automatically adjust their own configuration (or software) in order to optimize their performance under changing circumstances. For instance, this approach could be applied to estimate the voiceprint quality. This would allow the system to dynamically change its operation, e.g., to ask the user to provide additional voice samples if his/her voiceprint quality falls. Furthermore, RoQME QoS metrics could play an important role in an unsupervised strategy for adapting voiceprints, aimed at gradually improving their quality as the system is used. This approach would make the system more robust, e.g., against aging effects.

4. Acknowledgements

RoQME has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement N. 732410, in the form of financial support to third parties of the RobMoSys project.

5. References

[1] NIST Speaker Recognition Evaluations. [Online]. Available: https://www.nist.gov/itl/iad/mig/speaker-recognition
[2] J. González-Rodríguez, "Evaluating automatic speaker recognition systems: An overview of the NIST speaker recognition evaluations (1996-2014)," Loquens, vol. 1, no. 1, 2014.
[3] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, "The 2016 speakers in the wild speaker recognition evaluation," in 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2016.
[4] K. Lee, A. Larcher, G. Wang, P. Kenny, N. Brümmer, D. A. van Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma, H. Li, T. Stafylakis, M. J. Alam, A. Swart, and J. Perez, "The RedDots data collection for speaker recognition," in 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2015.
[5] T. Kinnunen, K.-A. Lee, H. Delgado, N. W. D. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, "t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification," in ODYSSEY 2018 – The Speaker and Language Recognition Workshop, 2018.
[6] RoQME Integrated Technical Project. (2018-2019). RoQME: Dealing with non-functional properties through global Robot Quality-of-Service Metrics. [Online]. Available: http://robmosys.eu/roqme/
[7] RobMoSys EU H2020 Project. (2017-2020). RobMoSys: Composable Models and Software for Robotics Systems - Towards an EU Digital Industrial Platform for Robotics. [Online]. Available: http://robmosys.eu
[8] C. Vicente-Chicote, J. F. Inglés-Romero, and J. Martínez, "A component-based and model-driven approach to deal with non-functional properties through global QoS metrics," in 5th International Workshop on Interplay of Model-Driven and Component-Based Software Engineering (ModComp), in conjunction with MODELS'18, 2018.
[9] J. F. Inglés-Romero, J. M. Espín, R. Jiménez-Andreu, R. Font, and C. Vicente-Chicote, "Towards the use of quality of service metrics in reinforcement learning: A robotics example," in 5th International Workshop on Model-driven Robot Software Engineering (MORSE), in conjunction with MODELS'18, 2018.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Incorporation of a Module for Automatic Prediction of Oral Productions


Quality in a Learning Video Game
David Escudero-Mancebo, Valentín Cardeñoso-Payo1

1 Department of Computer Science, University of Valladolid, Spain
descuder@infor.uva.es, valen@infor.uva.es

Abstract

This document presents the research project TIN2017-88858-C2-1-R of the Spanish Government. The antecedents and goals of the project are presented. The current status, recent achievements and collaborations after the first year of development are described in the paper.
Index Terms: Computer Assisted Pronunciation Training, Prosody, Down syndrome speech, Learning video games.

1. Antecedents

This project continues the research line opened in 2014 with the project La piedra mágica1, supported by Recercaixa, and continued in 2016 with the project Pradia2, supported by BBVA Humanidades Digitales. The first of these projects served to develop a video game for adolescents with Down syndrome to train oral communication, in particular prosody and pragmatics [1]. In the second project, the video game routines were enriched and a clinical evaluation of users' performance was tested with formal evaluation.

2. Goals of the project

The video games developed in previous projects require the permanent assistance of a personal trainer (a teacher, a therapist, a relative...) to control and monitor the progress of the game users. The goal of the current project is to program a module that increases the software capabilities so that users can play in an autonomous way, supervised by an intelligent tutor. This intelligent system will be responsible for deciding whether the user should repeat or continue with the training activities, giving feedback for him/her to become more competent in oral communication and prosodic skills.

Three stages have been defined: 1) corpus collection of audio of DS people playing with the video game; the DS audios will be rated by professional voice therapists, and the corrective advice of the therapists will also be monitored; 2) computational models of the quality of DS turns will be trained from the audio, and a knowledge database will be compiled with the therapists' corrective orders; 3) an expert system will use the compiled information to decide about the user activities in real time.

3. Current status

The following activities have been performed according to the programmed schedule:

• The software has been adapted to record in real time annotations from trainers about the quality of the video game users' oral turns.
• An app for mobile devices has been programmed to collect trainers' annotations in a way that is transparent to players.
• An evaluation template has been devised, including a set of criteria to be evaluated by the trainers. This template has been designed in collaboration with voice therapists from the centers Colegio Tórtola de Educación Especial of Fundación Personas Valladolid and ASDOVA Down Syndrome Association Valladolid.
• An inter-rater consistency evaluation process is being run to assess the evaluation template and guarantee the consistency of the parallel evaluations to be performed during the short-term recording activity.

4. Achievements

The main achievement in this first year of the project development is the publication [2], which reports on results derived from data obtained in previous projects. The video game was also presented in [3]. The PhD thesis of Mario Corrales, framed in this project, was presented at the XXXIV Congreso SEPLN, Seville, September 2018 (the SEPLN society granted the student support to attend the conference). In the current conference we also present advances on the rating of the quality of DS voice from data [4].

5. Collaborations

Apart from the already reported collaborations with Fundación Personas Valladolid and ASDOVA, we have opened several promising national and international collaborations.

On one hand, we are collaborating with the research group of Pastora Martínez from the Department of Psychology of UNED Madrid. She is co-author of several relevant publications in the field of Down syndrome and prosody [5, 6]. She has provided us with her paired corpus of typical-DS speech, obtained while applying the PEPS-C test [7].

On the other hand, we submitted an Erasmus+ project (strategic partnership) in collaboration with different research groups and education centers in Portugal, Ireland, Italy, France and Bosnia. The project was finally rejected, but the evaluation gives encouraging feedback for future resubmissions.

6. Acknowledgments

Special thanks to Yolanda Martín de San Pablo from Fundación Personas, Alfonso Rodríguez from ASDOVA and Jesús Gómez from CEIP Urueña for their collaboration during the evaluation template definition and during the inter-rater consistency tests. Thanks to all the team for their active support: Valle Flores, Carlos Vivaracho, César González, Lourdes Aguilar and Mario Corrales.

1 http://prado.uab.es/recercaixa/
2 http://www.pradia.net

7. References
[1] C. González-Ferreras, D. Escudero-Mancebo, M. Corrales-
Astorgano, L. Aguilar-Cuevas, and V. Flores-Lucas, “Engaging
adolescents with down syndrome in an educational video game,”
International Journal of Human–Computer Interaction, vol. 33,
no. 9, pp. 693–712, 2017.
[2] M. Corrales-Astorgano, D. Escudero-Mancebo, and C. González-
Ferreras, “Acoustic characterization and perceptual analysis of the
relative importance of prosody in speech of people with down syn-
drome,” Speech Communication, vol. 99, pp. 90–100, 2018.
[3] F. Adell, L. Aguilar, M. Corrales-Astorgano, and D. Escudero-
Mancebo, “Proceso de innovación educativa en educación espe-
cial: Enseñanza de la prosodia con fines comunicativos con el
apoyo de un videojuego educativo,” I Congreso Internacional en
Humanidades Digitales, p. in press.
[4] M. Corrales-Astorgano, P. Martínez-Castilla, D. Escudero-Mancebo,
L. Aguilar, C. González-Ferreras, and V. Cardeñoso-Payo, "To-
wards an automatic evaluation of the prosody of people with down
syndrome,” Iberspeech 2018, p. in press.
[5] P. Martı́nez-Castilla and S. Peppé, “Developing a test of prosodic
ability for speakers of iberian spanish,” Speech Communication,
vol. 50, no. 11-12, pp. 900–915, 2008.
[6] S. J. Peppé, P. Martı́nez-Castilla, M. Coene, I. Hesling, I. Moen,
and F. Gibbon, “Assessing prosodic skills in five european lan-
guages: Cross-linguistic differences in typical and atypical pop-
ulations,” International journal of speech-language pathology,
vol. 12, no. 1, pp. 1–7, 2010.
[7] S. Peppé and J. McCann, “Assessing intonation and prosody in chil-
dren with atypical language development: the peps-c test and the
revised version,” Clinical Linguistics & Phonetics, vol. 17, no. 4-5,
pp. 345–354, 2003.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Silent Speech:
Restoring the Power of Speech to People whose Larynx has been Removed
Jose A. Gonzalez1, Phil D. Green2, Damian Murphy3, Amelia Gully3, and James M. Gilbert4

1 University of Malaga, Spain
2 University of Sheffield, U.K.
3 University of York, U.K.
4 University of Hull, U.K.
j.gonzalez@uma.es

Abstract
Every year, some 17,500 people in Europe and North America
lose the power of speech after undergoing a laryngectomy, nor-
mally as a treatment for throat cancer. Several research groups
have recently demonstrated that it is possible to restore speech
to these people by using machine learning to learn the trans-
formation from articulator movement to sound. In our project
articulator movement is captured by a technique developed by
our collaborators at Hull University called Permanent Magnet
Articulography (PMA), which senses the changes of magnetic
field caused by movements of small magnets attached to the
lips and tongue. This solution, however, requires synchronous
PMA-and-audio recordings for learning the transformation and, hence, it cannot be applied to people who have already lost their voice. Here we propose to investigate a variant of this technique in which the PMA data are used to drive an articulatory synthesiser, which generates speech acoustics by simulating the airflow through a computational model of the vocal tract. The project goals, participants, current status, and achievements of the project are discussed below.
Index Terms: speech restoration, silent speech interfaces, permanent magnet articulography, articulatory synthesis, magnetic resonance imaging

Figure 1: Permanent Magnet Articulography (PMA) system. Upper-left and lower-left: placement of magnets used to capture the movements of the lips and tongue. Right: PMA headset consisting of micro-controller, battery and magnetic sensors for detecting the variations of the magnetic field generated by the magnets.

1. Introduction

A total laryngectomy is a clinical procedure in which the voice box is surgically removed, most commonly as a treatment for throat cancer. This procedure not only leaves the subject muted, but it is also known to cause social isolation and feelings of loss of identity, and can lead to clinical depression [1, 2, 3]. Currently available methods for speaking after a laryngectomy include the electro-larynx, a hand-held device which produces an unnatural, electronic voice; oesophageal speech, which is difficult to master; and the voice prosthesis, which is considered to be the current gold standard, but has a short lifetime (4 to 8 weeks) due to candida growth, thus requiring regular hospital visits for valve replacement [4, 5, 6]. Other available methods, such as Alternative and Augmentative Communication (AAC) devices [7], where the user types words and the device synthesises them, are also limited by their slow manual input and, therefore, are not suitable for anything other than short conversations.

As an alternative to existing speech restoration methods, we are investigating a new way to restore speech to those who are unable to speak [8, 9, 10, 11, 12]. The idea is to transform measurements of the lip and tongue movements obtained using magnetic sensing into audible speech using a speaker-dependent transformation, implemented by machine learning techniques (typically deep neural networks [13]). For capturing articulator movement we use Permanent Magnet Articulography (PMA) [14, 15, 16, 17], a technology developed by our collaborators at the University of Hull in which small magnets are attached to the lips and tongue and the magnetic field generated when the articulators move is captured by sensors close to the mouth (see Fig. 1 for a picture of the PMA system). The parameters of the transformation for converting articulator movement into speech are currently estimated from simultaneous recordings of audio and PMA signals acquired before the person loses her/his voice. Some audio samples produced by the proposed restoration method can be found at https://www.jandresgonzalez.com/is2017. As can be seen, the samples are mostly intelligible and the speaker identity is clearly preserved.

A limitation of the above approach for speech restoration is that simultaneous speech-and-sensor recordings are required for estimating the mapping between articulator movement and its acoustics. Thus, this makes the method unsuitable for persons who have already lost their voice. The aim of this project is, thus, to investigate a novel approach that would make simultaneous recordings unnecessary. The idea is to predict, in real time, the position of the speech articulators from the PMA signals. This is a non-trivial problem, as the magnetic field arriving at the sensors is a composite of the fields generated by all the magnets attached to the articulators. From the estimated vocal tract shapes, speech can finally be synthesised by simulating airflow propagation through the vocal tract using well-known,

established articulatory synthesis methods [18]. In the next sections, the objectives of the project, the participants, and its current status are described in detail.

2. Project objectives

As previously mentioned, the goal of this project is to investigate and develop a new method for speech restoration based on the PMA capturing technique and machine learning, but without the need for parallel speech and sensor recordings to train the machine learning techniques. We attempt to do this by, instead, learning an alternative transformation which will map the articulatory data captured by the PMA device into a physical model of the vocal tract (e.g. a 1D or 2D representation of the vocal tract). Then, we will be able to generate audible speech from the estimated vocal tract shapes by using well-known articulatory synthesis methods.

The detailed objectives of this project are:

• To train a direct transformation from PMA data to the vocal tract shapes used by the articulatory synthesiser.
• To personalise the synthesiser so that the speech generated sounds like the user's original voice.

With regard to the latter point, we expect to use MRI images of the subject's vocal tract to personalise the synthesiser. In this way, the acoustics generated by the synthesiser will resemble the user's original voice.

3. Partners

There are four partners involved in this project:

• University of Sheffield: Jose A. Gonzalez (principal investigator; now at the University of Malaga), Phil D. Green and Roger K. Moore.
• University of York: Damian Murphy, Helena Daffern and Amelia Gully.
• University of Hull: James M. Gilbert and Lam A. Cheah.
• University of Leeds: Andy Bulpitt and Duane Carey.

4. Current status

Firstly, we have been able to record a database consisting of parallel recordings of MRI, PMA and speech data for 4 subjects. For this pilot study, we decided to record simple material, mainly consonant-vowel (CV) and vowel-consonant-vowel (VCV) syllables. At the same time, we have been working on improving the quality of speech generated by our articulatory synthesiser. In this regard, we described in [19] a dynamic 3D digital waveguide mesh (DWM) vocal tract model capable of movement to produce diphthongs. In [20], we further investigated estimating a physical model of the vocal tract (a 2D model in this case) from the speech waveform, rather than from magnetic resonance imaging data. As an advantage, this method provides a clear relationship between the model and the size and shape of the vocal tract, offering considerable flexibility in terms of speech characteristics such as age and gender. Finally, we have also been working on optimizing the appearance and usability of the PMA system.

5. Acknowledgements

This work was supported by the White Rose university consortium through a Collaboration Fund grant. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

6. References

[1] A. Byrne, M. Walsh, M. Farrelly, and K. O'Driscoll, "Depression following laryngectomy. A pilot study," Brit J Psychiat., vol. 163, no. 2, pp. 173–176, 1993.
[2] D. S. A. Braz, M. M. Ribas, R. A. Dedivitis, I. N. Nishimoto, and A. P. B. Barros, "Quality of life and depression in patients undergoing total and partial laryngectomy," Clinics, vol. 60, no. 2, pp. 135–142, 2005.
[3] H. Danker, D. Wollbrück, S. Singer, M. Fuchs, E. Brähler, and A. Meyer, "Social withdrawal after laryngectomy," Eur Arch Oto-Rhino-L, vol. 267, no. 4, pp. 593–600, 2010.
[4] S. R. Ell, A. J. Mitchell, and A. J. Parker, "Microbial colonization of the Groningen speaking valve and its relationship to valve failure," Clin Otolaryngol Allied Sci., vol. 20, no. 6, pp. 555–556, 1995.
[5] S. R. Ell, "Candida: the cancer of silastic," J Laryngol Otol, vol. 110, no. 03, pp. 240–242, 1996.
[6] J. M. Heaton and A. J. Parker, "Indwelling tracheo-oesophageal voice prostheses post-laryngectomy in Sheffield, UK: a 6-year review," Acta Otolaryngol., vol. 114, no. 6, pp. 675–678, 1994.
[7] M. Fried-Oken, L. Fox, M. T. Rau, J. Tullman, G. Baker, M. Hindal, N. Wile, and J.-S. Lou, "Purposes of AAC device use for persons with ALS as reported by caregivers," Augment Altern Commun., vol. 22, no. 3, pp. 209–221, 2006.
[8] J. A. Gonzalez, L. A. Cheah, J. Bai, S. R. Ell, J. M. Gilbert, R. K. Moore, and P. D. Green, "Analysis of phonetic similarity in a silent speech interface based on permanent magnetic articulography," in Proc. Interspeech, 2014, pp. 1018–1022.
[9] J. A. Gonzalez, P. D. Green, R. K. Moore, L. A. Cheah, and J. M. Gilbert, "A non-parametric articulatory-to-acoustic conversion system for silent speech using shared Gaussian process dynamical models," in UK Speech, 2015, p. 11.
[10] J. A. Gonzalez, L. A. Cheah, J. M. Gilbert, J. Bai, S. R. Ell, P. D. Green, and R. K. Moore, "A silent speech system based on permanent magnet articulography and direct synthesis," Comput Speech Lang., vol. 39, pp. 67–87, 2016.
[11] J. A. Gonzalez, L. A. Cheah, A. M. Gomez, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Direct speech reconstruction from articulatory sensor data by machine learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2362–2374, 2017.
[12] J. A. Gonzalez Lopez, L. A. Cheah, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Evaluation of a silent speech interface based on magnetic sensing and deep learning for a phonetically rich vocabulary," in Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech. ISCA, 2017, pp. 3986–3990.
[13] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[14] M. J. Fagan, S. R. Ell, J. M. Gilbert, E. Sarrazin, and P. M. Chapman, "Development of a (silent) speech recognition system for patients following laryngectomy," Med Eng Phys., vol. 30, no. 4, pp. 419–425, 2008.
[15] J. M. Gilbert, S. I. Rybchenko, R. Hofe, S. R. Ell, M. J. Fagan, R. K. Moore, and P. Green, "Isolated word recognition of silent speech using magnetic implants and sensors," Med Eng Phys., vol. 32, no. 10, pp. 1189–1197, 2010.
[16] L. A. Cheah, J. Bai, J. A. Gonzalez, S. R. Ell, J. M. Gilbert, R. K. Moore, and P. D. Green, "A user-centric design of permanent magnetic articulography based assistive speech technology," in Proc. BioSignals, 2015, pp. 109–116.

[17] L. A. Cheah, J. Bai, J. A. Gonzalez, J. M. Gilbert, S. R. Ell,
P. D. Green, and R. K. Moore, “Preliminary evaluation of a silent
speech interface based on intra-oral magnetic sensing,” in Proc.
BioDevices, 2016, pp. 108–116.
[18] J. Mullen, D. M. Howard, and D. T. Murphy, “Waveguide physi-
cal modeling of vocal tract acoustics: flexible formant bandwidth
control from increased model dimensionality,” IEEE Trans. Audio
Speech Lang. Process., vol. 14, no. 3, pp. 964–971, 2006.
[19] A. J. Gully, H. Daffern, and D. T. Murphy, “Diphthong synthesis
using the dynamic 3d digital waveguide mesh,” IEEE/ACM Trans-
actions on Audio, Speech, and Language Processing, vol. 26,
no. 2, pp. 243–255, Feb 2018.
[20] A. J. Gully, T. Yoshimura, D. T. Murphy, K. Hashimoto,
Y. Nankaku, and K. Tokuda, “Articulatory text-to-speech
synthesis using the digital waveguide mesh driven by a deep
neural network,” in Proc. Interspeech, F. Lacerda, Ed. ISCA,
2017, pp. 234–238. [Online]. Available: http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0900.html

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

RESTORE Project: REpair, STOrage and REhabilitation of speech


Inma Hernáez1, Eva Navas1, Jose Antonio Municio2, Javier Gómez2

1 UPV/EHU
2 HUC-Biocruces
inma.hernaez@ehu.eus, eva.navas@ehu.eus
joseantonio.municiomartin@osakidetza.eus, jgomezsuarez@seorl.net

Abstract tems are designed to help the user to quickly build messages
to be spoken out loud (with a keyboard or more sophisticated
RESTORE is a project aimed to improve the quality of commu- input devices). Commercial AAC systems have usually a lim-
nication for people with difficulties producing speech, provid- ited choice of the synthetic voice: they are usually high quality
ing them with tools and alternative communication services. At voices sounding like a very healthy young person. Neverthe-
the same time, progress will be made at the research of tech- less, statistics show a reality in which a great majority of the
niques for restoration and rehabilitation of disordered speech. people affected are elderly people, whose real voice would not
The ultimate goal of the project is to offer new possibilities in match with the prosthetic one. Similarly, there is a lack of chil-
the rehabilitation and reintegration into society of patients with dren’s voices, and according to the data in Spain there are 32
speech pathologies, especially those laryngectomised, by de- 700 children between 6 and 15 years affected by this disability.
signing new intervention strategies aimed to favour their com- RESTORE is a project aimed to improve the quality of
munication with the environment and ultimately increase their communication for people with difficulties producing speech,
quality of life. providing them with tools and alternative communication ser-
Index Terms: alaryngeal voice, oesophageal speech, speaking vices. At the same time, progress will be made at the research
aids, voice rehabilitation, statistical parametric speech synthe- of techniques for restoration and rehabilitation of disordered
sis, voice bank speech. The following goals were proposed for the project:

1. Introduction 1. Implementation of a donors voice bank and creation of


According to the Spanish Statistical Institute (Instituto Nacional the service of personalised synthetic speech for people
de Estadı́stica) (data from 2012), there are more than 410 000 with oral disabilities in general and people who have
people with ’disability to produce spoken messages’ in Spain. been laryngectomised, in particular at Cruces University
This kind of disability, if severe, produces social isolation since Hospital.
the communication with the environment is seriously affected 2. Design and development of a Serious Game as a multi-
so as to hamper personal relationships and generate problems of media tool to support the speech rehabilitation process
integration at the workplace. The origin and displaying forms for people with voice disorders.
of the oral disabilities is varied. They might be caused by trau-
matic events such as stroke or surgery like Total Laryngectomy 3. Improvement of the intelligibility of alaryngeal voices
(TL) or also by degenerative diseases (such as ELA or Parkin- (voices from people who have been laryngectomised)
son) which deteriorate the functioning of the motor speech sys- and dysartric voices.
tem. This project is concerned with two specific groups of af-
fected persons: on the one hand, people with a laryngectomy The Project coordinates the research work of two groups,
and on the other hand, people with dysarthria suffering from one from the field of Signal Processing and Engineering and the
poor articulation of phonemes due to neurological injury of the other from the fields of Otorhinolaryngology and Speech Ther-
motor speech system. apy. The Voice Banking service proposed is a groundbreak-
In the Rehabilitation Service of the Cruces University Hos- ing service for the country, since there are only a few experi-
pital (the main public Hospital of the area of Bilbao), almost ences with the same aim in USA (VocalID1 , ModelTalker2 ) and
6000 speech therapy sessions are performed annually. Among Europe (SpeakUnique3 ) [3]. This kind of service offers per-
them, around 50 patients are treated after a total laryngec- sonalised synthetic speech that can be used in existing AAC
tomy. Even though nowadays laryngeal cancer is treated with devices. In our proposal the communication is achieved with
chemotherapy and radiotherapy to preserve the laryngeal func- a personalised voice that might be similar to the patients own
tion, in some cases total laryngectomy must be performed [1]. voice in cases of progressive diseases or laryngectomies. Be-
In 2014 laryngeal cancer had an incidence in Spain of 3182 per sides, the design and development of a voice training tool is
100 000 inhabitants with a mortality of 1.3% and a 5-year preva- pursued in order to incentive and facilitate the speech rehabili-
lence of 112 per 1000 inhabitants [2]. After total laryngectomy tation process, a real hard and complex task depending on the
patients must attend a number of speech therapy sessions to ac- specific pathology.
quire a new voice, generally called alaryngeal voice. In the In this paper we describe the main tasks of the project and
Cruces University Hospital all patients start the rehabilitation briefly sketch the main results up to date.
treatment 2 or 3 weeks after hospital discharge.
People affected by ELA are frequent users of Text to Speech 1 www.vocalid.co
Conversion systems, usually integrated in AAC systems (Alter- 2 www.modeltalker.com

native and Augmentative Communication). These AAC sys- 3 www.speakunique.org

166 10.21437/IberSPEECH.2018-34
2. Tools to support the speech therapist 3. Personalisation of the synthetic voice
Total Laryngectomy surgery completely removes the larynx of One of the goals of RESTORE project is to improve the al-
the patient while separating the airway from the mouth, nose ready existing ZureTTS voice bank web portal4 making it more
and oesophagus. Consequently, patients who undergo a TL can flexible and allowing to provide a personalised TTS service to
not produce speech sounds in a conventional manner because people with oral disabilities.
their vocal cords have been removed. The rehabilitation process
of a patient starts immediately upon confirmation of the surgery. 3.1. Voice bank and web portal
Through a pre-surgical interview an orientation framework is In the previous version of the voice bank web portal, each voice
offered both to the patient and his or her family where they will donor had to record 100 sentences to get his or her personalised
receive information about: voice. This is not a problem for healthy people, but many pa-
tients are not able to record such a long corpus. Therefore the
• The anatomical and physiological changes that result
original 100 sentences corpus has been divided into three cor-
from surgery
pora of 33, 33 and 34 sentences. Each donor can choose how
• The way to communicate during the period immediately many corpora to record and the personalised voice will be pro-
following surgery duced with the available speech material. Besides, two new
languages have been incorporated to the portal: Gaelic and the
• The speech therapy sessions that will follow surgery Navarro-Lapurdian dialect of Basque.
The main objective of the rehabilitation after TL is to return 3.2. Recording protocol for pre-laryngectomised people
to the patients the possibility of oral communication for reinte-
gration into their social, work and personal life. After surgical To be able to generate a personalised voice for laryngectomees,
intervention, additionally to medical treatment and the impor- the ideal situation is to have recordings of the patient made prior
tance of tracheostomy protection and care of the tracheal can- to the surgery. If this is not the case, recordings made by close
nula the patient will be informed about the therapy process to family members can be used to produce a personalised synthetic
acquire alaryngeal voice. There are three possibilities for voice voice for the patient, as voices are usually similar among family
rehabilitation: members of the same gender [4] [5].
The first step to get these recordings is to introduce the
• Oesophageal speech recording procedure in the hospital protocol. This protocol has
established the following criteria to select patients that can take
• Tracheoesophageal speech part in the project:
• Use of an electrolarynx 1. Patients older than 18 years old with a TL programmed.
Oesophageal speech (ES) is preferred by medical doctors be- 2. Close family members with voices similar to the patient.
cause it does not require a voice prosthesis, but it is also most 3. Any patient older than 18 years old without any speech
effortful and difficult to acquire. Tracheoesophageal speech is pathology who comes to the otorhinolaryngology service
the most successful method and also produces the most under- of the Cruces University Hospital for any reason.
standable speech, but requires a voice prosthesis placed during Once these criteria have been fixed, the protocol has been sent
total laryngectomy or later in a secondary puncture. Finally, the to the Basque Ethics Committee for approval.
electrolarynx is an external vibrating handheld device which is
placed to the neck or the face. The vibrating sound is modulated 3.3. Personalisation for dysarthric voices
by the movements of the articulators to produce understandable
speech. It produces a robotic voice and it is sometimes used One person with ELA make use of the ZureTTS portal to obtain
also as a backup secondary method. a synthetic voice. However, the voice was already affected by
In Cruces University Hospital laryngectomised patients the disease and resulting synthetic voice also showed the same
start rehabilitation after 2 or 3 weeks after hospital discharge problems of the original voice. The main issue was the rhythm,
with the aim of learning to produce oesophageal speech. The which was very slow, with very long vowels and very frequent
patient attends around 50 rehabilitation sessions during a period long pauses, thus confirming other studies [6][7]. Also, some
of 4 months. If the final speaking method is tracheoesophageal, consonants were poorly realised. The slow prosody provoked
the average learning period is only 5 days. the malfunctioning of the automatic alignment algorithm thus
In RESTORE we have developed an interactive video contributing to the low quality of the synthetic voice. Never-
aimed at helping the patient during and after this rehabilitation theless, even when a new specific alignment method adapted to
period. This video considers the main difficulties faced by the the multiple pauses was applied, the synthetic voice was not of
laryngectomised patient and proposes exercises and advices to the desired quality, mainly because of the long phones. To over-
overcome them. Using a comic style representation of a food come it, prosody transplantation from a healthy voice was per-
market, the main character represents the patient itself, going formed with good results. This voice was offered to the user, in-
through the different market stalls, in each of which he or she stead of the one automatically provided by the system. We also
will practice a new rehabilitation exercise. Video recordings of tried to palliate the pronunciation problems applying the adap-
real sessions with a speech therapist are also included, as well tation techniques using only vowels described in [8][9], but the
as short interviews with laryngectomised persons that have suc- new synthetic voice, although with an improved pronunciation
ceed in the rehabilitation process and share their own feelings lost the personality of the speaker. We are also experimenting
and experiences. with model surgery on some phones [10], but the improvements
are subtle.
Clinical evaluation of the developed tool is currently taking
place with laryngectomised patients. 4 aholab.ehu.eus/zureTTS

167
at modifying oesophageal speech in such a way as to improve
the performance of a state of the art ASR system with these
modified signals as input. To achieve this goal we decided to
experiment with voice conversion (VC) algorithms. In this sec-
tion we summarize the efforts done within the project in this
direction.

4.1. ASLABI Database


To our knowledge, there are not publicly available databases
for oesophageal Spanish speech. There are studies pub-
lished concerning the quality and characteristics of ES, some
or them for Spanish, but they use their own recording data,
mostly developed for the purpose of the specific research
[14][15][16][17]. Additionally, most VC algorithms make use
of parallel databases. The availability of recorded voices for the
100 phonetically balanced sentences of ZureTTS for healthy
speech made us decide the recording of the same set of sen-
tences for oesophageal speakers. To make the recordings, we
contacted the local Association ASociación de LAringectomiza-
dos de BIzkaia who showed an enormous interest in the project
and collaborated with enthusiasm. A total of 32 persons went
Figure 1: Catalogue of voices in ZureTTS voice bank portal. to make the recordings to the acoustically isolated room in our
Faculty. The database also includes one Basque session.

3.4. Synthetic voices catalogue 4.2. Improving the intelligibility of the oesophageal voice
If the person with oral disabilities is not able to record the cor- We have tried several strategies to improve the intelligibility of
pus and there is no family member with a similar voice, he or the oesophageal speech. First, we tried a classical GMM based
she is still able to get a customised synthetic voice, by selecting voice conversion, using parallel data. This was followed by
the one of his or her preference among all the donated voices. DNN based approaches, using LSTM and more recently also
To make the selection of a personalised voice easier, a cat- including a WaveNet vocoder.
alogue with the available voices has been included in the WEB There are several specific problems that must be faced when
portal. A subjective evaluation where listeners qualified all processing and converting oesophageal signals:
the synthetic voices according to several attributes was devel- • Oesophageal signals lack the regular periodicity (intona-
oped. These attributes were: white - hoarse voice, sweet - dom- tion) typical of laryngeal signals. Although they have a
inant voice, warm - high-pitched voice, clean - nasal voice and certain periodicity at certain segments, the fundamental
monotonous - expressive voice. A bidimensional representa- frequency is very low (around 80Hz) and very irregu-
tion of the voices using the two most discriminative dimensions lar. Usual F0 calculation algorithms generate many er-
(sweet-dominant and white-hoarse) has been integrated in the rors and do not result in a realistic measure of the local
portal, as shown in Figure 1. periodic segments. Thus a specific F0 detection algo-
The voices can be easily modified in tone, rhythm, inten- rithm has been used.
sity and vocal tract length to get a synthetic voice that pleases
the user. The customised voices can be obtained from the web • The rhythm in general, the duration of syllables and
portal in standard format for Android OS, iOS and Windows, the duration of the phones inside them, vary signifi-
so they can be directly used by Augmentative and Alternative cantly in relation to healthy speech. Additionally, noises
Communication Devices. and pauses are inserted in between words and syllables
even inside the syllable. Therefore, the alignment of
healthy and oesophageal parallel sentences becomes a
4. Voice conversion tricky task.
In the production of oesophageal speech the pharyngo- • Many phones (mainly corresponding to consonants) in
oesophageal segment is used as a substitutive vibrating element the sounds stream are not present in the signal or they
for the vocal folds. Due to the nature of the intervention, the are realised in a completely different way. This fact
air used to create the vibration of the oesophagus can not come also complicates the parallel alignment task and gener-
from the lungs and the trachea as happens during normal speech ates many recognition errors.
production. Instead, the air is swallowed from the mouth and in-
• In general, a fundamental frequency curve must be esti-
troduced in the oesophagus, being then expelled in a controlled
mated for the converted spectrum (except for the case of
way while producing the vibration. These huge differences in
using WaveNet). Simple conversion of the source F0 val-
the production mechanisms lead to a diminution of naturalness
ues as usually done in the VC field is not feasible, which
and intelligibility [11][12][13]. As a consequence, the com-
opens a wide range of research possibilities.
munication with others is hindered. Moreover, these less intel-
ligible voices are an added problem for the automatic speech To evaluate the intelligibility of the resulting converted sig-
recognition algorithms that are becoming ubiquitous in the hu- nals we have used a Kaldi-based ASR system [18] trained with
man computer interaction technologies. One of the goals of this material described in [19]. This approach was selected because
project is the development of techniques and algorithms aimed it allows us to control the exact processing operations followed

168
during the recognition (such us the use of transformations like ders,” Folia Phoniatrica et Logopaedica, vol. 53, no. 1, pp. 1–18,
fMLLR) as well as basic aspects of the recognition process such 2001.
as the lexicon and the language model. This turned out to be [7] J. R. Green, Y. Yunusova, M. S. Kuruvilla, J. Wang, G. L.
very important with our reduced set of 100 phonetically rich Pattee, L. Synhorst, L. Zinman, and J. D. Berry, “Bulbar and
sentences, containing many very unlikely words, proper names speech motor assessment in als: Challenges and future direc-
etc. The procedure and results of the different ASR tests are tions,” Amyotrophic Lateral Sclerosis and Frontotemporal Degen-
eration, vol. 14, no. 7-8, pp. 494–500, 2013.
described in [20][21].
[8] A. Alonso, D. Erro, E. Navas, and I. Hernaez, “Speaker Adap-
tation using only Vocalic Segments via Frequency Warping,” in
5. Discussion and future work INTERSPEECH 2015, 2015, pp. 2764–2768.
This project represents an effort to promote modern technolog- [9] ——, “Study of the effect of reducing training data in speech
ical advances in the area of speech processing among the group synthesis adaptation based on frequency warping,” in Advances
of people with oral disabilities. In particular we have put spe- in Speech and Language Technologies for Iberian Languages,
A. Abad, A. Ortega, A. Teixeira, C. Garcı́a Mateo, C. D.
cial emphasis to introduce the benefits of the advances in speech Martı́nez Hinarejos, F. Perdigão, F. Batista, and N. Mamede, Eds.
synthesis in the rehabilitation process of the laryngectomised Cham: Springer International Publishing, 2016, pp. 3–13.
people.
[10] A. Pierard, D. Erro, I. Hernaez, E. Navas, and T. Dutoit, “Surgery
In relation to the intelligibility of oesophageal speech, we of speech synthesis models to overcome the scarcity of training
plan to improve the techniques to evaluate not only human intel- data,” vol. 10077 LNAI, pp. 73–83.
ligibility but also the effort employed by the listener (’listening
[11] B. Weinberg, “Acoustical properties of esophageal and tracheoe-
effort’). sophageal speech,” Laryngectomee rehabilitation, pp. 113–127,
At the time of writing this paper, only two laryngectomised 1986.
persons have been recorded for the voice bank previous to TL. [12] T. Most, Y. Tobin, and R. C. Mimran, “Acoustic and perceptual
It must be taken into account that the elapsed time between characteristics of esophageal and tracheoesophageal speech pro-
the communication of the need to do the TL surgery and the duction,” Journal of communication disorders, vol. 33, no. 2, pp.
surgery itself is very short (usually of a few days). So there is 165–181, 2000.
not the necessary time for the medical team to explain the pa- [13] T. Drugman, M. Rijckaert, C. Janssens, and M. Remacle, “Tra-
tient the future benefits of making the recordings. Additionally, cheoesophageal speech: A dedicated objective acoustic assess-
these patients have their voice already very harsh and speak with ment,” Computer Speech & Language, vol. 30, no. 1, pp. 16–31,
difficulties (it is usually the reason why they have contacted 2015.
the unit). This is why we consider the catalogue of synthetic [14] R. Ishaq and B. G. Zapirain, “Esophageal speech enhancement us-
voices, possibly including voices of the patient’s relatives as a ing modified voicing source,” in Signal Processing and Informa-
very good alternative. We hope that this voice bank pilot exper- tion Technology (ISSPIT), 2013 IEEE International Symposium
iment will continue growing and expanding the service to other on. IEEE, 2013, pp. 000 210–000 214.
hospitals after the end of the present project. [15] A. Mantilla, H. Pérez-Meana, D. Mata, C. Angeles, J. Alvarado,
and L. Cabrera, “Recognition of vowel segments in spanish
esophageal speech using hidden markov models,” in Computing,
6. Acknowledgements 2006. CIC’06. 15th International Conference on. IEEE, 2006,
pp. 115–120.
This project has been founded by the Spanish Ministry of Econ-
omy and Competitiveness with FEDER support (RESTORE [16] A. Mantilla, H. Perez-Meana, D. Mata, C. Angeles, J. Alvarado,
and L. Cabrera, “Analysis and recognition of voiced segments of
project, TEC2015-67163-C2-1-R and TEC2015-67163-C2-2-
esophageal speech,” in Electronics, Robotics and Automotive Me-
R). chanics Conference, 2006, vol. 2. IEEE, 2006, pp. 236–244.
[17] A. Mantilla-Caeiros, M. Nakano-Miyatake, and H. Perez-Meana,
7. References “A pattern recognition based esophageal speech enhancement sys-
tem,” Journal of applied research and technology, vol. 8, no. 1,
[1] J. P. Rodrigo, F. López, J. L. Llorente, C. Álvarez-Marcos, and
pp. 56–70, 2010.
C. E. Suarez, “Results of total laryngectomy as treatment for lo-
cally advanced laryngeal cancer in the organ-preservation era.” [18] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
Acta otorrinolaringologica espanola, vol. 66 3, pp. 132–8, 2015. N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
“The kaldi speech recognition toolkit,” in IEEE 2011 workshop
[2] S. Sociedad Española de Oncologı́a Médica, “Las cifras del cáncer
on automatic speech recognition and understanding, no. EPFL-
en España 2014,” 2014.
CONF-192584. IEEE Signal Processing Society, 2011.
[3] C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: De- [19] L. Serrano, D. Tavarez, I. Odriozola, I. Hernaez, and I. Saratxaga,
sign, collection and data analysis of a large regional accent speech “Aholab system for albayzin 2016 search-on-speech evaluation,”
database,” in 2013 International Conference Oriental COCOSDA in IberSPEECH, 2016, pp. 33–42.
held jointly with 2013 Conference on Asian Spoken Language Re-
search and Evaluation (O-COCOSDA/CASLRE), Nov 2013, pp. [20] L. Serrano, D. Tavarez, X. Sarasola, S. Raman, I. Saratxaga,
1–4. E. Navas, and I. Hernaez, “LSTM based voice conversion for la-
ryngectomees,” in Proc. IberSPEECH 2018, 2018.
[4] H. S. Feiser and F. Kleber, “Voice similarity among brothers: ev-
idence from a perception experiment,” in Proc. 21st Annual Con- [21] S. Raman, I. Hernaez, E. Navas, and L. Serrano, “Listening to la-
ference of the International Association for Forensic Phonetics ryngectomees: A study of intelligibility and self-reported listen-
and Acoustics (IAFPA), 2012. ing effort of spanish oesophageal speech,” in Proc. IberSPEECH
2018, 2018.
[5] I.-S. Ahn and M.-J. Bae, “On a similarity analysis to family
voice,” Advanced Science Letters, vol. 24, no. 1, pp. 744–746,
2018.
[6] W. G, J. JY, L. JS, K. RD, and K. JF, “Acoustic and intelligibility
characteristics of sentence production in neurogenic speech disor-

169
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Corpus for Cyberbullying Prevention


Asunción Moreno, Antonio Bonafonte, Igor Jauk, Laia Tarrés, Victor Pereira

Innovation and Technology Center of the Universitat Politècnica de Catalunya

asuncion.moreno@upc.edu, antonio.bonafonte@upc.com, ij.artium@gmail.com,


laiatarresb@gmail.com, vjp1994@msn.com

SafeToNet (STN) was the winner through the "Safeguarding


Abstract children online" project.
Cyberbullying is the use of digital media to harass a person or STN proposed a service to get ahead of a child harassing
group of people, through personal attacks, disclosure of another person via mobile and alert their parents. The service
confidential or false information, among other means. That is includes warning about sexually improper photographs, and
to say, it is considered cyberbullying, or cyber-aggression to identification of bullying.
everything that is done through electronic communication The Pilot Project began in July 2018. The Project partners are:
devices with the intended purpose of harming or attacking a d-LAB, SafeToNet (London, UK), Orange and the Innovation
person or a group. and Technology Center of the Universitat Politècnica de
In this paper we present a starting project to prevent Catalunya (CIT-UPC).
cyberbullying between kids and teenagers. The idea is to create The Project consists of several phases:
a prevention system. A system which is installed in the mobile
of a kid and, if a harassment is detected, some advice is given • Generation of a database in Spanish and Catalan for
to the child. In case of serious or repeated behavior the parents the training of cyberbullying prevention systems.
are alerted. • Adaptation of the System currently available in
The focus of this paper is to describe the characteristics of the English to Catalan and Spanish.
database to be used to train the system • Test in several Catalan schools with children between
12 and 14 years old chosen in a way that provides a
diverse socio-economic environment to cover
cultural varieties.
Index Terms: bullying, cyberbullying, databases, children
• Validation and improvement based on the system test
on Orange employees and their family members.
1. Introduction
This paper deals with the generation of a database in Spanish
The fight against bullying and cyberbullying among children and Catalan for the training of cyberbullying prevention
and adolescents is becoming a priority. A study by Save the systems.
Children published by EL PAÏS 1 reveals that 1 out of 10
children during ESO has suffered harassment or cyberbullying 2. Database specifications
in Spain.
Two databases have been created, one in Catalan and another
The most frequent types of harassment are direct or indirect
in Spanish spoken in Spain. Each database consists of 140,000
insults, spreading rumors, damage to property, physical
posts, of which 100,000 are labeled by two people and 40,000
damage, exclusion and threats.
by a single person.
According to the Ministry of Health, INE and OMS, in 2015,
The posts are labeled according to their content in 7 categories:
the number of suicides of children under 15 years was 12 and
aggression, anxiety, depression, distress, sexuality, use of
in the range of 15 to 29 years was 247. On the other hand, the
substances, and violence. For each category, 5 levels of concern
number of suicide attempts of children under 15 years old was
have been established. 1: post nothing worrying, 5: post
273, of whom the majority, 234, were girls.
extremely worrying.
d-LAB is a program of Mobile World Capital Barcelona whose
The annotation has been made through a platform available in
objective is to carry out Calls for Proposals that serve to
STN in which there is an automatic system to present the posts
promote responses to social problems through the
to be scored and an efficient annotation system based on
implementation of collaborative pilot projects between private
keyboard or mouse interchangeably.
companies and public entities.
Earlier this year, d-LAB launched a call entitled "Fighting
cyberbullying through mobile technologies" 2. The company

1https://elpais.com/elpais/2016/02/18/media/1455822566_899 2 https://d-lab.tech/challenge-3/
475.html

170
3. Recruiting of labelers and technical 5. Quality Assessment
support To calculate the consistency between the annotators we used
17 people were selected and hired for the labelling task. Most three indices: Cohen's kappa index, Accuracy, and Cronbach's
of the annotators are students. They have different alpha index.
backgrounds. Following recommendations from SafeToNet,
5.1. Cohen's kappa index
most of the labelers are psychology students. There are 4 male
and 13 female annotators, all of them between 20 and 30 years The Cohen's kappa index [1] between two annotators labeling
old. Table 1 shows the background summary of the annotators: the same data measures the consistency between the
annotations and compares them with the case of the annotation
CODE BACKGROUND being random. To calculate the consistency between two
F1 Philosophy student annotators, the common messages labeled by both of them are
F2 Teacher degree searched and the results of the annotation are compared for each
F3 Psychology student characteristic. The Cohen's kappa index is calculated as:
F4 Criminology student 𝑝𝑝0 − 𝑝𝑝𝑒𝑒
F5 Biomedicine student 𝜅𝜅 =
1 − 𝑝𝑝𝑒𝑒
F6 Psychology student
Where 𝑝𝑝0 is the relative agreement between raters (accuracy),
F7 Social Communication student
and 𝑝𝑝𝑒𝑒 Is the probability of chance agreement. Cohen's kappa
F8 Psychologist degree
F9 Biology and Neuroscience student
index was calculated on binarized categories: (category A: level
F10 Statistics and economy student of concern 1 or 2; category B: level of concern 3, 4, or 5)
F11 Children's education
F12 Psychology student 5.2. Cronbach's alpha coefficient
F13 Engineering student The Cronbach alpha coefficient has been calculated per each
M1 Engineering student category (y) as
M2 Psychology student
𝐾𝐾 ∑2𝑖𝑖=1 𝜎𝜎𝑦𝑦𝑦𝑦
2
M3 Physical activity and sports degree 𝛼𝛼𝑦𝑦 = �1 − �
𝐾𝐾 − 1 2
M4 Engineering student 𝜎𝜎𝑡𝑡
Table 1. Code and Background of the annotators Where K: number of items (K=2: 2 annotators)
2
𝜎𝜎𝑦𝑦𝑦𝑦 variance of each item (i.e vector of length nposts of category
Annotators were contracted on a part time basis (20 y of tagger i)
hours/week). They work from home. They have some 𝜎𝜎𝑡𝑡2 variance of the total (i.e. vector: t=y1+y2 of length nposts)
flexibility to manage their dedication to the project during each
week. However, their daily dedication to the project can never 6. Results
be higher than 5 hours. This dedication was set to avoid
tiredness and lack of concentration during the annotation
procedure. index aggression anxiety depression distress sexuality substance violence
Accuracy 0,90 0,98 0,96 0,84 0,97 0,99 0,97
Cohen's kappa 0,70 0,62 0,63 0,58 0,84 0,80 0,70
A training week (16th-20th July) was organized at UPC Cronbach's alpha 0,76 0,53 0,61 0,59 0,80 0,84 0,71
premises. A person from SafeToNet was in charge of the Table 2. Mean indexes for all the categories
training during the first three days. Annotators were instructed,
one category at a time, with several examples chosen from real The table shows a lower inter agreement in anxiety and distress.
posts. The last two days, students had a simulation of real work This is a consequence of no direct translation of distress within
using the annotation platform reaching the labeling of 400 posts Spanish and Catalan. The annotators had difficulties to
each day. Those posts were carefully chosen to show distinguish both categories.
simultaneously several categories and several concern levels.
Another re-training week was necessary at the mid part of the
project to consensus inter-annotator agreement. 7. Discussion
4. Compilation and selection of posts The data collection will be finished on October 15th so that the
first prototype will be ready one month later. The results of the
Posts were selected from several sources such as twitter, complete project will be presented in Mobile World Congress
teenagers' chats, blogs, forums, medical consultation web sites, 2019.
etc. Data was manually or automatically downloaded, cleaned,
formatted and selected. Posts were chosen to have a minimum 8. Acknowledgements
number of characters (without counting –not discarding-
@names and internet addresses) of 50 and a total maximum of We want to thank d-LAB and STN for the trust placed in us to
280. Spanish as spoken in Latin America posts were discarded carry out the project.
when possible. As expected, Catalan data was harder to collect.
The number of information in internet in Spanish is huge 9. References
compared against Catalan websites, blogs and consulters. In
[1] J. Cohen. A Coefficient of Agreement for Nominal Scales.
addition, Catalan speakers are bilingual, and it is very common
https://doi.org/10.1177/001316446002000104
to find Spanish posts in Catalan sites. Spanish and Catalan can [2] L.J. Cronbach Coefficient alpha and the internal structure of tests.
be naturally mixed even in the same post. Psychometrika, Sep 1951, vol 16, Issue 3, pp 297-334

171
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

EMPATHIC, Expressive, Advanced Virtual Coach to Improve Independent


Healthy-Life-Years of the Elderdy
M. I. Torres1, G. Chollet2, C. Montenegro3, J. Tenorio-Laranga4, O. Gordeeva5, A. Esposito6, N.
Glackin2, S. Schlögl7, O. Deroo5, B. Fernández-Ruanova4, D. Petrovska-Delacrétaz8, R. Santana3,
M. S. Korsnes10, F. Lindner9,D. Kyslitska11, M. Reiner12, R. Santana3, G. Cordasco6, A. Férnandez11
M. Aksnes10, R. Justo1
1Speech Interactive Group, Universidad del País Vasco UPV/EHU, Spain
2Intelligent Voice, United Kingdom
3Intelligent System Group, Universidad del País Vasco UPV/EHU, Spain
4Osatek, Spain
5Acapela Group, Belgium
6Universita degli Studi della Campania Luigi Vanvitelli, Italy
7Management Center Innsbruck, Austria
8Institut Mines Télécom, France
9Tunstall Healthcare, Sweden
10Oslo Universitetssykehus HF, Norway
11E-Seniors, France
12Technion, Israel

manes.torres@ehu.eus, coordinator@empathic-project.eu

Abstract machine learning. They will develop tools to detect


emotional cues, model users’ emotional status, translate
The EMPATHIC Research & Innovation project researchs, coaching plans into actions, user-adapt spoken dialogue
innovates, explores and validates new paradigms and and personalise talking agents
platforms, laying the foundation for future generations of • Telecare services, a senior association and a hospital
Personalised Virtual Coaches to assist elderly people living interested in testing and validating EMPATHIC
independently at and around their home. • Companies interested in providing and developing
The project uses remote non-intrusive technologies to technology for the project and commercialising the
extract physiological markers of emotional states in real-time products and derived services
for online adaptive responses of the coach, and advances
Through measurable end-user validation, to be performed in 3
holistic modelling of behavioural, computational, physical and
different countries (Spain, Norway and France) with 3 distinct
social aspects of a personalised expressive virtual coach. It
languages and cultures (plus English for R&D), the proposed
develops causal models of coach-user interactional exchanges
methods and solutions will ensure usefulness, reliability,
that engage elders in emotionally believable interactions
flexibility and robustness
keeping off loneliness, sustaining health status, enhancing
quality of life and simplifying access to future telecare Index Terms: speech recognition, natural language
services. understanding, spoken dialog systems, human-computer
interaction, natural language generation, text to speech
EMPATHIC proposes multidisciplinary research and
conversion, emotional features from speech and language,
development, involving:
emotional voice
• Geriatrician, Neuroscientist, Psychiatric, health and
social work specialists, knowledgeable in age related
conditions and the aims of a coaching program in
1. Objectives
maintaining independence, functional capacity, and ➢ OBJ1: Design a virtual coach, to engage the healthy-
monitoring physical, cognitive, mental and social well- senior user and reach pre-set benefits, measured through
being to implement the individual coaching goals project-defined metrics, to enhance well-being through
• Psychologist, Neuroscientist and Computer Science awareness of personal physical status, by improving diet
experts for detection and identification of the emotional and nutritional habits, by developing more physical
status of the user exercise and by social activity
• Engineers and Computer Scientists in speech and
language technologies, biometrics, image analysis, and

172
➢ OBJ2: Involve end-users and to reach a degree of fit to and continuous interaction with the end-users, the
their personalised needs and requirements, derived by the technologies to consider individual user profile,
coach, which will enhance their well-being including cultural facts and interaction history, the
➢ OBJ3: Supply the coach with Incorporate non-intrusive, current emotional status of the user and the coach
privacy-preserving, empathic, expressive interaction strategies at each decision of the dialogue manager,
technologies at each text generated by the Natural Language
Generator, at each inflexion of the TTS and at each
➢ OBJ4: Validate the coach efficiency and effectiveness movement of the personalised visual agent.
across 3 distinct European societies (Norway, Spain, and
France), with 200 to 250 subjects – who will be involved
from the start Technological Goals (Tg) and Actions:
➢ OBJ5: Evaluate/validate the effectiveness of EMPATHIC 1. Develop a simulated virtual coach and acquire an
designs against relevant user’s personalised acceptance initial corpus of dialogues. A set of annotated
and affordance criteria (such as the ability to adapt to dialogues will be designed and obtained through a
users’ underlying mood) assessed through the Key Wizard-of-Oz (WoZ) technology to fulfil the initial
Performance Indicators (KPI) listed in Section 1.1.2 end-users and data requirements of Scientific Goals
➢ OBJ6: Drive the developed methodology and tools to #2, #3 and #4.
industry acceptance and open-source access identifying 2. Integrate and provide a proof-of-concept of the
appropriate evaluation criteria to improve the technology running on different devices
“specification-capture-design-implementation” software 3. Validation through Field trials. EMPATHIC will test
engineering process of implementing socially-centred representative realistic use cases for different user
ICT products. profiles in three different countries

Scientific Goals (Sc) and Research Actions: 2. An example


1. Provide automatic personalised advice guidance Virtual Coach: So, Pablo, ¿Have you ever eaten consistently 2
(through the coach) having a direct impact in or 3 pieces of fruit?
empowering elder users into a wide of advanced
ICT keeping improving their quality of life and level User: When my wife was healthier, she used to take care of
their independent independency living status of the buying the fruit. Thus it was easier for me to eat it.
people as the age. EMPATHIC researches a) the Virtual Coach: And in any other time of your life?
identification and assessment of main cues related to User: Long time ago, when I lived alone, I used to take care of
physical, cognitive, mental and social well-being b) buying the fruit myself, and I ate it more frequently.
defining personalised, psychologically motivated
and acceptable coach plans and strategies c) Virtual Coach: What does this information suggest to you
translating professional coach behaviour into actions about your objective of eating 2 or 3 pieces of fruit a day?
of the Intelligent EMPATHIC-VC. User: Well… That it’s something that basically depends on
2. Identifying non-intrusive technologies to detect the me.
individual’s emotional and health status. of the Virtual Coach: So, can you see anything you could do to get
person through non-intrusive technologies. closer to your objective?
EMPATHIC uses emotional information from eyes, User: Uhmm… I should start thinking how I am going to
face, speech and language to deliver a hypothesis of organise when I am going to buy the fruit.
the user emotional status to assist decisions of the
EMPATHIC-VC. In this framework, the research
focus on the detection of sudden shifts in the user
emotional status or emotional changes during a
certain period of time: a) extraction of emotional
features b) data-driven approaches for multimodal
modelling combining emotional cues provided by
each source c) identification of significant changes
in individual behaviour.
3. Implement health-coach goals and actions through
an intelligent computational system, intelligent
coach and spoken dialogue system adapted to users’
intentions, emotions and context. EMPATHIC
researches a) Data-driven modelling of user and
tasks; b) Machine learning for understanding the
user; c) Learning policies and questionnaires to deal !
with coach goals; d) Statistical approaches for
dialogue management driven by both user and
Coach goals; e) Online learning for adaptation.
4. Provide the virtual coach with a natural, empathic,
personalised and expressive communication model
in a supportive manner to allow emotional bonds ! This project has received founds from the
that result in engaged and effective relationships. SC-H2020 Health, demographic change and well-being work
EMPATHIC researches and develop, through early program under Research and Innovation Action number 76987

173
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Advances on the Transcription of Historical Manuscripts based on


Multimodality, Interactivity and Crowdsourcing
Emilio Granell, Carlos-D. Martı́nez-Hinarejos, Verónica Romero

Pattern Recognition and Human Language Technology Research Center,


Universitat Politècnica de València,
Camı́ de Vera s/n, 46022, València, Spain
{egranell,cmartine,vromero}@dsic.upv.es

Abstract the contents, or on-line HTR [4] from touchscreen pen strokes.
In this context, a multimodal interactive assistive scenario [5],
The transcription of digitalised documents is useful to ease the where the assistive system and the paleographer cooperate to
digital access to their contents. Natural language technologies, generate the perfect transcription, would reduce the time and
such as Automatic Speech Recognition (ASR) for speech audio the human effort required for obtaining the final result.
signals and Handwritten Text Recognition (HTR) for text im-
The use of multimodal collaborative transcription appli-
ages, have become common tools for assisting transcribers, by
cations (crowdsourcing) [6], where collaborators can employ
providing a draft transcription from the digital document that
speech dictation of text lines as a transcription source from their
they may amend. This draft is useful when it presents an error
mobile devices, allows for a wider range of population where
rate low enough to make the amending process more comfort-
volunteers can be recruited, producing a powerful tool for mas-
able than a complete transcription from scratch.
sive transcription at a relatively low cost, since the supervision
The work described in this thesis is focused on the improve- effort of paleographers may be dramatically reduced.
ment of the transcription offered by an HTR system from three
In this thesis 1 [7], the reduction of the required human ef-
scenarios: multimodality, interactivity and crowdsourcing.
fort for obtaining the actual transcription of digitalised histori-
The image transcription can be obtained by dictating their
cal manuscripts is studied in the following scenarios:
textual contents to an ASR system. Besides, when both sources
of information (image and speech) are available, a multimodal • Multimodality: An initial draft transcription of a hand-
combination is possible, and this can be used to provide assis- written text line image can be obtained by using an off-
tive systems with additional sources of information. Moreover, line HTR system. An alternative for obtaining this draft
speech dictation can be used in a multimodal crowdsourcing transcription is to dictate the contents of the text line im-
platform, where collaborators may provide their speech by us- age to an ASR system. Furthermore, when both sources
ing mobile devices. (image and speech) are available, a multimodal combi-
Different solutions for each scenario were tested on two nation is possible, and an iterative process can be used
Spanish historical manuscripts, obtaining statistically signifi- in order to refine the draft transcription. Multimodal
cant improvements combination can be used in interactive transcription sys-
Index Terms: handwritten text recognition, automatic speech tems for combining different sources of information at
recognition, multimodality, combination, interactivity, crowd- the system input (such as off-line HTR and ASR), as
sourcing well as to incorporate the user feedback (on-line HTR).
At the same time, the multimodal and iterative combi-
1. Introduction and Motivation nation process can be used to improve the initial off-line
HTR draft transcription by using the ASR contribution
Transcription of digitised historical documents is an interesting
of different speakers in a collaborative scenario.
task for libraries in order to provide efficient information access
to the contents of these documents. The transcription process • Interactivity: The use of assistive technologies in the
is done by experts on ancient and historical handwriting called transcription process reduces the time and human effort
paleographers. required for obtaining the actual transcription. The assis-
In the latest years, the use of off-line Handwritten Text tive transcription system proposes a hypothesis, usually
Recognition (off-line HTR) systems [1] has allowed to speed derived from a recognition process of the handwritten
up the manual transcription process. HTR systems are com- text line image. Then, the paleographer reads it and pro-
posed of modules and employ models similar to those of clas- duces a feedback signal (first error correction, dictation,
sical speech recognition systems. However, state-of-the-art off- etc.), and the system uses it to provide an alternative hy-
line HTR systems [2] are far from being perfect, and human pothesis, starting a new cycle. This process is repeated
supervision is required to really produce a transcription of stan- until a perfect transcription is obtained. Multimodality
dard quality. The initial result of automatic recognition may can be incorporated to the assistive transcription system,
make the paleographer task easier, since they are able to per- in order to improve the human-computer interaction and
form corrections on a good draft transcription. to provide the system with additional sources of infor-
In addition to using off-line HTR systems from text line mation.
images, other modalities of natural language recognition can be
used to help paleographers on the transcription process, such as 1 Publicly available in the UPV institutional repository: http://
Automatic Speech Recognition (ASR) [3] from the dictation of hdl.handle.net/10251/86137

174 10.21437/IberSPEECH.2018-35
• Crowdsourcing: Open distributed collaboration to ob- to modify the general language model in order to make more
tain initial transcriptions is another option for improv- likely the decoded sentences; this modified language model is
ing the draft transcription to be amended by the pale- employed in the decoding for the other modality. This proce-
ographer. However, current transcription crowdsourcing dure can be used iteratively. This approach presents a few draw-
platforms are mainly limited to the use of non-mobile de- backs: there is not a single hypothesis given that each modality
vices, since the use of keyboards in mobile devices is not provides its own, and it is not known beforehand which one is
friendly enough for most users. An alternative, is the use more accurate, and the initial modality must be chosen arbitrar-
of speech dictation of handwritten text lines as a tran- ily.
scription source in a crowdsourcing platform where col- Chapter 3 of the thesis (Combining Handwriting and
laborators may provide their speech by using their own Speech) presents a new proposal based on the use of Confusion
mobile device. Multimodal combination allows the im- Networks for obtaining a single hypothesis from the combina-
provement of the initial handwritten text recognition hy- tion of the hypotheses obtained from an off-line HTR and an
pothesis by using the contribution of speech recognition ASR recognisers for decoding a text line image and the dicta-
from several speakers, providing as a final result a better tion of its contents. In the next chapter (Chapter 4), our multi-
draft transcription to be amended by a paleographer with modal proposal is tested and compared with other combination
less effort. In this framework, since collaborators are methods.
usually a scarce resource, their acquisition effort should The experiments were performed on two different Spanish
be optimised with respect to the quality of the draft tran- historical manuscripts. Cristo Salvador, which is a single writer
scriptions. book from the 19th century provided by Biblioteca Valenciana
The rest of this paper is structured as follows: Section 2 of- Digital, and Rodrigo [10], that corresponds to the digitisation
fers the main scientific and technological goals; Section 3 sum- of the book Historia de España del arçobispo Don Rodrigo,
marises the contents of this thesis; Section 4 contains the main which was written in old Castilian (Spanish) in 1545. Both cor-
conclusions; Section 5 draws the current work derived from this pora are publicly available for research purposes on the website
thesis and the future work lines; Finally, Section 6 presents the of the Pattern Recognition and Human Language Technology
achievements, and the scientific contributions. (PRHLT) research center 2 . Acoustics models were trained by
using the Spanish phonetic corpus Albayzin [11].
The transcriptions quality is assessed using the Word Error
2. Scientific and Technological Goals Rate (WER) value, which allows us to obtain a good estimation
The main scientific and technological goals of this thesis are the for the paleographer post-edition effort, and the lattices qual-
following: ity by the oracle WER, which represents the WER of the best
hypotheses contained in the word lattices (more details about
• To study the unimodal and multimodal combination corpora and evaluation metrics can be found in Chapter 2 of the
techniques, in order to propose a new multimodal combi- thesis).
nation technique for improving the transcription of digi-
talised historical manuscripts by using the speech dicta-
Table 1: Summary of the multimodal experimental results.
tion of their contents.
• To study the use of multimodal combination techniques Cristo Salvador Rodrigo
in a computer assisted system to improve the computer- Experiment
WER Oracle WER WER Oracle WER
human interaction and to accelerate the interactive tran-
Off-line HTR 32.9% 27.5% 39.3% 28.2%
scription process. ASR 43.3% 27.4% 62.9% 29.5%
• To develop a multimodal crowdsourcing platform based Multimodal 29.3% 13.4% 35.9% 14.8%
on the studied multimodal combination techniques to
ease and widespread the transcription of digitalised his-
torical manuscripts. Table 1 summarises the results of the multimodal experi-
ments. As it can be observed, the behaviour is similar for both
3. Thesis Overview corpora. The use of the ASR does not improve the WER of the
draft offered by the off-line HTR system, although the word-
The thesis document [7] is structured in five parts to facilitate graphs generated offer similar values of oracle WER for both
the reading experience. It starts with a first introductory part, modalities. However, combining both modalities by using our
followed by a part for each one of the three studied scenarios, proposal, not only the WER is improved, but the oracle WER
and it finishes with a part which presents the general conclu- value of the multimodal word-graph lattices is substantially re-
sions and future work lines. This section presents an overview duced. Given that the oracle WER value is related to the quality
of the contents of the three central parts (multimodality, inter- of the alternatives offered by our interactive and assistive sys-
activity, and crowdsourcing). tem, an outstanding effect on interactive transcription can be
expected.
3.1. Multimodality
3.2. Interactivity
The integration of knowledge given by off-line HTR and ASR
processes presents two limitations: both signals are asyn- The result of combining the knowledge given by off-line HTR
chronous and each modality uses different basic linguistic units and ASR processes may make the paleographer task easier,
(usually, characters for off-line HTR and phonemes for ASR). since they are able to correct on an improved draft transcrip-
An initial approach for solving this limitation was proposed in tion. However, given that paleographer revision is required to
previous works [8, 9], where the output of the recognition pro-
cess of one modality, in form of word-graph lattice, is used 2 https://prhlt.upv.es/

175
produce a transcription of standard quality, an interactive assis- the percentage of words corrected by means of the keyboard
tive scenario, where the automatic system and the paleographer (KBD). As it can be observed, the multimodal combination of
cooperate to generate the perfect transcription, would provide the on-line feedback with the input hypotheses reduces signif-
an additional reduction of the human effort and time required icantly the amount of words that are required to be corrected
for obtaining the final result. by using the keyboard, and most of the paleographer effort is
Chapter 5 of the thesis (Assistive Transcription) presents a concentrated in the more ergonomic touchscreen feedback.
multimodal interactive transcription system where the paleogra-
pher feedback is provided by means of touchscreen pen strokes, 3.3. Crowdsourcing
traditional keyboard, and mouse operations. The combination
of the different sources of information is based on the use of As an alternative to the keyboard, volunteers could employ
Confusion Networks derived from the decoding output of three voice as input for transcription. Nearly all mobile devices pro-
recognition systems: two HTR systems (off-line and on-line), vide this modality, which widens the range of population and
and an ASR system. Off-line HTR and ASR are used to derive situations where collaboration can be performed. The main
(by themselves or by combining their recognition results) the drawback is that the audio transcription, usually obtained by
initial hypothesis, and on-line HTR is used to provide feedback. ASR systems [3], presents an ambiguity not present in typed
In the next chapter (Chapter 6 of the thesis), our multimodal and input. Even the state-of-the-art techniques [12], although more
interactive proposals are tested. accurate than a few years ago, produce a considerable amount
In this case, the interactive performance is given by Word of errors in the recognition process, which makes it necessary
Stroke Ratio (WSR), the definition of which makes it compara- to obtain a balance between the amount of collaborations and
ble with the WER. The relative difference between them gives the quality they provide.
us the effort reduction (EFR), which is an estimation of the tran- In any case, the need for final supervision by a paleographer
scription effort reduction that can be achieved by using the in- enables the possibility that, although not perfect, voice inputs
teractive system (see Chapter 2 of the thesis for more details). combined with off-line HTR provide an initial draft transcrip-
Table 2: Summary of the multimodal interactivity experimental results.

                    Cristo Salvador       Rodrigo
  Experiment        WSR      EFR          WSR      EFR
  Off-line HTR      30.2%    8.2%         36.2%    7.9%
  ASR               35.1%    −6.7%        47.2%    −20.1%
  Multimodal        14.1%    57.1%        27.0%    31.3%

Table 2 summarises the results of the assistive and interactive experiments. As can be observed, the estimated interactive human effort (WSR) required for obtaining the perfect transcription from the off-line HTR decoding represents about 8% of relative effort reduction (EFR) over the off-line HTR WER for both corpora (see Table 1). However, in the case of ASR no effort reduction can be considered. Regarding multimodality, as expected, the use of the proposed multimodal approach allows the interactive system to achieve more than 30% of relative effort reduction over the off-line HTR WER for both corpora.

Table 3: Summary of the multimodal feedback and interactivity experimental results for Cristo Salvador.

                             WSR
  Experiment     Deletions   TS       KBD     Global     EFR
  Off-line HTR   5.5%        26.0%    6.7%    32.7%      0.6%
  ASR            5.1%        31.6%    3.5%    35.1%      −6.7%
  Multimodal     1.9%        10.7%    1.3%    12.0%      63.5%

In Table 3, a summary of the multimodal feedback (i.e. on-line handwriting feedback) and interactivity experimental results for Cristo Salvador is presented. In this case, the WSR is calculated under the assumptions that the deletion of words has no cost, and that the cost of keyboard-correcting an erroneous on-line feedback word is similar to another on-line HTR interaction. Therefore, the WSR corresponds to the percentage of words written with the on-line HTR feedback (TS) and the percentage of words corrected by means of the keyboard (KBD). As can be observed, the multimodal combination of the on-line feedback with the input hypotheses significantly reduces the amount of words that need to be corrected by using the keyboard, and most of the paleographer effort is concentrated in the more ergonomic touchscreen feedback.

3.3. Crowdsourcing

As an alternative to the keyboard, volunteers could employ voice as input for transcription. Nearly all mobile devices provide this modality, which widens the range of population and situations where collaboration can be performed. The main drawback is that the audio transcription, usually obtained by ASR systems [3], presents an ambiguity not present in typed input. Even the state-of-the-art techniques [12], although more accurate than a few years ago, produce a considerable amount of errors in the recognition process, which makes it necessary to find a balance between the amount of collaborations and the quality they provide.

In any case, the need for final supervision by a paleographer enables the possibility that, although not perfect, voice inputs combined with off-line HTR provide an initial draft transcription more accurate than that given only by off-line HTR. This fact was confirmed by the statistically significant improvements obtained in the experiments performed for the previous parts of this thesis (multimodal and interactive transcription). Thus, the employment of speech collaborations will allow us to significantly reduce the final transcription effort.

Chapter 7 of the thesis (Collective Collaboration) explores how a crowdsourcing framework that allows for the acquisition of text line dictations could decrease the transcription effort. The framework is based on the use of multimodal recognition, both employing and combining off-line HTR and ASR results, to improve the final transcription that is going to be offered to the paleographer. The multimodal recognition approach is based on language model interpolation and Confusion Network combination techniques. The crowdsourcing platform was implemented using a client-server architecture: the client is a mobile application [13] that allows speech acquisition, and the server performs the recognition and combination operations. In the next chapter (Chapter 8), our multimodal crowdsourcing proposal is tested in a supervised and in an unsupervised mode on the Rodrigo [10] corpus.

Figure 1: Results of the speaker ordering in the supervised crowdsourcing experiments. Best, worst and the median of 11 different random orders. (Plot: Word Error Rate, from 20% to 45%, versus processed collaborators, from the HTR baseline to the 6th collaborator, for the worst, median and best speaker orders.)

In the supervised experiments, the collaborators were randomly sorted 11 times, giving 11 different order lists. Figure 1
shows the evolution, from the initial off-line HTR baseline until the processing of the speech of the last collaborator, for the lists that obtained the worst, the median and the best final results. As can be observed, the worst and the best final results do not present any statistically significant differences. These results show that, in the best case, only two speakers are needed to obtain significant improvements, while in the worst case at least four speakers are needed.

Figure 2: Baseline values and the evolution of the system and ASR outputs for the whole unsupervised collaborations. The horizontal lines represent the corresponding average ASR WER. (Plot: Word Error Rate, from 20% to 100%, versus processed collaborators, from the HTR baseline to the 25th collaborator, for the ASR baseline, the crowdsourcing system output and the crowdsourcing ASR output.)

From the unsupervised experiments, Figure 2 draws the baseline values for both modalities and the evolution of the system and ASR outputs. As can be observed, the language model interpolation permits reducing the error level in the next speech decoding process [8], and the combination with the speech decoding results allows the system output to converge to a better hypothesis with fewer errors to correct [16]. Besides, the ASR performance is considerably improved, reducing the average WER baseline value. Finally, after processing the speech of the last collaborator, the system outputs presented 25.3% of WER, which represents a 35.6% relative, statistically significant improvement over the off-line HTR baseline, and an estimated time reduction for the paleographer revision of about 5 minutes per page [10].
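As a toy illustration of the linear language-model interpolation mentioned above (a sketch only; the thesis interpolates n-gram models inside the recognizer, which is not shown here), two word-probability tables can be mixed with a weight lambda before the next speech decoding pass:

```python
def interpolate_lm(p_base, p_adapt, lam=0.5):
    """Linear interpolation of two (conditional) word-probability tables.

    p_base / p_adapt map a word to its probability under the baseline model
    and under a model estimated from the current hypotheses, respectively.
    """
    vocab = set(p_base) | set(p_adapt)
    return {w: lam * p_base.get(w, 0.0) + (1.0 - lam) * p_adapt.get(w, 0.0)
            for w in vocab}


# Tiny example: the interpolated model keeps mass on words seen by either model.
base = {"el": 0.4, "rey": 0.3, "reino": 0.3}
adapt = {"el": 0.5, "rey": 0.5}
print(interpolate_lm(base, adapt, lam=0.7))
```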
4. Main Conclusions

Regarding multimodality, the benefits of the multimodal combination of the results obtained from off-line HTR with additional sources of information for the transcription of historical manuscripts have been confirmed.

With respect to interactivity, multimodality was applied to an interactive tool for transcribing historical handwritten documents. On the one hand, the multimodal hypotheses combination reduces the human time and workload required for transcribing historical books, due to the increased recognition accuracy and the better quality of the alternatives contained in the multimodal lattice. On the other hand, the use of multimodal combination improves the human-computer interaction (by using on-line touch-screen handwritten pen strokes), given that it allows errors in the interactive system hypothesis to be corrected by using the information provided by the on-line handwritten text introduced by the user.

Finally, the proposed multimodal crowdsourcing framework is based on the iterative refinement of the language model and hypotheses combination. This framework uses a client/server architecture in order to allow collaborators to decide when and where to collaborate. The mobile application used for speech acquisition is publicly available [13].

The experiments showed that, in this framework, the number of collaborators is more important than the order in which their speech is processed. Through this experimentation, it has been shown that the use of speech is a good additional source of information for improving the transcription of historical manuscripts, and that this modality allows people to collaborate in this task using their own mobile devices.

5. Current and Future Work

Currently, we are testing the performance of our assistive and interactive system with more robust modelling methods based on deep learning.

Regarding multimodality, we propose for future studies the use of whole sentences instead of lines of the handwritten text document, because it might make multimodality more natural from the point of view of the paleographer or speaker who has to dictate the contents of the handwritten text images to the ASR system.

In the case of interactive transcription, we have already tested the use of speech not only as an additional source of information about the handwritten text line image to transcribe in the interactive and assistive system, but also as an additional modality for human-computer interaction [14]. Furthermore, our future work also aims at taking advantage of the real samples that are produced while the system is used, in order to adapt the feedback natural language recognisers to the user.

Finally, the proposed multimodal crowdsourcing framework and the multimodal interactive transcription system have been integrated [15], and in the near future we are planning to test them with other datasets.

6. Scientific Contributions

The main contributions of this thesis can be summarised as: the evaluation of how to combine the decoding output of different natural language recognition systems, the integration of the combination of different signals in a computer assisted transcription system, and the development of a multimodal crowdsourcing platform for the transcription of historical manuscripts.

The scientific impact of this thesis was supported by eight publications at the time of the dissertation presentation. Concretely, the multimodality part was supported by two articles presented at two international conferences (ICDAR 2015 [16] and CAIP 2015 [17]); the interactivity part by two publications, one in an international conference and the other in a book chapter (DAS 2016 [18]; Handwriting, Nova 2017 [19]); and the crowdsourcing part by four publications, two in international conferences, one in a book chapter, and one in a JCR international journal (DocEng 2016 [20], IberSPEECH 2016 [21, 13], IEEE/ACM TASLP [22]).

Moreover, at the time of writing this paper, an additional publication in an international conference supports the interactivity part (DAS 2018 [14]), and another in a JCR international journal supports the crowdsourcing part (COIN [15]).

7. Acknowledgments

Work partially supported by: Percepción - TSI-020601-2012-50 (MINETUR), SmartWays - RTC-2014-1466-4 (MINECO), STraDA - TIN2012-37475-C02-01 (MINECO), and CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER).
8. References

[1] A. Fischer, "Handwriting Recognition in Historical Documents," Ph.D. dissertation, University of Bern, 2012.
[2] A. Manoj, P. Borate, P. Jain, V. Sanas, and R. Pashte, "A Survey on Offline Handwriting Recognition Systems," International Journal of Scientific Research in Science, Engineering and Technology, vol. 2, no. 2, pp. 253–257, 2016.
[3] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[4] R. Plamondon and S. N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.
[5] A. H. Toselli, E. Vidal, and F. Casacuberta, "Computer Assisted Transcription of Text Images," in Multimodal Interactive Pattern Recognition and Applications. Springer, 2011, ch. 3, pp. 61–98.
[6] A. Fornés, J. Lladós, J. Mas, J. M. Pujades, and A. Cabré, "A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts," in Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH '14), 2014, pp. 103–108.
[7] E. Granell, "Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing," Ph.D. dissertation, Universitat Politècnica de València, 2017, supervisors: C.-D. Martínez-Hinarejos and V. Romero. Available: http://hdl.handle.net/10251/86137
[8] V. Alabau, V. Romero, A. L. Lagarda, and C.-D. Martínez-Hinarejos, "A Multimodal Approach to Dictation of Handwritten Historical Documents," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (Interspeech), 2011, pp. 2245–2248.
[9] V. Alabau, C.-D. Martínez-Hinarejos, V. Romero, and A. L. Lagarda, "An iterative multimodal framework for the transcription of handwritten historical documents," Pattern Recognition Letters, vol. 35, pp. 195–203, 2014, Frontiers in Handwriting Processing.
[10] N. Serrano, F. Castro, and A. Juan, "The RODRIGO Database," in Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010, pp. 2709–2712. [Online]. Available: http://aclweb.org/anthology/L10-1330
[11] A. Moreno, D. Poch, A. Bonafonte, E. Lleida, J. Llisterri, J. B. Mariño, and C. Nadeu, "Albayzin speech database: Design of the phonetic corpus," in Proceedings of the 3rd European Conference on Speech Communication and Technology (Eurospeech'93), 1993, pp. 175–178.
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[13] E. Granell and C.-D. Martínez-Hinarejos, "Read4SpeechExperiments: A Tool for Speech Acquisition from Mobile Devices," in Proceedings of the IX Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop (IberSPEECH'2016), 2016, pp. 411–417.
[14] C.-D. Martínez-Hinarejos, E. Granell, and V. Romero, "Comparing different feedback modalities in assisted transcription of manuscripts," in Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS '18), 2018, pp. 115–120.
[15] E. Granell, V. Romero, and C.-D. Martínez-Hinarejos, "Multimodality, interactivity, and crowdsourcing for document transcription," Computational Intelligence, vol. 34, no. 2, pp. 398–419, 2018.
[16] E. Granell and C.-D. Martínez-Hinarejos, "Combining Handwriting and Speech Recognition for Transcribing Historical Handwritten Documents," in Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR'15), 2015, pp. 126–130.
[17] ——, "Multimodal Output Combination for Transcribing Historical Handwritten Documents," in Proceedings of the 16th International Conference on Computer Analysis of Images and Patterns (CAIP), 2015, pp. 246–260.
[18] E. Granell, V. Romero, and C.-D. Martínez-Hinarejos, "An Interactive Approach with Off-line and On-line Handwritten Text Recognition Combination for Transcribing Historical Documents," in Proceedings of the 12th IAPR International Workshop on Document Analysis Systems (DAS '16), 2016, pp. 269–274.
[19] ——, "Using Speech and Handwriting in an Interactive Approach for Transcribing Historical Documents," in Handwriting: Recognition, Development and Analysis. Nova Science, 2017.
[20] E. Granell and C.-D. Martínez-Hinarejos, "A Multimodal Crowdsourcing Framework for Transcribing Historical Handwritten Documents," in Proceedings of the 16th ACM Symposium on Document Engineering (DocEng), 2016, pp. 157–163.
[21] ——, "Collaborator Effort Optimisation in Multimodal Crowdsourcing for Transcribing Historical Manuscripts," in Advances in Speech and Language Technologies for Iberian Languages. Springer, 2016, pp. 234–244.
[22] ——, "Multimodal Crowdsourcing for Transcribing Handwritten Documents," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 2, pp. 409–419, 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Bottleneck and Embedding Representation of Speech for DNN-based Language and Speaker Recognition

Alicia Lozano-Diez, Joaquin Gonzalez-Rodriguez, Javier Gonzalez-Dominguez

Audias-UAM, Universidad Autonoma de Madrid, Madrid, Spain
alicia.lozano@uam.es

DOI: 10.21437/IberSPEECH.2018-36

Abstract

In this manuscript, we summarize the findings presented in Alicia Lozano Diez's Ph.D. Thesis, defended on the 22nd of June, 2018 at Universidad Autonoma de Madrid (Spain). In particular, this Ph.D. Thesis explores different approaches to the tasks of language and speaker recognition, focusing on systems where deep neural networks (DNNs) become part of traditional pipelines, replacing some stages or the whole system itself. First, we present a DNN as a classifier for the task of language recognition. Second, we analyze the use of DNNs for feature extraction at frame level, the so-called bottleneck features, for both language and speaker recognition. Finally, an utterance-level representation of the speech segments learned by the DNN (known as embedding) is described and presented for the task of language recognition. All these approaches provide alternatives to classical language and speaker recognition systems based on i-vectors (Total Variability modeling) over acoustic features (MFCCs, for instance). Moreover, they usually yield better results in terms of performance.

1. Introduction

Lately, automatic speech recognition (ASR) has experienced breathtaking progress, partially thanks to the introduction of deep neural networks (DNNs) into its approaches. This has spread across related areas such as language identification (LID) and speaker recognition (SID), where DNNs have noticeably improved performance.

In this manuscript we present a summary of the main findings of the Ph.D. Thesis defended by Alicia Lozano Díez, where we focused on different approaches for LID and SID based on DNNs, replacing some stages or the whole system. The complete dissertation can be found in [1].

First, end-to-end language recognition systems based on DNNs are analyzed, where the network is used as a classifier directly. We focus on two architectures: convolutional DNNs (CDNNs) and long short-term memory (LSTM) recurrent neural networks (RNNs), which are less demanding in terms of computational resources due to the reduced amount of free parameters in comparison with other DNNs. Thus, they provide an alternative to classical i-vectors, achieving comparable results, especially when dealing with short utterances.

Second, we explore one of the most prominent applications of DNNs in speech processing: as feature extractors. Here, DNNs are used to obtain a frame-by-frame representation of the speech signal, the so-called bottleneck feature (BNF) vector, which is learned directly by the network and is then used instead of traditional acoustic features as input in LID and SID systems based on i-vectors. This approach revolutionized these two fields, since such systems highly outperformed the state-of-the-art systems (i-vectors based on acoustic features). Our analysis focuses on how different configurations of the DNN used as BNF extractor, which is trained for ASR, influence the performance of the resulting features for LID and SID.

Finally, we propose a novel approach for LID, in which the DNN is used to extract a fixed-length utterance-level representation of speech segments known as embedding, comparable to the i-vector, and overcoming the disadvantage of the variable-length sequence of BNFs. This embedding-based approach has recently shown promising results for SID, and our proposed system was able to outperform a strong state-of-the-art reference i-vector system on the last challenging 2015 and 2017 NIST LREs. We explore different architectures and data augmentation techniques to improve the results of our system, obtaining comparable or better results than the well-established i-vectors.

(This work was funded by the PhD scholarship Ayudas para contratos predoctorales para la formación de doctores 2013 (BES-2013-064886), under the project CMC-V2: Caracterización, Modelado y Compensación de Variabilidad en la Señal de Voz (TEC2012-37585-C02-01), funded by Ministerio de Economía y Competitividad, Spain.)

2. DNN as a Classifier for LID

We call end-to-end DNN-based systems those that perform the target task directly from the input, without any other backend. For LID, they usually take some input features and are trained to classify each input frame into one of the target languages, outputting the vector of probabilities of a frame belonging to each language. Here, we use CDNNs and LSTMs, which have fewer parameters than other DNNs. Besides, LSTMs have shown to be a good model for time-dependent sequences [2]. We present experiments with both architectures on a balanced subset of 8 languages from LRE 2009, on the 3 s task. We compare our systems to an i-vector baseline, with a 1024-component UBM, 400-dimensional i-vectors and cosine scoring.

2.1. Convolutional DNN for Language Recognition

CDNNs usually consist of convolution and subsampling layers [3]. The former aims to perform feature extraction, and each of its units is connected to a local subset of units in the previous layer. Groups of these units share their parameters and form a feature map that extracts the same features from different locations in the input. The subsampling layer selects the maximum activation of each region of the input (max-pooling). The general scheme of our CDNN-LID system is depicted in Figure 1.

Figure 1: Representation of a CDNN architecture for LID. (Diagram: a 56x300 input patch; three convolution + subsampling hidden layers with feature maps 5@52x296 and 5@26x148, 15@22x144 and 15@11x72, and 20@1x62 and 20@1x1, using filter shapes 5x5, 5x5 and 11x11 and pool shapes 2x2, 2x2 and 1x62; a fully connected output layer with 8 units, Lang1 to Lang8.)
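To make the layer stack concrete, the following is a minimal Keras sketch in the spirit of the CDNN just described: a 56x300 MFCC-SDC input patch, three convolution/max-pooling blocks whose filter counts and shapes follow Figure 1 and the ConvNet 2 row of Table 1 below, and a softmax output over the 8 target languages. The activation function, optimizer settings and framework are our own assumptions, not the thesis implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_cdnn_lid(n_langs: int = 8) -> tf.keras.Model:
    """Small convolutional DNN that classifies a fixed-size feature patch into a language."""
    model = models.Sequential([
        layers.Input(shape=(56, 300, 1)),        # 56-dim MFCC-SDC x 300 frames (3 s segment)
        layers.Conv2D(5, (5, 5), activation="relu"),    # feature maps 5@52x296
        layers.MaxPooling2D((2, 2)),                    # 5@26x148
        layers.Conv2D(15, (5, 5), activation="relu"),   # 15@22x144
        layers.MaxPooling2D((2, 2)),                    # 15@11x72
        layers.Conv2D(20, (11, 11), activation="relu"), # 20@1x62
        layers.MaxPooling2D((1, 62)),                   # 20@1x1
        layers.Flatten(),
        layers.Dense(n_langs, activation="softmax"),    # one posterior per target language
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy")      # SGD on the negative log-likelihood
    return model


if __name__ == "__main__":
    build_cdnn_lid().summary()
```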
We use as input 56-dimensional MFCC-SDC [4] feature vectors and segments of 3 s. All the architectures have 3 hidden layers, and we vary the number of filters (feature maps) in each one, keeping their shape and max-pooling regions fixed. The output layer is a fully-connected layer with softmax activation, which outputs the probability that a test segment belongs to a certain language. The network is trained with stochastic gradient descent to minimize the negative log-likelihood. We conducted experiments to evaluate the influence of different amounts of training data, balanced per language. The configurations used are summarized in Table 1.

Table 1: Configuration of the CDNN models for LID.

       Configuration                  Development Data
  ID          # Filters/Layer         Train     Validation
  ConvNet 1   [20, 30, 50]            ~178h     ~31h
  ConvNet 2   [5, 15, 20]             ~178h     ~31h
  ConvNet 3   [5, 15, 20]             ~356h     ~63h
  ConvNet 4   [5, 15, 20]             ~534h     ~63h
  ConvNet 5   [10, 20, 30]            ~356h     ~63h
  ConvNet 6   [10, 20, 30]            ~534h     ~63h

Table 2: CDNN systems performance for LID on the 8 languages subset of NIST LRE 2009.

                                  Performance
  ID                     Size     EERavg (%)   Cavg
  i-vector               ~23M     16.94        0.1535
  ConvNet 1              ~198k    22.14        0.2406
  ConvNet 2              ~39k     25.90        0.2700
  ConvNet 3              ~39k     24.69        0.2616
  ConvNet 4              ~39k     23.48        0.2461
  ConvNet 5              ~78k     21.60        0.2282
  ConvNet 6              ~78k     21.11        0.2293
  ConvNet 6 + i-vector   -        15.96        0.1433

2.1.1. Experiments and Results

Results of this section are summarized in Table 2. Although our standalone CDNN-based systems are outperformed by the i-vector baseline, their size is smaller and they are trained with less data. We see that the systems benefit from larger training datasets (compare systems 2, 3 and 4) and bigger models (compare ConvNet 3 and 5, or 4 and 6). Moreover, improvements are obtained when fusing the best CDNN system with the i-vector baseline, which means that complementary information is extracted from the same features.

2.2. LSTM RNN for Language Identification

LSTMs are able to store information from previous inputs during long time periods [5, 6, 7], which makes them more suitable to model sequential data. They replace the hidden units of a classical DNN with memory blocks [8], which have input, output and forget gates: the input and output gates control, respectively, the flow of input activations into the memory cell and the flow of cell activations into the rest of the network; the forget gate controls the flow of information within the memory block, adaptively resetting the cell's memory.

We train our systems with sequences of 2 s of MFCC-SDCs with no stacking of acoustic frames. They consist of 1 or 2 LSTM hidden layers followed by a softmax output layer, which returns a probability for each input frame and language. For scoring, we average the output per frame, using just the last 10% of each utterance.
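The scoring rule just described (averaging the per-frame posteriors over only the last 10% of the utterance) is easy to express. A minimal numpy sketch, with hypothetical names and random posteriors standing in for real LSTM outputs, could look like this:

```python
import numpy as np


def utterance_scores(frame_posteriors: np.ndarray, tail_fraction: float = 0.1) -> np.ndarray:
    """Average frame-level language posteriors over the last part of the utterance.

    frame_posteriors: array of shape (n_frames, n_languages), one softmax output per frame.
    Returns a single score vector of shape (n_languages,).
    """
    n_frames = frame_posteriors.shape[0]
    tail = max(1, int(round(tail_fraction * n_frames)))   # keep at least one frame
    return frame_posteriors[-tail:].mean(axis=0)


# Example: 200 frames (2 s at a 10 ms shift), 8 target languages.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(8), size=200)
print(utterance_scores(posteriors).round(3))
```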
2.2.1. Experiments and Results

Table 3: LSTM systems performance for LID on the 8 languages subset of NIST LRE 2009.

       Architecture               Performance (%)
  ID                    Size      Acc.     EERavg    Impr.
  #1   i-vector         23M       65.02    16.94     -
  #2   lstm 1x512       1.2M      57.51    17.82     -
  #3   lstm 1x750       2.5M      63.39    15.61     ~7.85
  #4   lstm 1x1024      4.4M      65.63    15.10     ~10.86
  #5   lstm 2x256       850k      63.73    14.96     ~11.69
  #6   lstm 2x512       3.3M      70.90    12.51     ~26.15

Table 3 summarizes the performance of 5 LSTM systems in terms of EERavg and accuracy. We see that 4 out of the 5 proposed architectures for the LSTM system outperform the reference i-vector based system in EERavg, with 5 to 21 times fewer parameters. In the models with one layer, increasing the number of units improves the performance. However, deeper models (2 layers) yield better results.

3. Frame-by-frame DNN-based Representation: Bottleneck Features

Generally, LID and SID systems based on BNFs use a DNN with a bottleneck (BN) layer that is trained for ASR. Then, for each input frame, the time-dependent output of the BN layer is used as a new frame-by-frame representation to feed the i-vector model, instead of the classical cepstral features (see Figure 2).

Figure 2: Scheme of cepstral vs. BN based LID/SID systems. (Diagram: the speech signal feeds either a cepstral-based UBM/i-vector system, using MFCC vectors, or a BN feature-based UBM/i-vector system, using bottleneck features extracted by a DNN; both feed the language/speaker recognition backend whose performance is compared.)

Thus, BNFs provide a new frame-wise representation of an audio signal, learned directly by a DNN, containing information about the phonetic content, since the DNN is trained for ASR. The BN layer of this DNN is relatively small with respect to the rest and aims to compress the information learned by the previous layers [9].
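In frameworks such as Keras, extracting such a representation amounts to cutting the ASR-trained network at its bottleneck layer. The sketch below is only illustrative: the layer sizes follow Section 3.1 (1500-unit hidden layers, an 80-unit bottleneck in the third of four hidden layers, 3083 triphone-state outputs), the 620-dimensional input is our reading of "20 MFCCs with a context of 31 frames", and the layer name, activations and framework are assumptions rather than the thesis code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Toy ASR-style DNN with a narrow bottleneck layer in the third hidden position.
inputs = layers.Input(shape=(620,))                         # 20 MFCCs x 31-frame context
x = layers.Dense(1500, activation="sigmoid")(inputs)
x = layers.Dense(1500, activation="sigmoid")(x)
bottleneck = layers.Dense(80, name="bottleneck")(x)         # linear 80-unit bottleneck
x = layers.Dense(1500, activation="sigmoid")(bottleneck)
outputs = layers.Dense(3083, activation="softmax")(x)       # triphone-state posteriors
asr_dnn = models.Model(inputs, outputs)

# After (hypothetically) training asr_dnn for ASR, reuse the trunk as a feature extractor:
bnf_extractor = models.Model(inputs, bottleneck)
frames = tf.random.normal((100, 620))                       # 100 stacked-context frames
print(bnf_extractor(frames).shape)                          # (100, 80) BNFs for the i-vector model
```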
Table 4: Results of BNFs for LID (development set of NIST LRE 2015), varying the number of layers of the DNN.

  Number of        DNN               EERavg (in %)
  Hidden Layers    Frame Acc.     30s     10s     3s
  3                47.82          5.52    9.04    14.34
  4                49.55          4.33    7.81    13.76
  5                50.46          5.22    8.57    14.15

Table 5: Results of BNFs for LID (development set of NIST LRE 2015) with different positions for the BN layer.

  Position of      DNN               EERavg (in %)
  BN Layer         Frame Accuracy 30s     10s     3s
  First            49.17          9.37    12.24   16.59
  Second           49.46          6.27    9.55    14.58
  Third            49.55          4.33    7.81    13.76
  Fourth           48.05          4.64    8.00    14.17

3.1. Analysis of Bottleneck Features for LID

Here, we analyze how the topology of the DNN trained for ASR influences the performance of the resulting BNFs for LID on the NIST LRE 2015 development dataset. We use a feedforward DNN with an input layer, three to five hidden layers, and the output layer. To feed the network, we use 20 MFCCs preprocessed with a context of 31 frames. The hidden layers are composed of 1500 units, and the BN layer of 80. The softmax output layer provides the probability of each input corresponding to a given phoneme state (3083 triphone states are used). We use stochastic gradient descent to optimize the cross-entropy.

3.1.1. Experiments and Results

First, we vary the number of layers in the DNN from 3 to 5 (Table 4). Although the 5-layer configuration gives better performance in terms of frame accuracy, it is the architecture with 4 hidden layers that reaches the lowest EERavg for LID. Therefore, the discriminative task (ASR) is easier for the DNN when the classifier is more complex (5-layer DNN) and, thus, the frame accuracy improves. However, that network is not being forced to focus on obtaining a compact representation of the signal, which is what is then used for LID.

Then, keeping the number of layers fixed to 4, we explore how the LID system performs depending on the position that the BN layer occupies in the DNN, which corresponds to different levels of extracted information, closer to or further from the phonetic information (output layer). Results can be seen in Table 5. The closer the BN layer is to the input layer, the noisier the resulting representation would be, which might explain the drop in performance for the first and second layers with respect to the results of the last two layers. The best performance in terms of EERavg for LID is obtained when the bottleneck layer is located in the third layer, but that result is very close to the one obtained with the BN in the fourth layer. Performance of the DNN also drops when the BN layer moves from the third to the fourth layer. In this topology, the BN layer in the fourth position is connected directly to the output layer, resulting in a weight matrix that connects a small layer with just 80 hidden units to the output layer, of size 3083. These weights might be difficult to learn, which may explain this drop in performance of the DNN.

3.2. Analysis of Bottleneck Features for SID

We explore whether DNNs that are suboptimal for ASR can provide better BNFs for SID. We present here experiments with different features to feed the DNN, either optimized for ASR ("ASR feat.") or for SID ("MFCC"). The ASR-optimized features [10] are composed of 24 Mel-filter bank log outputs concatenated with 13 fundamental frequency (F0) features, with utterance mean normalization, which is what we used as default for ASR [11]. The SID-optimized features are the classical 20 MFCCs used for SID, either adding the derivatives or not, and normalized with short-term cepstral mean and variance normalization (ST-MVN). We evaluate the systems on the NIST SRE 2010, condition 5, female task [12].

Table 6: Results of BNFs for SID on the NIST SRE 2010.

  Features       Norm.        Phone Acc (%)   EER (%)
  ASR feat.      Utt. CMN     49.8            2.51
  MFCC Δ+ΔΔ      ST-CMVN      49.6            1.99
  MFCC 20dim     ST-CMVN      45.57           1.67

3.2.1. Experiments and Results

The aspect analyzed in this section is the DNN input features, which are either optimized for ASR or for SID ("ASR feat." vs. "MFCC"). The results of these experiments are summarized in Table 6.

We see that the ASR features (with per-utterance mean normalization) yield better performance in terms of phone accuracy than the MFCCs, since they are expected to be optimized for ASR. However, BNFs obtained from these DNNs do not seem to be as discriminative as the ones obtained with DNNs trained using MFCCs optimized for SID. Moreover, adding first and second derivatives to the MFCCs provides better phone accuracy but results in worse SID performance. We see, as for LID, that better ASR performance (in terms of phone accuracy) does not necessarily correspond to better SID performance.

4. Utterance Level Representation: DNN-based Embeddings

Despite the success of BNFs for SID [13, 14, 15] and LID [16, 17, 18, 19, 20, 21], the variable length of this frame-wise representation poses a challenge in subsequent modeling. The classical i-vector compacts the utterance representation into a fixed-length vector. However, the aim of i-vectors is to capture information about sources of variability in the training data, and this information is not necessarily relevant to the target task.

In this section we use embeddings for LID (after their success for SID [22]), which are a fixed-length representation of an utterance extracted from a sequence-summarizing DNN trained discriminatively for the target task (LID).

The DNN consists of a first part that works on a frame-by-frame basis on a given sequence of feature vectors, followed by a pooling layer, which in our case computes the mean and standard deviation over time of the activations of the previous layer. Finally, a number of hidden layers follow to capture the information contained in the input, providing a single vector of values per sequence (the embedding), which can be modeled by some other backend.

In particular, our DNN-embedding system takes stacked BNFs as input and uses bidirectional LSTM (BLSTM) layers for
the frame-level part. After pooling, two more fully connected layers are added, whose output values will serve as embeddings. Finally, the output layer consists of a softmax layer that provides a vector of language posterior probabilities for each utterance. An example of this architecture is depicted in Figure 3.

Figure 3: Architecture of the proposed embedding DNN for LID. (Diagram: an input sequence of stacked bottleneck features (SBN) is processed frame-by-frame by two BLSTM layers of 256 units; a pooling layer computes the mean and standard deviation of their activations; two fully connected layers provide the utterance-level embeddings Emb_a and Emb_b; a softmax output layer gives the posterior probabilities of the 14 target languages.)
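The mean-plus-standard-deviation pooling is the step that turns the variable-length frame-level output into a fixed-length vector. A minimal numpy sketch of that pooling alone (the BLSTM layers and the embedding layers that follow are not shown, and the 512-unit width assumes two 256-unit directions) might be:

```python
import numpy as np


def stats_pooling(frame_activations: np.ndarray) -> np.ndarray:
    """Mean + standard deviation pooling over time.

    frame_activations: (n_frames, n_units) activations of the last frame-level
    (e.g., BLSTM) layer. Returns a vector of size 2 * n_units whose length is
    independent of the utterance duration.
    """
    mean = frame_activations.mean(axis=0)
    std = frame_activations.std(axis=0)
    return np.concatenate([mean, std])


# Two utterances of different lengths map to vectors of the same dimensionality.
rng = np.random.default_rng(1)
short = rng.normal(size=(150, 512))
long = rng.normal(size=(700, 512))
print(stats_pooling(short).shape, stats_pooling(long).shape)   # (1024,) (1024,)
```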
4.1. Analysis on NIST LRE 2015

For these experiments, we use the architecture described above. We feed the DNN with 30-dimensional stacked BNFs, and the output softmax layer provides a 20-dimensional vector of language posterior probabilities for each utterance. As reference, we use an i-vector system that consists of a 2048-component UBM trained on the same BNFs and 600-dimensional i-vectors.

First, we experiment varying the size of the embedding layers (keeping the rest fixed to 256 units for each layer up to the pooling). We start with 512- and 300-dimensional embeddings (DNN 1) and halve each of them twice (DNN 2 and DNN 3, respectively). Results stacking both embeddings are shown in Table 7. We see a better performance of the DNN 2 embeddings, which are half the size of those of DNN 1. This suggests that the embeddings of larger size contain more detrimental information about the channel, since all DNNs reached the same performance on the training data.

Table 7: Comparison of i-vectors vs. embeddings of different size for LID on the NIST LRE 2015 evaluation dataset.

  System               Cavg x 100
  Reference i-vector   16.93
  DNN 1 (812)          20.04
  DNN 2 (406)          19.19
  DNN 3 (203)          19.30

Motivated by this, we explore further dimensionality reduction via PCA in Table 8. We see that with smaller embeddings we are able to obtain improvements even when reducing the dimensionality down to 25. The best results are achieved with the embeddings from DNN 2, whose dimensionality (406) is close to that of the typical i-vector (400 or 600). Besides, we achieved a performance of 17.44%, close to our i-vector baseline (16.93%), and score-level fusion of both gave us a Cavg of 15.69%.

Table 8: Results with PCA on top of embeddings for LID on the NIST LRE 2015 evaluation dataset (Cavg x 100 for different PCA dimensionalities).

  System                None     100      25
  DNN 1 (orig 812)      20.04    18.67    19.98
  DNN 2 (orig 406)      19.19    18.11    17.44
  DNN 3 (orig 203)      19.30    18.70    18.13
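The PCA step on top of the stacked embeddings can be sketched with scikit-learn. The example below uses random vectors in place of real embeddings, an arbitrary whitening choice, and omits the scoring backend that would follow; it is only meant to show where the 100- or 25-dimensional projections of Table 8 would come from.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
train_embeddings = rng.normal(size=(5000, 406))   # stacked Emb_a + Emb_b (DNN 2 sizes)
test_embeddings = rng.normal(size=(1000, 406))

pca = PCA(n_components=100, whiten=True)          # 100 or 25 components, as in Table 8
train_reduced = pca.fit_transform(train_embeddings)
test_reduced = pca.transform(test_embeddings)     # projected with the training transform
print(test_reduced.shape)                         # (1000, 100)
```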
4.2. Analysis on NIST LRE 2017

After developing the embedding system for the NIST LRE 2017, where it was included in the primary submission of the BUT team (a fusion of 3 i-vector systems and the embedding system), we performed a post-evaluation analysis.

First, we compared the architecture with the same configuration as DNN 1 in the previous section with a larger one, where the fully connected layer has 1500 units and both embeddings are 512-dimensional. With that larger model, performance improved from 22.18% to 19.86% (Cprimary).

Moreover, we extended the training dataset by performing data augmentation through the addition of noise, reverberation and tempo variations of the original audio files. Figure 4 shows the comparison of performance when training with up to 11 copies of the original data with different corruptions. In general, increasing the number of copies of the data yields improvements in performance. In particular, adding any noisy version of the data (combined or not with other corruptions) makes the system more robust against data mismatch, providing gains in performance. The only two cases in which data augmentation does not improve the system trained only on original data are the ones in which just reverberation or just tempo variations are applied.

Figure 4: Influence of data augmentation on DNN-embeddings for LID on the NIST LRE 2017 evaluation dataset.
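Of the corruptions listed above, the additive-noise branch is the easiest to sketch. The hypothetical helper below mixes a noise recording into clean audio at a chosen signal-to-noise ratio; reverberation (convolution with room impulse responses) and tempo perturbation (resampling) are deliberately omitted, and the signals in the example are synthetic.

```python
import numpy as np


def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` so that the result has the requested SNR in dB."""
    if len(noise) < len(speech):                        # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


# Example with synthetic waveforms; in practice both arrays come from audio files.
rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
babble = rng.normal(size=8000)
noisy = add_noise(clean, babble, snr_db=10.0)
print(noisy.shape)
```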
5. Conclusions

The main contributions of this Ph.D. Thesis are the following. First, the proposed end-to-end approaches for LID based on CDNNs and LSTMs, which provide an alternative to i-vectors with fewer parameters. Secondly, the systematic study of bottleneck-feature DNN-based LID systems and the analysis of this approach for SID, which show that the optimal DNN configuration for BNFs for LID and SID might differ from the most beneficial one for ASR, the task for which the DNN is trained. Finally, the novel approach based on embeddings for LID, in line with previous work in SID, which provides a fixed-length representation of utterances directly learned by the DNN for the target task, able to outperform the well-established i-vectors.

In terms of articles, from research directly derived from this Ph.D. Thesis, two journal articles and seven peer-reviewed international conference papers were published.
6. References

[1] A. Lozano-Diez, "Bottleneck and embedding representation of speech for DNN-based language and speaker recognition," Ph.D. dissertation, June 2018. [Online]. Available: https://repositorio.uam.es/handle/10486/684191
[2] T. Mikolov, S. Kombrink, L. Burget, J. H. Cernocky, and S. Khudanpur, "Extensions of recurrent neural network language model," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5528–5531.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Intelligent Signal Processing. IEEE Press, 2001, pp. 306–351.
[4] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, and J. R. Deller, "Approaches to Language Identification Using Gaussian Mixture Models and Shifted Delta Cepstral Features," in Proc. ICSLP, vol. 1, 2002, pp. 89–92.
[5] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, vol. 385. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-24797-2
[6] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[7] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," Journal of Machine Learning Research, vol. 3, pp. 115–143, Mar. 2003.
[8] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," arXiv preprint arXiv:1503.04069, 2015.
[9] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, vol. 4, April 2007, pp. IV-757–IV-760.
[10] F. Grézl, M. Karafiát, and L. Burget, "Investigation into bottleneck features for meeting speech recognition," in Proc. Interspeech 2009, no. 9. International Speech Communication Association, 2009, pp. 2947–2950.
[11] M. Karafiát, F. Grézl, K. Veselý, M. Hannemann, I. Szőke, and J. Černocký, "BUT 2014 Babel system: Analysis of adaptation in NN based systems," in Proceedings of Interspeech 2014. International Speech Communication Association, 2014, pp. 3002–3006.
[12] NIST, "The NIST year 2010 speaker recognition evaluation plan," www.itl.nist.gov/iad/mig/tests/sre/2010/NIST_SRE10_evalplan.r6.pdf, 2010.
[13] D. Garcia-Romero and A. McCree, "Insights into deep neural networks for speaker recognition," in INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, 2015, pp. 1141–1145.
[14] S. Yaman, J. Pelecanos, and R. Sarikaya, "Bottleneck features for speaker recognition," in Proceedings of Odyssey 2012. International Speech Communication Association, 2012.
[15] A. Lozano-Diez, A. Silnova, P. Matějka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, "Analysis and optimization of bottleneck features for speaker recognition," in Proceedings of Odyssey 2016. International Speech Communication Association, 2016.
[16] Y. Song, B. Jiang, Y. Bao, S. Wei, and L. R. Dai, "I-vector representation based on bottleneck features for language identification," Electronics Letters, vol. 49, no. 24, pp. 1569–1570, November 2013.
[17] A. Lozano-Diez, R. Zazo, D. T. Toledano, and J. Gonzalez-Rodriguez, "An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition," PLOS ONE, vol. 12, no. 8, pp. 1–22, 08 2017. [Online]. Available: https://doi.org/10.1371/journal.pone.0182580
[18] R. Fér, P. Matějka, F. Grézl, O. Plchot, and J. Černocký, "Multilingual bottleneck features for language recognition," in Proceedings of Interspeech 2015, vol. 2015, no. 09, 2015, pp. 389–393.
[19] P. Matějka, L. Zhang, T. Ng, H. S. Mallidi, O. Glembek, J. Ma, and B. Zhang, "Neural network bottleneck features for language identification," in Proceedings of Odyssey 2014. International Speech Communication Association, 2014.
[20] F. Richardson, D. Reynolds, and N. Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, Oct 2015.
[21] B. Jiang, Y. Song, S. Wei, J.-H. Liu, I. V. McLoughlin, and L.-R. Dai, "Deep bottleneck features for spoken language identification," PLOS ONE, vol. 9, no. 7, pp. 1–11, 07 2014. [Online]. Available: https://doi.org/10.1371/journal.pone.0100795
[22] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proceedings of Interspeech 2017, 2017.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Deep Learning for i-Vector Speaker and Language Recognition: A Ph.D. Thesis Overview

Omid Ghahabi

TALP Research Center, Department of Signal Theory and Communications
Universitat Politecnica de Catalunya - BarcelonaTech, Spain
omid.ghahabi@upc.edu

DOI: 10.21437/IberSPEECH.2018-37

Abstract

Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need speaker and/or phonetic labels for the background data, which are not easily accessible in practice. On the other hand, the lack of speaker-labeled background data creates a big performance gap, in speaker recognition, between the two well-known cosine and PLDA i-vector scoring techniques. This thesis tries to solve the problems above by using DL technology in different ways, without any need for speaker or phonetic labels. We have proposed an effective DL-based backend for i-vectors which fills 46% of this performance gap, in terms of minDCF, and 79% in combination with a PLDA system with automatically estimated labels. We have also developed an efficient alternative vector representation of speech, keeping the computational cost as low as possible and avoiding phonetic labels. The proposed vectors are referred to as GMM-RBM vectors. Experiments on the core test condition 5 of the NIST SRE 2010 show that results comparable with conventional i-vectors are achieved with a clearly lower computational load in the vector extraction process. Finally, for the LID application, we have proposed a DNN architecture to model effectively the i-vector space of languages in the car environment. It is shown that the proposed DNN architecture outperforms GMM-UBM and i-vector/LDA systems by 37% and 28%, respectively, for short signals of 2-3 sec.

Index Terms: Deep Learning, Speaker Recognition, i-Vector, Deep Neural Network, Deep Belief Network, Restricted Boltzmann Machine

1. Introduction

The successful use of Deep Learning (DL) in a large variety of signal processing applications, particularly in speech processing (e.g., [1, 2, 3]), has inspired the community to make use of DL techniques in speaker and language recognition as well. A possible use of DL techniques in speaker recognition is to combine them with the state-of-the-art i-vector [4]. However, the main problem is that the use of DL highly increases the computational cost of the i-vector extraction process, and phonetic and/or speaker labels are required for training, which are not always accessible (e.g., [5, 6, 7, 8, 9]).

(The Ph.D. thesis has been carried out under the supervision of Prof. Javier Hernando in the TALP Research Center, Department of Signal Theory and Communications, Universitat Politecnica de Catalunya - BarcelonaTech, Spain. The thesis was supported in part by the Spanish projects TEC2010-21040-C02-01, PCIN-2013-06, TEC2015-69266-P, and TEC2012-38939-C03-0 and the European project PCIN-2013-067. The author is now with EML European Media Laboratory GmbH, Heidelberg, Germany. The full thesis manuscript can be found online at http://hdl.handle.net/2117/118780.)

Another possible use of DL is to represent a speech signal with a single low-dimensional vector using a DL architecture, rather than the traditional i-vector algorithm. These vectors are often referred to as speaker embeddings (e.g., [10, 7, 11, 12, 13]). The need for speaker labels to train the network is one of the disadvantages of these techniques. Moreover, speaker embeddings extracted from hidden layer outputs are not so compatible with the Probabilistic Linear Discriminant Analysis (PLDA) backend [14, 15], as the posterior distributions of hidden layer outputs are usually not truly Gaussian.

The first objective of this thesis is to make use of deep architectures for backend i-vector classification in order to fill the performance gap between the cosine (unlabeled-based) and PLDA (labeled-based) scoring baseline systems given unlabeled background data. The second one is to develop an efficient framework for vector representation of speech, keeping the computational cost as low as possible and avoiding speaker and phonetic labels. The last main objective is to make use of deep architectures for backend i-vector classification for Language Identification (LID) in intelligent vehicles. In this scenario, LID systems are evaluated using words or short sentences recorded in cars in four languages: English, Spanish, German, and Finnish.

The three main objectives are summarized in Sections 2-4. Section 5 describes the experimental results. Section 6 lists the publications resulting from the Ph.D. thesis and Section 7 concludes the paper.

2. Deep Learning Backend for i-Vector Speaker Verification

We have proposed the use of DL as a backend in which a two-class hybrid Deep Belief Network (DBN)-Deep Neural Network (DNN) is trained for each target speaker to increase the discrimination between the target i-vector/s and the i-vectors of other speakers (non-targets/impostors) (Fig. 2). The proposed networks are initialized with speaker-specific parameters adapted from a global model, which is referred to as the Universal Deep Belief Network (UDBN). Then the cross-entropy between the class labels and the outputs is minimized using the back-propagation algorithm.

DNNs usually need a large number of input samples to be trained efficiently. In speaker recognition, target speakers can be enrolled with only one sample (single-session task) or multiple samples (multi-session task). In both cases, the number of target samples is very limited. A network trained with such limited data is highly likely to overfit. On the other hand, the
number of target and impostor samples will be highly unbalanced, i.e., one or a few target samples against thousands of impostor samples. Learning from such unbalanced data will result in DNNs biased towards the majority class.

Figure 1: Block diagram of the proposed DL-based backend on i-vectors for target speaker modeling. (Diagram: background i-vectors go through impostor selection and clustering; the cluster centroids and the replicated target i-vector/s are arranged into balanced minibatches; a Universal DBN trained on the background i-vectors is adapted to each target speaker and used to initialize a discriminatively trained DNN target model; at test time, an unknown i-vector is scored against the target model by a resemblance measurement in order to take the decision.)

Figure 2: Proposed deep learning architecture for training of each speaker model. (Diagram: the inputs are a mix of the target i-vector/s and the cluster centroids of the selected impostor i-vectors; the network is initialized with speaker-specific parameters adapted from the UDBN; the outputs are the posterior probabilities of the target and non-target classes.)

Fig. 1 shows the block diagram of the proposed approach. Two main contributions are proposed in this thesis to tackle the above problems. The balanced training block attempts to decrease the number of impostor samples and, on the contrary, to increase the number of target ones in a reasonable and effective way. The most informative impostor samples for the target speakers are first selected by the proposed impostor selection algorithm. Afterwards, the selected impostors are clustered and the cluster centroids are considered as the final impostor samples for each target speaker model. Impostor centroids and target samples are then divided equally into minibatches to provide balanced impostor and target data in each minibatch. On the other hand, the DBN adaptation block is proposed to compensate for the lack of input data. As DBN training does not need any labeled data, the whole set of background i-vectors is used to build a global model, which is referred to as the Universal DBN (UDBN). The parameters of the UDBN are then adapted to the balanced data obtained for each target speaker. At the end, given the target/impostor labels, the adapted DBN and the balanced data, a DNN is discriminatively trained for each target speaker. More details can be found in [16].
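A rough sketch of the data-balancing idea is given below. It is only an illustration under our own assumptions: the impostor-selection criterion is simplified to cosine similarity against the mean target i-vector, the clustering is plain k-means, and all sizes are stand-ins; the actual impostor selection algorithm and adaptation scheme of [16] are more elaborate and are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans


def balanced_minibatches(target_ivecs, background_ivecs, n_informative=500,
                         n_centroids=50, batch_size=20, seed=0):
    """Cluster the most target-like impostors and pair their centroids with
    replicated target i-vectors so that every minibatch is roughly balanced."""
    # 1) keep the impostors closest (cosine similarity) to the target model
    t = target_ivecs.mean(axis=0)
    sims = background_ivecs @ t / (np.linalg.norm(background_ivecs, axis=1)
                                   * np.linalg.norm(t) + 1e-12)
    informative = background_ivecs[np.argsort(sims)[-n_informative:]]
    # 2) summarize them by their cluster centroids
    centroids = KMeans(n_clusters=n_centroids, n_init=10,
                       random_state=seed).fit(informative).cluster_centers_
    # 3) replicate the few target samples and interleave them with the centroids
    reps = int(np.ceil(len(centroids) / len(target_ivecs)))
    targets = np.tile(target_ivecs, (reps, 1))[:len(centroids)]
    half = batch_size // 2
    for start in range(0, len(centroids), half):
        x = np.vstack([targets[start:start + half], centroids[start:start + half]])
        y = np.array([1] * len(targets[start:start + half])
                     + [0] * len(centroids[start:start + half]))
        yield x, y


# Toy usage with random 400-dimensional i-vectors.
rng = np.random.default_rng(4)
batches = balanced_minibatches(rng.normal(size=(5, 400)), rng.normal(size=(2000, 400)))
x, y = next(batches)
print(x.shape, int(y.sum()), int((y == 0).sum()))   # (20, 400) 10 10
```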
3. RBMs for Vector Representation of Speech

Recently, the advances in DL have improved the quality of i-vectors, but the DL techniques in use are computationally expensive and need phonetic labels for the background data. In this thesis, an alternative vector-based representation of speakers has been proposed, in a less computationally expensive manner and with no use of any phonetic or speaker labels.

RBMs are good candidates for this purpose because they have good representational power and they are unsupervised and computationally cheap. It is assumed in this work that the inputs of the RBM, i.e., the visible units, are GMM supervectors and the outputs, i.e., the hidden units, are the low-dimensional vectors we are looking for. The RBM is trained on the background GMM supervectors and will be referred to as the URBM. The role of the URBM is to learn the total session and speaker variability among the background supervectors. Different types of units and activation functions can be used for training the URBM, but we have proposed a variant of ReLU, referred to as Variable ReLU (VReLU), for this application. It will be shown in Section 5 that the proposed VReLU does not suffer from the problems of sigmoid and ReLU and works the best. After training the URBM, the visible-hidden connection weight matrix is used to transform unseen GMM supervectors to lower-dimensional vectors, which will be referred to as GMM-RBM vectors in this work.

In fact, the proposed VReLU is defined as follows and is compared with the ReLU function in Fig. 3:

  f(x) = x if x > τ, and f(x) = 0 if x ≤ τ, with τ ~ N(0, 1)   (1)

Given the GMM supervectors and the URBM parameters, the GMM-RBM vectors are extracted as follows:

  ω_r = W Σ_ubm^(−1/2) N(u)^(−1) F̃(u)   (2)

where Σ_ubm is the diagonal covariance matrix of the UBM, W is the connection weight matrix from the URBM, and N(u) and F̃(u) are the zeroth-order and centralized first-order Baum-Welch statistics, respectively.

As in the case of i-vectors, the resulting GMM-RBM vectors are mean normalized and whitened using the mean vector and the whitening matrix obtained on the background data.

The comparison of Equation 2 with that of the i-vector in Equation 3 clearly implies that GMM-RBM vector extraction needs much less computational load. More details can be found in [16].

  ω = (I + Tᵗ Σ^(−1) N(u) T)^(−1) Tᵗ Σ^(−1) F̃(u)   (3)
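To make the two formulas concrete, here is a small numpy sketch of the VReLU activation of Eq. (1) and of the GMM-RBM vector mapping of Eq. (2). The shapes, the orientation of W (up to transposition conventions) and the layout of the Baum-Welch statistics are our own assumptions, and the statistics are assumed to be precomputed.

```python
import numpy as np


def vrelu(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Variable ReLU: ReLU with a random threshold tau ~ N(0, 1),
    drawn per hidden unit and per input sample (Eq. 1)."""
    tau = rng.standard_normal(x.shape)
    return np.where(x > tau, x, 0.0)


def gmm_rbm_vector(W, sigma_ubm_diag, N_u, F_tilde_u):
    """GMM-RBM vector of Eq. (2): w_r = W Sigma_ubm^{-1/2} N(u)^{-1} F~(u).

    W:              URBM visible-hidden weights, shape (dim, C * d)
    sigma_ubm_diag: diagonal of the UBM covariances, length C * d
    N_u:            zeroth-order statistics (one count per Gaussian), length C
    F_tilde_u:      centralized first-order statistics, length C * d
    """
    C, d = len(N_u), len(F_tilde_u) // len(N_u)
    n_inv = np.repeat(1.0 / np.maximum(N_u, 1e-6), d)   # N(u)^{-1}, expanded per dimension
    return W @ (sigma_ubm_diag ** -0.5 * n_inv * F_tilde_u)


# Toy example: 8 Gaussians, 20-dimensional features, 400-dimensional output vectors.
rng = np.random.default_rng(5)
C, d, dim = 8, 20, 400
w_r = gmm_rbm_vector(rng.normal(size=(dim, C * d)),
                     np.abs(rng.normal(size=C * d)) + 0.1,
                     np.abs(rng.normal(size=C)) * 100,
                     rng.normal(size=C * d))
print(vrelu(rng.normal(size=(2, 5)), rng).shape, w_r.shape)   # (2, 5) (400,)
```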
Figure 3: Comparison of ReLU and the proposed VReLU. τ is randomly selected from a normal distribution with zero mean and unit variance for each hidden unit and each input sample. ((a) shows the ReLU function; (b) and (c) are two examples of VReLU with positive and negative τ.)

4. Deep Learning Backend for i-Vector Language Identification

Figure 4 shows the architecture of the DNNs we have proposed in this work. The inputs are i-vectors and the outputs are the language class posteriors. The softmax and sigmoid are used as the activation functions of the internal and the output layers, respectively. In order to Gaussianize the output posterior distributions, we have proposed to compute the output scores in Log Posterior Ratio (LPR) form as in [16].

Figure 4: Proposed DNN architecture used for i-vector language identification (L denotes the number of hidden layers and m = ⌈log2 nx⌉).

As the response time of the LID system is important in the car, the computational complexity of the classifier should also be taken into account. Therefore, we have proposed to choose the size of the first hidden layer as the lowest power of 2 greater than the input layer size. From the second hidden layer towards the output, the size of each layer is half of the previous layer. For example, the configuration of a 3-hidden-layer DNN will be 400-512-256-128-4, where 400 is the size of the input i-vectors and 4 is the number of language classes. It will be shown in Section 5 that, in this way, we can decrease the computational complexity to a great extent while keeping the classification accuracy.
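The sizing rule just described is simple to compute. A short sketch (a hypothetical helper, not code from the thesis) that reproduces the 400-512-256-128-4 example:

```python
import math


def dnn_layer_sizes(input_dim: int, n_hidden: int, n_classes: int) -> list:
    """First hidden layer: the smallest power of two that is at least the input
    size (512 for 400-dimensional i-vectors); each following hidden layer halves
    the previous one, down to the number of language classes at the output."""
    first = 2 ** math.ceil(math.log2(input_dim))
    hidden = [max(first // (2 ** i), n_classes) for i in range(n_hidden)]
    return [input_dim] + hidden + [n_classes]


print(dnn_layer_sizes(400, 3, 4))   # [400, 512, 256, 128, 4]
```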
Two forms of i-vectors are considered as inputs to DNNs, [2] PLDA (Estimated Labels) 3.85 0.300 3.46 0.284
[3] Proposed DNN-1L 5.13 0.327 4.61 0.320
raw i-vectors and session-compensated i-vectors. LDA and [4] Proposed DNN-3L 4.55 0.305 4.11 0.300
WCCN are two commonly used techniques for session vari- Fusion [2] & [4] 2.99 0.260 2.70 0.243
ability compensation among i-vectors. Although LDA performs Labeled Background Data
better than WCCN for the LID application when cosine scoring
[5] PLDA (Actual Labels) 2.23 0.226 2.01 0.207
is used, we will use only WCCN session-compensated i-vectors Fusion [2] & [5] 2.04 0.220 1.85 0.204
as the inputs to DNNs. This is because the number of the lan- Fusion [4] & [5] 2.13 0.221 2.00 0.196
guage classes is very few in this application and, therefore, the Fusion [2] & [4] & [5] 1.88 0.204 1.74 0.190
maximum number of meaningful eigenvectors will be also few
(number of classes minus one). We implemented different DNN
architectures with LDA-projected i-vectors as inputs but no gain of 36,572, 6530, and 9634 i-vectors, respectively. The number
was observed. The use of raw i-vectors is advantageous as no of target speaker models is 1306 and for each of them five i-
language-labeled background data is required. More details can vectors are available. Each target model will be scored against
be found in [16]. all the test i-vectors and, therefore, the total number of trials
will be 12,582,004. Three baseline systems are considered in
5. Experimental Results this work for evaluation: cosine, PLDA with actual labels, and
This section summarizes the main results obtained on the ex- PLDA with estimated labels. The size of hidden layers is set to
periments for each main contribution presented in sections 2-4. 400. Table 1 compares the performance of the proposed DNN
The full database provided in the National Institute of Stan- systems with other baseline systems in terms of minDCF and
dard and Technology (NIST) 2014 speaker recognition i-vector EER. The interesting point is that the combination of the DNN-
challenge [17] is used for the experiments in section 2. Rather 3L and PLDA with estimated labels in the score level improves
than speech signals, i-vectors are given directly by NIST in the results to a great extent. The resulting relative improve-
this challenge to train, test, and develop the speaker recogni- ment compared to cosine baseline system is 36% in terms of
tion systems. This enables system comparison more readily minDCF on the evaluation set. This improvement with no use
with consistency in the front-end and in the amount and type of background labels is considerable compared to 45% relative
of the background data [17]. Three sets of 600-dimensional improvement which can be obtained by PLDA with actual la-
i-vectors are provided: development, train, and test consisting bels.

186
Table 2: Performance comparison of proposed GMM-RBM vectors and conventional i-vectors on the evaluation set core test condition-common 5 of NIST SRE 2010. GMM-RBM vectors and i-vectors are of the same size of 400.

                                           cosine               PLDA
                                       EER (%)  minDCF    EER (%)  minDCF
  [1] i-Vector                          6.270   0.05450    4.096   0.04993
  [2] GMM-RBM Vector (Trained with ReLU)  6.638   0.06228    4.517   0.05085
  [3] GMM-RBM Vector (Trained with VReLU) 6.497   0.06099    3.907   0.05184
  Fusion [1] & [3]                      5.791   0.05238    3.814   0.04673

Table 3: Comparison of LID systems for short signals recorded in car. Performance values are reported based on LER (%).

  Duration of Test Signals (in sec)   t < 2   2 <= t < 3   t >= 3     All
  Number of Samples                   2,472      2,355      5,591   10,418
  [1] GMM-UBM                          9.98       4.56       4.70     6.02
  [2] i-Vector + Cosine               17.28       6.58       5.00     8.09
  [3] i-Vector + WCCN + Cosine        14.50       5.03       3.42     6.31
  [4] i-Vector + LDA + Cosine         12.41       3.96       2.32     5.03
  [5] i-Vector + WCCN + DNN           12.06       3.30       2.30     4.60
  [6] i-Vector + DNN                  11.01       2.87       2.58     4.54
  Fusion [6] & [4]                    11.63       3.41       1.95     4.48
  Fusion [6] & [1]                    10.20       3.04       2.49     4.41
  Fusion [6] & [4] & [1]              11.12       3.37       1.96     4.39

For the experiments in Section 3, the NIST 2010 SRE [18], core test-common condition 5, is used for evaluation. Table 2 compares the performance of GMM-RBM vectors, which are obtained with URBMs trained with ReLU and VReLU, with traditional i-vectors on the evaluation set. The use of the proposed VReLU shows better performance than the use of ReLU in both cosine and PLDA scoring. In the end, the best results are achieved with score fusion of i-vectors and GMM-RBM vectors, which shows about 7-7.5% and 4-6.5% relative improvements in terms of EER and minDCF, respectively, compared to i-vectors. For score fusion, the BOSARIS toolkit [19] is used.

For the experiments of Section 4, the database has been recorded within the scope of the EU project SpeechDat-Car (LE4-8334) [20]. Table 3 summarizes the results for all the techniques in four categories based on the test signal durations: less than 2 sec, between 2 and 3 sec, more than 3 sec, and all durations. The first two categories are more interesting because the decision should be made fast in this application. Both i-vector+DNN systems show superior performance compared to the i-vector + LDA baseline system. The frame-based GMM-UBM baseline system works better than the other systems only for test signals shorter than 2 sec. However, the error rates for these very short signals remain high in comparison to the other duration categories.

6. Publications

1. O. Ghahabi and J. Hernando, Restricted Boltzmann machines for vector representation of speech in speaker recognition, Computer Speech & Language, vol. 47, pp. 16-29, 2018.

2. O. Ghahabi and J. Hernando, Deep Learning Backend for Single and Multisession i-Vector Speaker Recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807-817, Apr. 2017 (awarded the best RTTH doctorate student article in 2017).

3. O. Ghahabi, A. Bonafonte, J. Hernando, and A. Moreno, Deep neural networks for i-vector language identification of short utterances in cars, in Proc. INTERSPEECH, 2016, pp. 367-371.

4. O. Ghahabi and J. Hernando, Restricted Boltzmann machine supervectors for speaker recognition, in Proc. ICASSP, 2015, pp. 4804-4808.

5. O. Ghahabi and J. Hernando, Deep belief networks for i-vector based speaker recognition, in Proc. ICASSP, May 2014, pp. 1700-1704 (awarded the Qualcomm travel grant).

6. O. Ghahabi and J. Hernando, i-vector modeling with deep belief networks for multi-session speaker recognition, in Proc. Odyssey, 2014, pp. 305-310.

7. O. Ghahabi and J. Hernando, Global impostor selection for DBNs in multi-session i-vector speaker recognition, in Advances in Speech and Language Technologies for Iberian Languages, ser. Lecture Notes in Artificial Intelligence, Springer, Nov. 2014.

8. P. Safari, O. Ghahabi, and J. Hernando, From features to speaker vectors by means of restricted Boltzmann machine adaptation, in Proc. Odyssey, 2016, pp. 366-371.

9. P. Safari, O. Ghahabi, and J. Hernando, Speaker recognition by means of restricted Boltzmann machine adaptation, in Proc. URSI, 2016, pp. 1-4.

10. P. Safari, O. Ghahabi, and J. Hernando, Feature classification by means of deep belief networks for speaker recognition, in Proc. EUSIPCO, 2015, pp. 2162-2166.

11. G. Raboshchuk, C. Nadeu, O. Ghahabi, S. Solvez, B. M. Mahamud, A. Veciana, and S. Hervas, On the acoustic environment of a neonatal intensive care unit: Initial description, and detection of equipment alarms, in Proc. INTERSPEECH, 2014, pp. 2543-2547 (awarded the ISCA travel grant).

7. Conclusions

The main contributions of this thesis have been presented in three main works. In the first one, a hybrid architecture based on DBN and DNN has been proposed to discriminatively model each target speaker for i-vector speaker verification. It was shown that the proposed hybrid system fills approximately 46% of the performance gap between the cosine and the oracle PLDA scoring systems in terms of minDCF. In the second work, a new vector representation of speech has been presented for text-independent speaker recognition. Gaussian Mixture Model (GMM) supervectors have been transformed by a Universal RBM (URBM) to lower dimensional vectors, referred to as GMM-RBM vectors. The experimental results show that the performance of GMM-RBM vectors is comparable with that of traditional i-vectors but with much less computational load. In the third work, a DNN architecture has been proposed for i-vector LID of short utterances recorded in cars. It has been shown that for test signals with a duration of 2-3 sec the proposed DNN architecture outperforms the GMM-UBM and i-vector/LDA baseline systems by 37% and 28%, respectively.
187
8. References
[1] A. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Proc. Interspeech, 2010, pp. 2846–2849.
[2] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[3] A. Senior, H. Sak, and I. Shafran, “Context dependent phone models for LSTM RNN acoustic modelling,” in Proc. ICASSP, 2015, pp. 4585–4589.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[5] Y. Lei, N. Scheffer, L. Ferre, and M. Mclaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. ICASSP, 2014, pp. 1714–1718.
[6] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.
[7] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, Oct. 2015.
[8] T. Pekhovsky, S. Novoselov, A. Sholokhov, and O. Kudashev, “On autoencoders in the i-vector space for speaker recognition,” pp. 217–224, 2016.
[9] J. Villalba, N. Brümmer, and N. Dehak, “Tied variational autoencoder backends for i-vector speaker recognition,” in Proc. Interspeech, 2017, pp. 1004–1008.
[10] E. Variani, X. Lei, E. McDermott, I. Lopez Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP, 2014, pp. 4052–4056.
[11] S. Wang, Y. Qian, and K. Yu, “What does the speaker embedding encode?” in Proc. Interspeech, 2017, pp. 1497–1501.
[12] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Proc. Interspeech, 2017, pp. 1517–1521.
[13] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Proc. Interspeech, 2017, pp. 999–1003.
[14] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. ICCV, 2007, pp. 1–8.
[15] P. Kenny, “Bayesian speaker verification with heavy tailed priors,” in Proc. Odyssey, 2010.
[16] O. Ghahabi, “Deep learning for i-vector speaker and language recognition,” PhD dissertation, Universitat Politecnica de Catalunya, 2018. [Online]. Available: http://hdl.handle.net/2117/118780
[17] NIST, “The NIST speaker recognition i-vector machine learning challenge,” 2014. [Online]. Available: http://nist.gov/itl/iad/mig/upload/sre-ivectorchallenge 2013-11-18 r0.pdf
[18] NIST, “The NIST year 2010 speaker recognition evaluation plan,” 2010. [Online]. Available: https://www.nist.gov/itl/iad/mig/speaker recognition evaluation 2010
[19] N. Brummer and E. Villiers, “BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing,” 2011. [Online]. Available: https://sites.google.com/site/bosaristoolkit/
[20] A. Moreno, B. Lindberg, C. Draxler, G. Richard, K. Choukri, S. Euler, and J. Allen, “SpeechDat-Car: a large speech database for automotive environments,” in Proc. LREC, 2000.
188
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Unsupervised Learning for Expressive Speech Synthesis


Igor Jauk
Universitat Politècnica de Catalunya, Spain
ij.artium@gmail.com

Abstract mechanism which can adapt to all possible conversational sit-


This article describes the homonymous PhD thesis realized at uations and create expressions “on the fly”. For this we need
the Universitat Politècnica de Catalunya. The main topic and automatic processes for analysis of data and training. Some
the goal of the thesis was to research unsupervised methods of suggestions to approach this problem were made for instance in
training expressive voices for tasks such as audiobook reading. [11, 39, 7, 23], where data was clustered by different criteria to
The experiments were conducted on acoustic and semantic do- select training data and similar ideas. And this is also the topic
mains. In the acoustic domain, the goal was to find a feature of this work. The guiding hypotheses are:
set which is suitable to represent expressiveness in speech. The
basis for such a set were the i-vectors. The proposed feature set 1. It is possible to define expressive voices from clusters
outperformed state-of-the-art sets extracted with OpenSmile. of data in the acoustic domain, applying unsupervised
In the semantic domain, the goal was first to predict methods to build the clusters, i.e. no labels of human in-
acoustic features from semantic embeddings of text for expres- terpretation are permitted to define the voices or the data
sive speech and to use the predicted vectors as acoustic cluster in the clusters.
centroids to adapt voices. The result was a system which auto- 2. It is possible to improve the expressiveness of a syn-
matically reads paragraphs with expressive voice and a second thetic voice using in the training process semantic fea-
system which can be considered as an expressive search engine tures which codify some sort of expressive information
and leveraged to train voices with specific expressions. and are obtained fully automatically.
The third experiment moved on to neural network based
speech synthesis and the usage of sentiment embeddings. The So one of the main goals is the automatic processing.
embeddings were used as an additional input to the synthesis Also, as the hypotheses reveal, the work embraces both the
system. The system was evaluated in a preference test showing acoustic, and the semantic domains. The different approaches
the success of the approach. are presented in the sections below. Section 2 introduces
Index Terms: expressive speech, unsupervised training, se- i-vector-based representation methods for expressive speech.
mantic embeddings, acoustic features Section 3 introduces prediction methods of acoustic features
from semantic text representations, also called embeddings.
1. Introduction Section 4 discusses a neural network based TTS system which
uses sentiment embeddings to “predict” expressiveness. Fi-
Speech synthesis is an old, almost romantic, idea of machines, nally, Section 5 provides a small discussion and draws some
computers, and robots, who talk and express themselves as do conclusions, and Section 6 lists the helping parties of this
human beings. In futuristic science fiction movies and liter- project.
ature, there is almost no way around a talking computer or
robot. Sometimes, the talking computer, despite its vast
artificial intelligence capabilities, is identified as such, talking 2. Acoustic feature selection
in a robotic and monotonous way, and being nothing else than Traditionally, expressive speech synthesis and analysis lever-
an aid to humans. In different occasions, computers act very aged especially prosodic features to represent emotions or
human-like, imitating emotions and free will. In today’s world speaking styles. For instance [19] use F0, intensity and du-
of smartphones, mobile connectivity and fast-paced life, speech ration; [35] use glottal parameters. In [32] the authors use a
applications gain more and more importance – not least due to set of prosodic parameters combined with some spectral ones;
ever-improving synthetic speech quality. A fast look at these [11] use prosodic features, i.e. F0, voicing probability, local jit-
applications reveals clearly that today’s users are not looking for ter and shimmer, and logarithmic HNR for audiobook cluster-
an AI with a robotic and monotonous voice: making jokes, un- ing and posterior synthetic voice training. Almost all research
derstanding and showing sympathy, and a long etc. of things concentrate on prosodic features. However, there are also find-
which can be resumed in “sounding human-like”, almost seems ings which state that some emotions might be better represented
to be a warrant for product quality of such applications. with spectral parameters, e.g. [27, 1]. In [22] the authors used
In the not too distant past, expressive speech (or emotional i-vectors (e.g. [31, 8]) to predict emotions. This idea was picked
speech) required considerable effort. Systems were designed for spe- up in [17, 16] and developed into a set of features, i-vector-based
cific emotions, often with control mechanisms for intensity etc., and others, which were compared in three clustering experi-
and achieved impressive results with such techniques – e.g. ments.
[6, 28, 5, 13, 41], just to name a few. However, despite the
partly great sounding results, and apart from the laborious design, 2.1. Experimental framework
these systems focus on a limited set of emotions or expressive
speaking styles. Real-life human speech does include an infinite The first and the second experiments compared multiple fea-
number of emotions as those are speaker and situation depen- tures objectively and subjectively, as published in [17, 16]. The
dent. So in order to create a truly flexible system you need a framework is presented in figure 1.

189 10.21437/IberSPEECH.2018-38
Figure 1: Clustering and synthesis framework.

Table 1: Perplexities (PP) for different feature sets for Expressions (Ex) and Characters (Ch) in comparison to the database.

                           PP/Ex   PP/Ch
  DB                       140.4     8.3
  silRate                   10.6     4.7
  sylRate                    9.4     4.0
  meanF0                     9.8     4.2
  Rhythm-Pitch               8.7     3.4
  Rhythm-Pitch-JShimm        8.6     3.4
  MFCCiVec                   9.0     3.5
  F0iVec                    11.2     4.2
  iVecC                      8.8     3.8
  Rhythm-iVecC               8.2     3.3
  Rhythm-JShimm-iVecC        8.5     3.5
The idea is: use different feature sets to cluster expressive
speech, use the data in clusters to train expressive voices and
synthesize a dialogue using these expressive voices. The corpus feature sets to the winning i-vector set: Rhythm & iVecC.1 Also
is a juvenile narrative audiobook recorded in European Span- here, the proposed combination of Rhythm & iVecC outper-
ish, with a total of 7900 sentences and 8.8 hours of duration. formed all OpenSmile sets.
The clustering algorithm is k-means, specifically VQ, as in [12].
Many different feature combinations were tested. The results of
Table 2: Perplexities for different features combinations, includ-
some of them are presented below. Some features were combined
ing openSMILE for expressions (E) and for characters (Ch).
to sets in order to facilitate the notation.

• silRate is silence rate and syllRate is syllable rate (#/sec) PP/Ex PP/Ch
(extracted with Ogmios [4]). DB 140.4 8.3
• Rhythm is silence and syllable rates (#/sec), duration is09 10.5 3.9
means and variation, computation based on segmenta- is10 10.8 4.0
tion. emobase 10.5 4.0
emolarge 8.8 3.7
• Pitch is F0 means, variance and range. Rhythm − iV ecC 8.2 3.3
• JShimm is Local jitter and shimmer (extracted with [3]).
• MFCCiVec are i-vectors calculated on basis of MFCCs;
F0iVec are i-vectors calculated on basis of F0. iVecC: 2.3. Subjective evaluation
F0 and MFCC based i-vectors (the acoustic features for Two subjective experiments were conducted. For these experi-
the i-vectors were extracted with the AHOCoder [9], the ments, for a given dialogue from the same audiobook (test set
i-vectors themselves were extracted using Kaldi [29]). excluded from the training set), a set of synthetic voices was
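As a rough illustration of the clustering step described above (k-means/VQ over per-utterance feature vectors combining rhythm statistics and i-vector-based features), the sketch below uses scikit-learn. The feature matrix, its dimensionality and the number of clusters are placeholders, not the actual thesis configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-utterance features: e.g. rhythm statistics concatenated
# with a compressed i-vector (the exact dimensionality is illustrative).
n_utterances, feat_dim, n_voices = 7900, 20, 10
rng = np.random.default_rng(0)
features = rng.normal(size=(n_utterances, feat_dim))

# Standardize so rhythm features and i-vector components are comparable.
features = StandardScaler().fit_transform(features)

# Vector quantization via k-means: each cluster later provides the
# adaptation data for one synthetic voice.
kmeans = KMeans(n_clusters=n_voices, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

for c in range(n_voices):
    print(f"cluster {c}: {np.sum(cluster_ids == c)} utterances")
```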
trained using the data in the clusters. The underlying system
2.2. Objective evaluation was an HMM based TTS [24] where the average voice was
An objective evaluation was performed: a small part of the cor- trained using the whole corpus (aprox. 10h) and the cluster data
pus was labeled with expressions and character (speaker) labels was used to perform adaptation. A total of 16 sentences was
(only for the evaluation purposes) and with the aid of these la- presented to a number of participants. The task: design your
bels, the perplexity of the clusters was calculated, derived from own audiobook dialogue using synthetic voices instead of the
real ones.
entropy, as in [33, 42]: PP = 2^H̄(X).
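One plausible implementation of this perplexity measure, assuming H̄(X) is the cluster-size-weighted average entropy (in bits) of the reference-label distribution inside each cluster, is sketched below with made-up toy labels.

```python
import numpy as np
from collections import Counter

def cluster_perplexity(cluster_ids, labels):
    """PP = 2**H, with H the size-weighted average within-cluster label entropy."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    total = len(labels)
    avg_entropy = 0.0
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        counts = np.array(list(Counter(members).values()), dtype=float)
        p = counts / counts.sum()
        entropy = -(p * np.log2(p)).sum()
        avg_entropy += (len(members) / total) * entropy
    return 2.0 ** avg_entropy

# Toy usage: 3 clusters scored against expression labels.
print(cluster_perplexity([0, 0, 1, 1, 2, 2],
                         ["happy", "happy", "sad", "angry", "sad", "sad"]))
```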
The interface is a website, where the participants could
A summary of the results is shown in Table 1. Due to
choose 1 of 10 synthetic voices for each sentence in a dialogue.
space limitations only a few selected results are shown. The upper
The website design aimed to create the right atmosphere of the
line shows the perplexity calculated for the annotated part of the
book story and a more enjoyable experience. Also an introduc-
corpus. The part below it shows the performance of some “tra-
tion text was provided for the case that the participants were not
ditional features” and the part at the bottom, the performance of
familiar with the story.
sets which included i-vectors.
The experiments differ insofar as in the first experiment
As can be seen, the combination of Rhythm and iVecC out- the participants had an example of the original character voice,
performed all other combinations in the given task. in the second they did not. Also: in the first experiment, the
In order to verify these results, an additional objective synthetic voices were chosen manually with the criterion of re-
evaluation was conducted. For this evaluation, OpenSmile fea- semblance to the original voices, and mixed with other random
ture sets, as in openSMILE Book [10], were compared to the voices. In the second, that choice was made automatically by
proposed sets. OpenSmile is a set of feature extraction tools acoustic distance for half of the sentences, the rest was random.
widely used for emotional and expressive speech analysis and
synthesis. It extracts thousands of features and statistics about 1 is09, is10, emobase, emolarge are feature sets by OpenSmile used
them and is considered to be state-of-the-art feature extraction in different experiments. For further details please refer to openSMILE
for expressive speech. Table 2 compares some OpenSmile Book [10].

190
The idea behind: if some voices are especially suitable for some TTS. The performance of the system is evaluated in two sub-
characters, the participants would tend to prefer them. jective experiments. For the embeddings the toolkit word2vec
In the first experiment, 19 persons had participated; in the [40, 25] was used to calculate the word embeddings; the sen-
second, 11 persons. Due to space limitations only the results for tence embeddings were calculated as centroids of the word em-
the second experiment are shown in Table 3. beddings in the vector space. The vector space has been trained
with the Wikicorpus [30].
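A minimal sketch of this sentence-embedding step (centroid of word embeddings) is given below; the tiny word-vector table is hypothetical and stands in for a word2vec model trained on the Wikicorpus.

```python
import numpy as np

def sentence_embedding(sentence, word_vectors, dim):
    """Centroid of the word embeddings of the in-vocabulary words."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Hypothetical toy vocabulary; real vectors would come from a trained
# word2vec model on the Wikicorpus.
dim = 4
word_vectors = {
    "mysterious": np.array([0.9, 0.1, 0.0, 0.2]),
    "secret":     np.array([0.8, 0.2, 0.1, 0.1]),
    "silent":     np.array([0.7, 0.0, 0.3, 0.0]),
}
print(sentence_embedding("Mysterious secret in silent obscurity", word_vectors, dim))
```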
Table 3: Relative preferences for the voices v0-v9 over the whole paragraph for the narrator (Narr) and the two present characters (Ch2 and Ch3).

          v0     v1     v2     v3     v4
  Narr   0.42   0.06   0.00   0.03   0.04
  Ch2    0.13   0.16   0.14   0.23   0.03
  Ch3    0.18   0.13   0.13   0.31   0.00

          v5     v6     v7     v8     v9
  Narr   0.23   0.04   0.10   0.06   0.01
  Ch2    0.09   0.05   0.03   0.10   0.03
  Ch3    0.18   0.00   0.00   0.00   0.05

3.2. Subjective evaluation
The task in the first experiment was to read two book paragraphs, automatically predicting expressiveness for each sentence. For each sentence in the paragraph an embedding was calculated. It was used to predict an acoustic feature vector as in Section 2. Two prediction models were compared: a nearest-neighbour classifier and a neural network. Both paragraphs were extracted from different books of the same series, so as to preserve the characters and the ambience. The expressive readings were also compared to a neutral reading. The task was implemented as a preference test. The participants had the option to choose that two systems performed equally.
A total of 21 persons participated in the experiment. Table 4
The results show that there is an actual preference for some shows the results of this experiment for both paragraphs (P1
voices over others. Of course, not all participants have the and P2). The results show a clear preference for the expressive
same imagination of the book characters, especially if they systems.
don’t know the book. So individual preferences are out of scope
of this task. But the results show clearly that the approach as well as the proposed feature sets are suitable for the task. The task itself can be interpreted as a simulation of a real-life application of expressive speech. Further details on the experiments and the results are published in [17, 16, 14].

Table 4: Prediction method preferences by users for the first two tasks: DNN method, nearest neighbour (NN) method, neutral voice.

         DNN     NN    neutral   =NN    =neutral
  P1     0.19   0.43    0.0      0.38    0.0
  P2     0.29   0.14    0.04     0.48    0.05

3. Semantics-to-Acoustics Mapping
When we humans read a text aloud – let's say we read a good
night story to a small child–, we probably will read the story
with an expressive voice imitating book characters, their emo- In a further test the prediction system was used as a “search
tions in different situations etc. as to engage the child. Although engine” for expressive training data. For this purpose, a key-
we would all likely read slightly differently, the “quality” in turn was used as a cluster centroid for acoustic data and
of our reading will possibly be judged by our expressive abil- on its side was used as a cluster centroid for acoustic data and
ities. The question for this section is: What in the text does for voice adaptation. For example “Mysterious secret in silent
provide us the necessary information as to adequately adapt our obscurity” was used as seed to find training data for a suspense
reading style and can it be taught to a machine? voice. Other trained emotions were angry, happy and sadness,
The key approach is the automatic representation of text. also a neutral voice. Seven sentences were synthesized with
Such representations are often called embeddings and there is each of these voices and again, a preference test was presented
a large number of techniques to calculate them, like for in- to the 21 participants. The synthesized sentences were chosen
stance [21, 2, 26, 34]. The basic idea is to represent text in trying to reflect their expressive meaning. For example “Finally,
terms of word co-occurrences of defined text units (e.g. sen- the holidays begin!” is supposed to be happy and the expecta-
tences, paragraphs, etc.). This way each unit is represented as a tion was that the participants would choose the happy voice for
co-occurrence vector of its own words. More modern techniques it. Table 5 presents the results for this experiment.
train neural networks to predict a certain feature. Then extract
the vector representations from intermediate layers of the network, like [26] and [34]. The assumption is that these representations actually also codify expressive information, especially if the underlying network is trained using an “expressive” criterion (see Section 4). So the task is to convert this vector representation into acoustics.

3.1. Experimental framework
The framework is: in a given text corpus, in this case an audiobook, for each sentence of the text a semantic embedding is calculated. This embedding is then used to predict an acoustic feature vector, concretely from the above experiment, which in turn is the centroid of a data cluster. As in the experiments above, these data clusters are used for adaptation in an HMM

Table 5: Task 3. Voice preference by users for each sentence.

              happy   angry   suspense   neutral
  Happy1      0.29    0.38      0.24      0.10
  Happy2      0.52    0.24      0.10      0.14
  Angry1      0.14    0.48      0.24      0.14
  Angry2      0.38    0.43      0.14      0.05
  Suspense    0.0     0.05      0.81      0.14
  Sadness     0.19    0.05      0.43      0.43
  Neutral     0.10    0.05      0.43      0.43

The preferences for voices are pretty clear. It is interesting to remark that the happy and angry voices are sometimes exchangeable; the same can be said about the sad and neutral voices in

191
their respective contexts. Further details on these experiments general preference results will be shown here. More details can
can be found in [15, 14]. be found in [14, 18].
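To make the semantics-to-acoustics mapping of Section 3 concrete, the sketch below predicts an acoustic feature vector from a sentence embedding and selects the nearest acoustic cluster centroid for voice adaptation. The regressor, array shapes and data are illustrative stand-ins for the nearest-neighbour and neural-network predictors compared in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: sentence embeddings paired with the acoustic
# feature vectors of the utterances they correspond to.
sem_train = rng.normal(size=(500, 300))   # semantic embeddings
ac_train = rng.normal(size=(500, 20))     # acoustic feature vectors
centroids = rng.normal(size=(10, 20))     # acoustic cluster centroids

# Stand-in for the neural-network predictor (arbitrary hyper-parameters).
reg = MLPRegressor(hidden_layer_sizes=(128,), max_iter=200, random_state=0)
reg.fit(sem_train, ac_train)

def select_cluster(sentence_embedding):
    """Predict an acoustic vector and return the closest cluster centroid."""
    predicted = reg.predict(sentence_embedding[None, :])[0]
    distances = np.linalg.norm(centroids - predicted, axis=1)
    return int(np.argmin(distances))

print("selected acoustic cluster:", select_cluster(rng.normal(size=300)))
```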
Table 6 shows the preferences divided by the sentiment.
4. NN-based expressive TTS with sentiment For positive and negative sentences, the word level system per-
formed best, although for negative sentences with high variance.
Looking back at the experiment in the previous chapter there For neutral sentences, the word context and tree distance system
are few spontaneous suggestions which can be made as to de- performed best. Possibly it is due to the fact that it probably has
velop the approach and improve the results. First, nowadays an equilibrating effect.
HMM-based synthesis is almost completely replaced by syn- T-tests show that for negative sentences, there is a signifi-
thesis based on neural networks. NN-based TTS provides new cant difference between the system without sentiment and the
possibilities of leveraging semantic vectors and avoiding clus- word level system, and no significant difference for the other
tering, which is an advantage itself since all data is always taken systems. For neutral sentences, there is a significant difference
into account in the training process. For this experiment the between the system without sentiment and the word context and
DNN-based TTS as described in [36] was used. tree distance system, but not for the other systems. For positive
The second point is the usage of embeddings which are sentences, there is only significant difference for the one-tailed
more suitable to represent expressiveness. The authors in t-test between the system without sentiment and the word level
[20, 34] propose the Stanford Sentiment Parser. The system system.
is trained on movie reviews and predicts the positivity or nega-
tivity of the sentence.
In previous work, neural network based systems have already been combined with semantic vector input, though not for expressive speech. To name a few, [37] use word embeddings to substitute ToBI and POS tags in RNN-based synthesis, achieving significant system improvement. [38] enhance the input to NN-based systems with continuous word embeddings, and also try to substitute the conventional linguistic input with the word embeddings. They do not achieve performance improvement; however, when they use phrase embeddings combined with phonetic context, they do achieve significant improvement in a DNN-based system. [38] enhances word vectors with prosodic information, i.e. updates them, achieving significant improvements.

Table 6: System preferences for positive, negative and neutral sentences. ws: without sentiment, wcd: word context and tree distance, wl: word level.

                        ws     wcd    wl
  positive mean        1.84   1.85   1.71
  positive variance    0.54   0.76   0.54
  negative mean        2.06   1.96   1.84
  negative variance    0.52   0.67   1.1
  neutral mean         2      1.83   1.96
  neutral variance     0.71   0.6    0.95

4.1. Experimental framework


The DNN-based system was trained on two audiobooks in 5. Discussion & Conclusions
American English. An additional linguistic input is introduced,
the sentiment predicted by the Stanford sentiment parser. Here, The main topic addressed in this thesis is the automatic training
different input combinations are tested, including word context. of expressive voices. Several experiments were conducted on
With word context, probability vector of the word in ques- the acoustic domain, automatically selecting training data, and
tion and the probability vectors of two words on the left and two on semantic domain, predicting acoustics from semantics. In
words on the right were used. Also the tree distance, which is the last experiments, two state-of-the-art techniques were com-
the hierarchical distance counted in the number of binary tree bined for the given task. Speech technology domain is an in-
nodes between words is added, such that the input vector for credibly fast evolving field, especially being data driven where
each word for the system (v wcd) is composed as follows: nowadays data is the key to all information technology. Nev-
ertheless, the underlying technologies leveraged and presented
P = {P_l2, P_l1, P_c, P_r1, P_r2, D_t}    (1) in this work are not only not out of date, but are actually the
where Pc is the probability vector for the current word, the driving power of current technologies. This includes the NN-
P_l2 is the probability vector for the second word on the left, based TTS and the semantic embeddings for text representation,
Pl1 is probability vector for the first word on the left, Pr1 is but also not least i-vector-like representations for the acoustic
probability vector for the first word on the right and Pr2 is the domain. In that sense, this thesis is a substantial contribution to
probability vector for the second word on the right, each of the speech technology research.
probability vectors being given by the sentiment parser. D_t is the hierar-
chical tree distance (distance in tree counted in nodes). 6. Acknowledgements
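A small sketch of how the input vector of Equation (1) can be assembled is given below; the zero-padding at sentence boundaries and the 5-class probability vectors are assumptions of this example, not details stated in the paper.

```python
import numpy as np

def word_input_vector(prob_vectors, tree_distances, i):
    """Concatenate P_l2, P_l1, P_c, P_r1, P_r2 and D_t for word i (Eq. 1).

    prob_vectors: list of per-word sentiment probability vectors.
    tree_distances: per-word hierarchical tree distance (counted in nodes).
    Out-of-sentence positions are zero-padded (an assumption of this sketch).
    """
    dim = len(prob_vectors[0])
    def p(j):
        return prob_vectors[j] if 0 <= j < len(prob_vectors) else np.zeros(dim)
    context = [p(i - 2), p(i - 1), p(i), p(i + 1), p(i + 2)]
    return np.concatenate(context + [np.array([tree_distances[i]])])

# Toy example: 3 words with 5-class sentiment probabilities.
probs = [np.full(5, 0.2), np.array([0.1, 0.2, 0.4, 0.2, 0.1]), np.full(5, 0.2)]
dists = [1.0, 2.0, 1.0]
print(word_input_vector(probs, dists, 1).shape)  # (26,) = 5 * 5 + 1
```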
I would like to thank especially Antonio Bonafonte, the tutor of
4.2. Subjective experiments
this thesis. This work was supported by the Spanish Ministerio
Two experiments have been conducted in order to test the sys- de Economı́a y Competitividad and European Regional Devel-
tem performance. In all of them, similar to the experiment opment Fund, contract TEC2015-69266-P (MINECO/FEDER,
above, a preference test is conducted among a group of par- UE) and by the FPU grant (Formación de Profesorado Univer-
ticipants. A total of 20 persons participated in the experiment. sitario) from the Spanish Ministry of Science and Innovation
However, there was a larger group of speech technology ex- (MCINN) to Igor Jauk, with a special collaboration with the
perts and people who have no experience with speech technol- University of Texas at El Paso, under Pr. Nigel Ward, and the
ogy which allowed for an interesting comparison of the devi- National Institute of Informatics in Tokyo, under Pr. Junichi
ations of these two groups. Due to space limitations only the Yamagishi.

192
7. References [23] J. Lorenzo Trueba. Design and Evaluation of Statistical Paramet-
ric Techniques in Expressive Text-To-Speech: Emotion and Speak-
[1] R. Barra-Chicote, J. Yamagishi, S. King, J. Montero, and ing Styles Transplantation. PhD thesis, E.T.S.I. Telecomunicación
J. Macias-Guarasa. Analysis of statistical parametric and unit (UPM), 2016.
selection speech synthesis systems applied to emotional speech.
Speech Communication, 52:394–404, 2010. [24] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai. Speech syn-
thesis from hmms using dynamic features. In Acoustics, Speech,
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neu- and Signal Processing (ICASSP), pages 389–392, 1996.
ral probabilistic model. Journal of Machine Learning Research,
3(Feb):1137–1155, 2003. [25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estima-
tion of word representations in vector space. In Proceedings of
[3] P. Boersma and D Weenink. Praat: doing phonetics by computer Workshop at ICLR, 2013.
(version 5.4.07), 2015.
[26] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Dis-
[4] T. Bonafonte, P. Aguero, J. Adell, J. Perez, and A. Moreno. Og- tributed Representations of Words and Phrases and their Compo-
mios: the UPC text-to-speech synthesis system for spoken transla- sitionality. ArXiv e-prints, October 2013.
tion. In Proceedings of TC-STAR Workshop on Speech-to-Speech [27] J.M. Montero, J. Gutierrez-Arriola, J. Colas, E. Enriquez, and
Translation, pages 199–204, 2006. J.M. Pardo. Analysis and modeling of emotional speech in span-
[5] F. Burckhardt and W.F. Sendelmeier. Verification of acoustical ish. In Proceedings of ICPhS, pages 671–674, 1999.
correlates of emotional speech using formant synthesis. In Pro- [28] I.R. Murray and J.L. Arnott. Toward the simulation of emotion
ceedings of ISCA Workshop on Speech and Emotion, pages 151– in synthetic speech: a review of the literature on human vocal
156, 2000. emotion. The Journal of the Acoustic Society of America, pages
[6] J.E. Cahn. Generation of affect in synthesized speech. In Pro- 1097–1108, 1993.
ceedings of American Voice I/O Society, pages 251–256, 1989. [29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
[7] L. Chen, M.J.F. Gales, N. Braunschweiler, M. Akamine, and N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz,
K. Knill. Integrated expression prediction and speech synthesis J. Silovsky, G. Stemmer, and K. Vesely. The kaldi speech recogni-
from text. IEEE Journal of Selected Topics in Signal Processing, tion toolkit. In IEEE 2011 Workshop on Automatic Speech Recog-
8(2):323–335, 2014. nition and Understanding. IEEE Signal Processing Society, De-
cember 2011. IEEE Catalog No.: CFP11SRW-USB.
[8] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet.
Front end factor analysis for speaker verification. IEEE Transac- [30] S. Reese, G. Boleda, L. Cuadros, M. Padró, and G. Rigau. Wiki-
tions on Audio, Speech and Language Processing, 19(4):788–798, corpus: A word-sense disambiguated multilingual wikipedia cor-
2011. pus. In Proceedings of 7th Language Resources and Evaluation
Conference (LREC10), pages 1418–1421, 2010.
[9] D. Erro, I. Sainz, E. Navas, and I. Hernaez. Improved HNM-based
[31] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker veri-
vocoder for statistical synthesizers. In Proceedings of Interspeech,
fication using adapted Gaussian mixture models. Digital Signal
pages 1809–1812, 2011.
Processing, 10(1-3):19–41, 2000.
[10] F. Eyben. The opensmile book, 2016. [32] B. Schuller, R. Müller, M. Lang, and G. Rigoll. Speaker indepen-
[11] F. Eyben, S. Buchholz, N. Braunschweiler, J. Latorre, V. Wan, dent emotion recognition by early fusion of acoustic and linguistic
M. Gales, and K. Knill. Unsupervised clustering of emotion and features within ensembles. In Proceedings of Interspeech, pages
voice styles for expressive TTS. In Proceedings of ICASSP, pages 805–808, 2005.
4009–4012, 2012. [33] C.E. Shannon. A mathematical theory of communication. The
[12] R.M. Gray. Vector quantization. IEEE ASSP Magazine, 1(2):4– Bell System Technical Journal, 27(3):379–423, 1948.
29, 1984. [34] R. Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y.
[13] W. Hamza, R. Bakis, E. Eide, M. Picheny, and J. Pitrelli. The IBM Ng, and C. Potts. Recursive deep models for semantic composi-
expressive speech synthesis system. In Proceedings of ICSLP, tionality over a sentiment treebank. In Conference on Empirical
pages 2577–2580, 2004. Methods in Natural Language Processing (EMNLP), 2013.
[14] I. Jauk. Unsupervised Learning for Expressive Speech Synthesis. [35] E. Szekely, J. Cabral, P. Cahill, and J. Carson-Berndsen. Cluster-
PhD thesis, Universitat Politècnica de Catalunya, 2017. ing expressive speech styles in audiobooks using glottal source pa-
rameters. In Proceedings of Interspeech, pages 2409–2412, 2011.
[15] I. Jauk and A. Bonafonte. Direct expressive voice training based
[36] S. Takaki and J. Yamagishi. Constructing a deep neural network
on semantic selection. In Proceedings of Interspeech, pages
based spectral model for statistical speech synthesis. Recent Ad-
3181–3185, 2016.
vances in Nonlinear Speech Processing, 48:117–125, 2016.
[16] I. Jauk and A. Bonafonte. Prosodic and spectral ivectors for ex- [37] P. Wang, Y. Qian, F.K. Soong, L. He, and H. Zhao. Word embed-
pressive speech synthesis. In Proceedings of Speech Synthesis ding for recurrent neural network based tts synthesis. In Proceed-
Workshop 9, pages 59–63, 2016. ings of International conference on acoustics, speech and signal
[17] I. Jauk, A. Bonafonte, P. López-Otero, and L. Docio-Fernandez. processing (ICASSP), pages 4879–4883, 2015.
Creating expressive synthetic voices by unsupervised clustering [38] X. Wang, S. Takaki, and J. Yamagishi. Investigating of using con-
of audiobooks. In Interspeech 2015, pages 3380–3384, 2015. tinuous representation of various linguistic units in neural network
[18] I. Jauk, J. Lorenzo-Trueba, J. Yamagishi, and A. Bonafonte. Ex- based text-to-speech synthesis. IEICE Transactions on Informa-
pressive speech synthesis using sentiment embeddings. In Pro- tion and Systems, E99-D(10):2471–2480, 2016.
ceedings of Interspeech, pages 3062–3066, 2018. [39] O. Watts. Unsupervised Learning for Text-to-Speech Synthesis.
[19] R. Kehrein. The prosody of authentic emotions. In Proceedings PhD thesis, University of Edinburgh, 2012.
of Speech Prosody, pages 423–426, 2002. [40] word2vec Tool for computing continuous distributed representa-
[20] D. Klein and C.D. Manning. Accurate unlexicalized parsing. tions of words.
ACL, pages 423–430, 2003. [41] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi. Modeling
of various speaking styles and emotions for hmm-based speech
[21] K. Kuttler. An introduction to linear algebra. Bringham Young
synthesis. In Proceedings of Eurospeech, pages 2461–2464, 2003.
University, 2007.
[42] Y. Zhao and G. Karypis. Empirical and theoretical comparisons
[22] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. iVec- of selected criterion functions for document clustering. Machine
tors for continuous emotion recognition. In Proceedings of Iber- Learning, 55(3):311–331, 2004.
speech 2014, pages 31–40, 2014.

193
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

ODESSA/PLUMCOT at Albayzin Multimodal Diarization Challenge 2018


Benjamin MAURICE 1, Hervé BREDIN 1, Ruiqing YIN 1, Jose PATINO 3,
Héctor DELGADO 3, Claude BARRAS 2, Nicholas EVANS 3, Camille GUINAUDEAU 2

1 LIMSI, CNRS, Université Paris-Saclay, Orsay, France
2 LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, Orsay, France
3 EURECOM, Sophia-Antipolis, France
firstname.lastname@limsi.fr, firstname.lastname@eurecom.fr

Abstract 2.1. Face tracking


This paper describes ODESSA and PLUMCOT submissions to After an initial step of shot boundary detection using opti-
Albayzin Multimodal Diarization Challenge 2018. Given a list cal flow and displaced frame difference [3], face tracking-by-
of people to recognize (alongside image and short video sam- detection is applied within each shot using a detector based on
ples of those people), the task consists in jointly answering the histogram of oriented gradients [4] and the correlation tracker
two questions “who speaks when?” and “who appears when?”. proposed in [5]. More precisely, face detection is applied ev-
Both consortia submitted 3 runs (1 primary and 2 contrastive) ery frame (so every 40ms), and tracking is performed in both
based on the same underlying monomodal neural technologies: forward and backward directions.
neural speaker segmentation, neural speaker embeddings, neu-
ral face embeddings, and neural talking-face detection. Our 2.2. Face embedding
submissions aim at showing that face clustering and recogni-
tion can (hopefully) help to improve speaker diarization. Each face track is then processed using the ResNet network
Index Terms: multimodal speaker diarization, face clustering with 29 convolutional layers [6] available in the dlib machine
learning toolkit [7]. This network was trained on both Face-
Scrub [8] and VGG-Face [9] datasets to project each face into a
1. Introduction 128-dimensional Euclidean space, in which faces from the same
This paper describes ODESSA1 and PLUMCOT2 submissions person are expected to be close to each other. Each face track is
to Albayzin Multimodal Diarization Challenge 2018 [1]. Given described by its average face embedding xface .
a collection of broadcast news TV recordings and a list of peo-
ple to recognize (alongside image and short video samples of 2.3. Face clustering
those people), the task consists in jointly answering the two
questions “who speaks when?” and “who appears when?”. Face tracks are grouped together using agglomerative cluster-
While ODESSA submissions are made of the simple jux- ing. Clustering is initialized with one cluster per face track (de-
taposition of two monomodal systems (audio-only speaker di- scribed by the average face embedding introduced in the pre-
arization on one side, visual-only face recognition on the other vious section, x_k = x_face). Then, the following process is re-
side), PLUMCOT runs aimed at showing that face clustering peated iteratively until a stopping criterion is reached:
and recognition can help to improve speaker diarization (and • find the two most similar clusters (i and j) according to
vice-versa). the Euclidean distance d_ij = d(x_i, x_j) between their
Figure 1 provides an overview of PLUMCOT multimodal embeddings x_i and x_j;
pipelines. The upper part corresponds to the face recognition • compute the embedding x of the newly formed cluster as
pipeline described in detail in Section 2. The lower part cor- the weighted average of the embeddings x_i and x_j of the
responds to the speaker diarization pipeline further described two merged clusters i and j:
in Section 3. Section 4 describes the proposed multimodal ap-
proaches, depicted by vertical arrows between audio and visual x = (n_i · x_i + n_j · x_j) / (n_i + n_j)    (1)
modalities in Figure 1. n = n_i + n_j    (2)
Instead of optimizing audio and visual pipelines separately,
we propose to tune the whole set of hyper-parameters jointly
with respect to the official evaluation metric. This is described where ni and nj are the total number of face tracks be-
in Section 5. Following sections 7 and 8 introduce the experi- longing to clusters i and j respectively.
mental protocol and results on the development set3 .
This agglomerative process stops when d_ij is greater than a tunable threshold θ_fc.
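A compact sketch of this agglomerative procedure (Equations 1-2) is given below; the embeddings and the threshold value are illustrative, and the quadratic search for the closest pair is kept deliberately simple.

```python
import numpy as np

def agglomerative_face_clustering(track_embeddings, theta_fc=0.5):
    """Greedy merging of face tracks; each cluster keeps a weighted mean embedding."""
    clusters = [(emb.copy(), 1) for emb in track_embeddings]  # (embedding, n_tracks)
    assignment = list(range(len(track_embeddings)))           # track -> cluster id
    while len(clusters) > 1:
        # Find the two closest clusters by Euclidean distance.
        best, best_d = None, np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][0] - clusters[j][0])
                if d < best_d:
                    best, best_d = (i, j), d
        if best_d > theta_fc:          # stopping criterion
            break
        i, j = best
        (xi, ni), (xj, nj) = clusters[i], clusters[j]
        clusters[i] = ((ni * xi + nj * xj) / (ni + nj), ni + nj)  # Eq. (1)-(2)
        clusters.pop(j)
        assignment = [i if a == j else (a - 1 if a > j else a) for a in assignment]
    return assignment

# Toy usage with random 128-d face-track embeddings.
rng = np.random.default_rng(0)
tracks = rng.normal(size=(6, 128))
print(agglomerative_face_clustering(tracks, theta_fc=20.0))
```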
2. Face clustering and recognition
This section describes the building blocks of the “face” part of 2.4. Face recognition
our runs. They all rely on the pyannote.video toolkit introduced While a perfect face clustering should lead to a perfect (visual)
in [2]. It mostly consists of three sub-modules: face tracking, diarization error rate, the actual metric used in the Albayzin
neural face embedding, and face clustering. Multimodal Diarization Challenge assumes that only a limited
1 ANR/SNF project ANR-15-CE39-0010 list of T target persons should be returned by the system. Enrol-
2 ANR/DFG project ANR-16-CE92-0025 ment data is provided for each target, containing approximately
3 Official results on the test set are not available yet. 10 pictures and one short video sample showing their face.

194 10.21437/IberSPEECH.2018-39
c2f ace

Recognition
θf a
Face tracking
pf ace c1f ace

Clustering Recognition Filtering


θf c θf a λf s

pspeaker

Mapping Resegmentation

c2speaker
Speaker
Clustering Mapping Resegmentation
segmentation

θsc c1speaker

Resegmentation
λsc

Figure 1: Pipeline used for PLUMCOT submissions (p = primary, c = contrastive).


θ• and λ• are jointly-optimized hyper-parameters.

One can then assign each cluster to the closest target t∗ by 3.2. Speaker embedding
comparing the cluster embedding x to each target embedding
x_t computed as the average embedding extracted from all their The embedding architecture used is the one introduced in [2]
enrolment pictures: and further improved in [13]. In the embedding space, using
the triplet loss paradigm, two sequences xi and xj of the same
t* = argmin_{t ∈ {1...T}} d(x, x_t)    (3) speaker (resp. two different speakers) are expected to be close to
(resp. far from) each other according to their angular distance.
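For reference, the angular distance between two embeddings can be written as the arccosine of their cosine similarity; a minimal helper with toy vectors:

```python
import numpy as np

def angular_distance(x, y, eps=1e-8):
    """Angle (in radians) between two embedding vectors."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

a, b = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(angular_distance(a, b))  # ~0.785 rad (45 degrees)
```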
In case d(x, x_t*) is greater than a tunable threshold θ_fa, the The embeddings are trained on the Voxceleb corpus.
cluster is decided to be a non-target person and therefore not
returned by the system. This approach is denoted by pface in
Figure 1 and constitutes the “face” part of PLUMCOT primary 3.3. Speech turn clustering
submission and of all ODESSA submissions.
A variant of this cluster-wise face recognition approach is As proposed in [10], we use Affinity propagation (AP) algo-
to perform recognition directly at the level of face tracks (i.e. rithm [14] to perform clustering of speech turns. AP does not
without prior face clustering). This variant is denoted by c2face require a prior choice of the number of clusters contrary to other
in Figure 1 and constitutes the “face” part of PLUMCOT con- clustering methods. All speech segments are potential cluster
trastive submission #2. centers (exemplars). Taking as input the pair-wise similarities
between all pairs of speech segments, AP will select the exem-
plars and associate all other speech segments to an exemplar. In
3. Speaker diarization our case, the similarity between ith and j th speech segments
This section describes the building blocks of the “speaker” part is the negative angular distance between their embeddings. AP
of PLUMCOT runs. They all rely on the speaker diarization has two hyper-parameters: preference θsc and damping factor
approach introduced in [10]. λsc .
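A sketch of this clustering step with scikit-learn's AffinityPropagation on a precomputed similarity matrix (negative angular distances) is shown below; the embeddings, preference and damping values are placeholders for the tuned θ_sc and λ_sc.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def angular_distance_matrix(embeddings, eps=1e-8):
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    cos = np.clip(normed @ normed.T, -1.0, 1.0)
    return np.arccos(cos)

# Hypothetical speech-turn embeddings (one row per speech turn).
rng = np.random.default_rng(0)
turn_embeddings = rng.normal(size=(40, 32))

# Similarity = negative angular distance, as in the paper.
similarity = -angular_distance_matrix(turn_embeddings)

# preference and damping stand in for the tuned hyper-parameters.
ap = AffinityPropagation(affinity="precomputed", preference=-2.0, damping=0.7,
                         random_state=0)
labels = ap.fit_predict(similarity)
print("number of speaker clusters:", len(set(labels)))
```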

3.1. Speech turn segmentation 3.4. Re-segmentation


All submissions share the same speech activity detection (SAD)
system proposed in [11]. SAD is modeled as a supervised bi- A final re-segmentation step is performed to refine time bound-
nary classification task (speech vs. non-speech), and addressed aries of the segments generated in the clustering step. It uses
as frame-wise sequence labeling tasks using bi-directional Gaussian mixture models (GMM) to model the clusters, and
LSTM on top of MFCC features. For speaker change detection (SCD), two systems were maximum likelihood scoring at feature level. Since the log-
explored: the first one named uniform segmentation which likelihoods at frame level are noisy, an average smoothing
splits the speech parts into 1s segments, the second one using within a 1s sliding window is applied to the log-likelihood
the system proposed by [12]. Similar to SAD, SCD is mod- curves obtained with each cluster GMM. Then, each frame is
eled as a supervised binary sequence labeling task (change vs. assigned to the cluster which provides the highest smoothed
non-change). log-likelihood.

195
4. Multimodal fusion parameters that minimizes the speaker diarization error rate for
“speaker” part and the face diarization error rate for “face” part.
This section describes our attempts at improving speaker di-
arization with face clustering, and vice versa. Those two ap-
proaches were respectively submitted as PLUMCOT primary 6. Submissions
run (4.1) and PLUMCOT first contrastive run (4.2). Figure 1 summarizes the primary and two contrastive runs of
the PLUMCOT consortium. All of them have been introduced
4.1. Improving speaker diarization with face clustering in the previous sections of this paper.
Let us assume that there are N speakers according to speaker di- The ODESSA consortium mostly focused on the
arization, and M persons according to face clustering (or recog- monomodal speaker diarization aspects of the task. Therefore,
nition). Let K ∈ R^(N×M) be the co-occurrence matrix of the diarization challenge rely on the same systems used for its
output of both pipelines: K_ij is the overall duration in which open-set submissions to the speaker diarization challenge: the
speaker i ∈ {1...N} is speaking and person j ∈ {1...M} is fusion at similarity-level of various speech turn representations
visible. fusion at similarity-level of various speech turn representation
The main intuition motivating this approach arises from the (such as neural embeddings and binary keys). More informa-
following observation about broadcast news videos: most of the tion can be found in [16]). All three ODESSA submissions use
time, the camera is pointing at the current speaker. Therefore, the same “face” part as PLUMCOT primary submission.
the proposed approach simply updates each speaker cluster by
assigning them to the most co-occurring face cluster: 7. Experimental protocol
i ← argmax_{j ∈ {1...M}} K_ij    (4) 7.1. RTVE2018 corpus
Thanks to the joint optimization (described in Section 5) of The RTVE2018 dataset is a collection of diverse TV shows
stopping criteria for both face clustering and speaker diariza- aired between 2015 and 2018 on the public Spanish National
tion, we anticipate that this approach will “choose” to favour Television (RTVE). The development subset of the RTVE2018
smaller (but purer) speaker clusters than the purely monomodal database contains one single 2 hours show “La noche en 24H”
speaker diarization pipeline. A speaker divided into several labeled with speaker and face timestamps. It also contains 11
small clusters may then be merged back together thanks to (a additional files (for a total duration of 14 hours) labeled with
hopefully better) face clustering and Equation 4. speaker timestamps only. Enrollment files for the target persons
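The mapping of Equation (4) can be sketched as follows: accumulate speaker/face co-occurrence durations and relabel each speaker cluster with its most co-occurring face identity. The segment lists below are toy values.

```python
import numpy as np

def cooccurrence_mapping(speaker_segments, face_segments, n_speakers, n_faces):
    """speaker_segments / face_segments: lists of (start, end, cluster_id).

    Returns, for each speaker cluster, the face cluster with the largest
    total overlap duration (Equation 4)."""
    K = np.zeros((n_speakers, n_faces))
    for s_start, s_end, spk in speaker_segments:
        for f_start, f_end, face in face_segments:
            overlap = min(s_end, f_end) - max(s_start, f_start)
            if overlap > 0:
                K[spk, face] += overlap
    return K.argmax(axis=1)

# Toy example: two speaker clusters, two face clusters.
speakers = [(0.0, 5.0, 0), (5.0, 9.0, 1)]
faces = [(0.5, 4.5, 1), (5.5, 8.0, 0)]
print(cooccurrence_mapping(speakers, faces, n_speakers=2, n_faces=2))  # [1 0]
```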
are also provided: they consist of a few pictures and one short
4.2. Filtering face detection with speech activity detection video for each target.
The evaluation set contains 3 video files of almost 2 hours
each of TV shows labeled with speaker and face timestamps. However, at the time of the submission of this paper, we have no results on the test set, so we are not reporting results on this test set.

Figure 2: Face tracks within long non-speech regions (red) are removed. [Diagram: speech turns and face tracks on a shared time axis.]

7.2. Evaluation metric
The evaluation metric used for this task is the diarization error rate (DER), defined as follows:

DER = (false alarm + missed detection + confusion) / total    (5)
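As a side note, this metric is implemented, for instance, in the pyannote.metrics package; the snippet below uses toy reference and hypothesis segments and reflects our understanding of that API rather than the official challenge scoring tool (the definition of each term in Equation (5) continues below).

```python
# A minimal sketch using pyannote.metrics (pip install pyannote.metrics);
# the segments below are toy values, not challenge data.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0, 10)] = "speaker_A"
reference[Segment(10, 20)] = "speaker_B"

hypothesis = Annotation()
hypothesis[Segment(0, 12)] = "spk1"   # labels need not match the reference names
hypothesis[Segment(12, 20)] = "spk2"

metric = DiarizationErrorRate()
print(f"DER = {100 * metric(reference, hypothesis):.2f}%")
```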
Our face detection and tracking module tends to detect lots
of non-target faces, leading to a huge amount of false alarms where false alarm is the duration of non-speech incorrectly clas-
(e.g. in crowds, in credits at the end of TV shows, etc.). As sified as speech, missed detection is the duration of speech in-
depicted in Figure 2, we propose a very simple solution to this correctly classified as non-speech, confusion is the duration of
problem: filtering face tracks in long non-speech regions. speaker confusion, and total is the total duration of speech in the
reference. Note that this metric does take overlapping speech
5. Hyper-parameters joint optimization into account, potentially leading to increased missed detection
in case the speaker diarization system does not include an over-
As mentioned in Section 4.1, the various modules of PLUM-
lapping speech detection module. DER is a standard metric for
COT runs are jointly optimized. For instance, the “speaker”
evaluating and comparing speaker diarization systems but it can
part of PLUMCOT primary run is the combination of two mod-
also be applied for face clustering by replacing speech turns by
ules with their own set of hyper-parameters: face clustering (θfc )
face tracks.
and speaker diarization (θsc and λfc ). Instead of tuning the for-
mer for optimal face clustering performance and the latter for
7.3. Implementation details
optimal speaker diarization separately, the whole pipeline (in-
cluding the assignment step described in Equation 4) is jointly 7.3.1. Face clustering and recognition
optimized.
Practically, we use the Covariance Matrix Adaptation Evo- As already stated in Section 2, we use the pre-trained face de-
lution Strategy minimization method [15] available in the tector and face embedding available in dlib library [7], wrapped
chocolate library4 to automatically select the set of hyper- in our pyannote.video toolkit5 . All hyper-parameters of the face
4 chocolate.readthedocs.io 5 github.com/pyannote/pyannote-video

196
clustering and recognition pipeline are jointly optimized in or- 6.86% while c2speaker only gets DER = 10.68%). This shows the
der to minimize the (face) diarization error rate on the only an- benefit of the joint optimization of hyper-parameters: a better
notated video of the RTVE2018 development set provided by “face” system does not necessarily lead to a better multimodal
the organizers of the challenge. “speaker” pipeline.
As described in details in [16], ODESSA “speaker” primary
7.3.2. Speaker diarization run is the combination at similarity level of three different rep-
resentations (x-vector trained on NIST SRE data, triplet loss
Feature extraction. All modules in the speaker diarization
embedding trained on VoxCeleb and binary key). This complex
pipeline share the same feature extraction step: 19 MFCC coef-
system reaches a performance of DER = 7.21% which is still be-
ficients (with their first and second derivatives, and the first and
low the simpler multimodal PLUMCOT primary run (that com-
second derivatives of the energy) are extracted every 10ms on a
bines triplet loss speaker embedding and neural face embed-
25ms windows. The only exception is the re-segmentation step
ding) with DER = 6.86%. One could hope that combining both
that does not use any derivative.
approaches would help us get even closer to perfect diarization.
Segmentation. Both speech activity and speaker change de-
tection modules are trained with the Catalan broadcast news
database from the 3/24 TV channel proposed for the 2010 Al- 9. Conclusion and future work
bayzin Audio Segmentation Evaluation [17]. We use the exact We have conducted experiments on monomodal face clustering
same configuration as the one described in [10]: stacked bi- and speaker diarization and shown an improvement of the re-
directional LSTMs and multi-layer perceptron on 3.2s sliding sults when we combine them into a multimodal approach. It has
windows. also been shown that combining two monomodal approaches
Speaker embedding. Speaker embeddings are trained using tuned separately does not automatically lead to the best results:
VoxCeleb1 dataset [18]. We use the exact same architecture one should rather tune them jointly using a global optimization
as the one used in [13] (stacked bi-directional LSTMs on a 3s process.
window) and the training process introduced in [19] (triplet loss While results of the multimodal approaches are promising,
with angular distance). there is still room for improvement. In particular, we plan to
Speaker diarization pipeline. Once every module is trained, investigate the use of the talking-face detection approach intro-
hyper-parameters of the speaker diarization pipeline are jointly duced in [20] to improve the module in charge of mapping face
optimized in order to minimize the diarization error rate on the clusters with speaker clusters.
development set (dev2) of RTVE2018 corpus provided by the Finally, we would like to highlight the fact that the code
organizers of the challenge. for most monomodal building blocks is available for other re-
searchers to use67 .
8. Results and discussion
Table 1 summarizes the performance of each submission on the 10. Acknowledgements
development set. Official results on the test set were not avail- This work was partly supported by ANR through the ODESSA
able at the time of writing the paper. (ANR-15-CE39-0010) and PLUMCOT (ANR-16-CE92-0025)
projects.
Table 1: Diarization error rate (%) on the development set.

  Consortium   Run             Speaker   Face
  PLUMCOT      primary           6.86    28.15
  PLUMCOT      contrastive 1    10.59    28.15
  PLUMCOT      contrastive 2    10.68    31.01
  ODESSA       primary           7.21    28.15
  ODESSA       contrastive 1     9.29    28.15
  ODESSA       contrastive 2    11.46    28.15

11. References
[1] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, “Albayzin evaluation: Iberspeech-rtve 2018 multimodal diarization challenge,” 06 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/EvalPlan-Multimodal-v1.3.pdf
[2] H. Bredin and G. Gelly, “Improving speaker diarization of tv series using talking-face detection and clustering,” in Proceedings
of the 2016 ACM on Multimedia Conference. ACM, 2016, pp.
157–161.
Comparing “speaker” parts of PLUMCOT primary run [3] Y. Yusoff, W. Christmas, and J. Kittler, “A study on automatic
(DER = 6.86%) and constrative run #1 (DER = 10.59%) shows shot change detection,” in European Conference on Multimedia
that speaker diarization can be greatly improved when guided Applications, Services, and Techniques. Springer, 1998, pp. 177–
by face clustering: this amounts to a relative improvement of 189.
35%. Face clustering also helps significantly for face recogni- [4] N. Dalal and B. Triggs, “Histograms of oriented gradients for
tion: it is improved from DER = 31.01% for track-wise face human detection,” in Computer Vision and Pattern Recognition,
recognition (c2face ) to DER = 28.15% for cluster-wise face 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.
IEEE, 2005, pp. 886–893.
recognition (pface ).
There is no difference between the face primary run and con- [5] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate
trastive run #1, maybe because faces found during long silences scale estimation for robust visual tracking,” in British Machine
were already removed by the recognition threshold θ_fa using Vision Conference, Nottingham, September 1-5, 2014. BMVA
the enrollment data. Press, 2014.
with the enrollment data.
[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
While cluster-wise face recognition (DER = 28.15%, pface )
for image recognition,” in Proceedings of the IEEE conference on
is better than raw face clustering (DER = 46.02%, not shown computer vision and pattern recognition, 2016, pp. 770–778.
in Table 1) for the “face” part, the latter does lead to bet-
ter “speaker” performance than the former when jointly opti- 6 github.com/pyannote/pyannote-audio

mized with the speaker diarization pipeline (pspeaker gets DER = 7 github.com/pyannote/pyannote-video

197
[7] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Ma-
chine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.
[8] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning
large face datasets,” in Image Processing (ICIP), 2014 IEEE In-
ternational Conference on. IEEE, 2014, pp. 343–347.
[9] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recog-
nition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
[10] R. Yin, H. Bredin, and C. Barras, “Neural Speech Turn Seg-
mentation and Affinity Propagation for Speaker Diarization,” in
19th Annual Conference of the International Speech Communica-
tion Association, Interspeech 2018, Hyderabad, India, September
2018.
[11] G. Gelly and J.-L. Gauvain, “Minimum Word Error Training of
RNN-based Voice Activity Detection,” in 16th Annual Confer-
ence of the International Speech Communication Association, In-
terspeech 2015, Dresden, Germany, September 2015, pp. 2650–
2654.
[12] R. Yin, H. Bredin, and C. Barras, “Speaker Change Detection in
Broadcast TV using Bidirectional Long Short-Term Memory Net-
works,” in 18th Annual Conference of the International Speech
Communication Association, Interspeech 2017, Stockholm, Swe-
den, August 2017.
[13] G. Wisniewksi, H. Bredin, G. Gelly, and C. Barras, “Combin-
ing speaker turn embedding and incremental structure prediction
for low-latency speaker diarization,” in 18th Annual Conference
of the International Speech Communication Association, Inter-
speech 2017, Stockholm, Sweden, September 2017.
[14] B. J. Frey and D. Dueck, “Clustering by Passing Messages Be-
tween Data Points,” science, vol. 315, no. 5814, pp. 972–976,
2007.
[15] C. Igel, N. Hansen, and S. Roth, “Covariance matrix adapta-
tion for multi-objective optimization,” Evolutionary Computation,
vol. 15, no. 1, pp. 1–28, 2007.
[16] J. Patino, H. Delgado, R. Yin, H. Bredin, C. Barras, and N. Evans,
“ODESSA at Albayzin Speaker Diarization Challenge 2018,” in
IberSPEECH, Barcelona, Spain, November 2018.
[17] A. Ortega, D. Castan, A. Miguel, and E. Lleida, “The albayzin
2012 audio segmentation evaluation.” Iberspeech 2012, 2012.
[18] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-
scale speaker identification dataset,” in 18th Annual Conference
of the International Speech Communication Association, Inter-
speech 2017, Stockholm, Sweden, September 2017.
[19] H. Bredin, “TristouNet: Triplet Loss for Speaker Turn Embed-
ding,” in ICASSP 2017, IEEE International Conference on Acous-
tics, Speech, and Signal Processing, New Orleans, USA, March
2017.
[20] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in
the wild,” in Workshop on Multi-view Lip-reading, ACCV, 2016.

198
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge
Miquel India, Itziar Sagastiberri, Ponç Palau, Elisa Sayrol, Josep Ramon Morros, Javier Hernando

Universitat Politècnica de Catalunya (UPC)


miquel.india@tsc.upc.edu, itziar.sf@opendeusto.es, ppalaupuigdevall@gmail.com,
[elisa.sayrol, ramon.morros, javier.hernando]@upc.edu

Abstract

This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. This approach works by processing the speech and the image signals individually. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet-loss DNN (Deep Neural Network) that uses i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between the embeddings. To detect identities from the face modality, a face detector followed by a face tracker has been used on the videos. For each cropped face, a feature vector is obtained using a Deep Neural Network based on the ResNet-34 architecture, trained using a metric-learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features obtained for each of the frames of that track. Then, this feature vector is compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
Index Terms: Speaker Diarization, Face Diarization, Multimodal System

1. Introduction

Video broadcasting generates a massive amount of multimedia information that, once archived, generates a need to access its contents. In particular, it is essential to develop tools that are able to automatically search for and detect the presence of people. Challenges such as REPERE [1] or the MediaEval Multimodal Person Discovery in Broadcast TV [2], [3] addressed the identification of people appearing and speaking.

Two main approaches can be found in the literature for person identification in videos. Perhaps the most popular is based on clustering face tracks, speech segments or both [4, 5]. This provides multiple clusters (groupings of signal segments) that correspond to the identities in the video; an assignment of names to clusters is then performed. The main problem of this approach is that a large number of unimportant identities can appear, and that the clusters are highly non-homogeneous. Combining the speech and face modalities is also a challenge in these systems. The second usual approach is verification on the signal segments [4, 6]. Here, two stages are defined: enrollment and search. Each segment is compared against the enrollment data and a decision is made for each segment. In several cases, both approaches (clustering and verification) use some kind of metric learning to improve the discriminativeness of the feature vectors.

This paper presents the UPC proposal to the Albayzin Evaluation: IberSPEECH-RTVE 2018 Speaker Diarization Challenge. In this challenge, a list of people occurrences within the RTVE2018 database should be provided as a result. This list must contain people if they are talking, if their faces appear in the video, or both at the same time. The two modalities should therefore be provided. Identities and enrollment data are given, so the identification might be considered supervised. Nevertheless, unknown persons may appear in the evaluation data and should be distinguished from those that are known.

This paper is organized as follows: Section 2 describes the system that has been developed to detect identities from the image and the speech modalities. Section 3 gives the technical details of the setup and the experimental results on the provided database. Finally, conclusions and final remarks are given in Section 4.

2. System Description

The UPC Multimodal Speaker Diarization System consists of two monomodal systems and a fusion block that combines the outputs of the previous systems so as to obtain a more refined speaker labelling. The speech and the image are processed independently with a speaker and a face diarization system. The face tracks and the speaker segments are then fused with an algorithm that combines the intersection of these sources according to a set of assumptions made on the data. The next section describes in detail the monomodal approaches and the fusion system that combines them.

2.1. Video System

The video system is responsible for localizing the faces of the individuals appearing in the scene and for determining whether these faces belong to one of the N given identities.

Our approach is based on performing face tracking to identify the intervals where a given person appears in the video. A face track consists of the locations of the faces of an individual in the consecutive frames where he/she appears in the video. Thus, the face track determines the spatial location of the faces and the temporal interval in the video where this person appears. Then, each face track is forwarded through a classifier with N + 1 classes, namely the identities of the known persons (that is, the set of persons in the enrollment set) plus the unknown class. The approach is summarized in Figure 1.

Figure 1: Block diagram for the face modality.

199 10.21437/IberSPEECH.2018-40
Our approach uses a tracking by detection approach: First,
all the faces in the video sequence are detected. For this, we use
a detector based on HOG+SVM2 [7] from the dlib [8] library.
Once the faces have been detected, a KLT3 tracker [9, 10,
11] is used to relate the detections in successive frames. We
used the implementation4 provided for the baseline system of
the Multimodal Person Discovery in Broadcast TV task in Me-
diaEval 2015 [2].
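As a minimal illustration of this detection step (our own sketch, not the UPC code; the video file name is a placeholder), dlib's HOG+SVM frontal face detector can be run frame by frame as follows; the returned bounding boxes are then linked across frames by the KLT tracker.

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()   # HOG + linear SVM face detector

    def detect_faces(video_path="show.mp4"):
        """Yield (frame_index, list of face bounding boxes) for every video frame."""
        capture = cv2.VideoCapture(video_path)
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes = [(r.left(), r.top(), r.right(), r.bottom()) for r in detector(rgb, 1)]
            yield index, boxes
            index += 1
        capture.release()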
As mentioned previously, a face track provides the spatial
location of a set of faces of a given individual, which are used
for feature extraction, and the temporal interval where this per-
son is visible in the video.
In the video system, the track is the basic unit of recog-
nition: we will output a result for each track that is classified
as belonging to one of the known persons. Tracks classified as
unknown are discarded and no output is provided.
To characterize each track, we follow a two step process:
first, a feature vector is extracted for each detected face in the
track. Then, the final feature vector for the track is obtained by
averaging all the track’s feature vectors.
These feature vectors are obtained using the last fully connected layer of a Deep Neural Network based on the ResNet-34 architecture [12], trained using the metric-learning triplet loss process described in FaceNet [13]. This learns a mapping from the detected faces to a compact space where the feature vectors (i.e., 128-dimensional FaceNet embeddings) originating from the faces of a given individual are located in a separate and compact region of the space. Thus, the vectors are highly discriminative, allowing standard techniques to be used for classification/verification. We have used the off-the-shelf dlib [8] implementation, without any adaptation or fine-tuning to the task identities.

A similar method is used to extract the feature vectors for the images and videos of the enrollment set. For each person, 10 still images and one short clip were provided. For each still image, we detect the face and extract a feature vector. The short video is processed similarly to the test video: scene detection, face detection and face tracking. A feature vector is extracted for each resulting track. This results in a variable number of enrollment vectors for each person, depending on the number of tracks in the short video. These vectors are associated with the name of the corresponding person and used as a person model.

To decide the track identity, we used a k-NN classifier with a cosine distance metric. A global threshold is applied to determine whether the track belongs to any of the persons in the database. If this is the case, the identity corresponding to the nearest vector in the database is used as the track identity. This simple method is possible because of the highly discriminative properties of the 128-dimensional FaceNet embeddings.
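As an illustration, the following minimal sketch (our own, not the UPC system code; array names are placeholders and the default rejection threshold simply reuses the value reported later in Section 3.2) averages per-frame 128-dimensional face descriptors into a track embedding and classifies it by nearest-neighbour search with cosine distance, rejecting to "unknown" when the nearest distance exceeds the threshold.

    import numpy as np

    def classify_track(frame_embeddings, enroll_vectors, enroll_names, threshold=0.47):
        """frame_embeddings: (F, 128) per-frame descriptors of one face track;
        enroll_vectors: (M, 128) enrollment descriptors with names enroll_names."""
        track = np.mean(frame_embeddings, axis=0)
        track /= np.linalg.norm(track)
        enroll = enroll_vectors / np.linalg.norm(enroll_vectors, axis=1, keepdims=True)
        distances = 1.0 - enroll @ track          # cosine distances
        best = int(np.argmin(distances))
        return enroll_names[best] if distances[best] < threshold else "unknown"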
2.2. Speaker System

The speaker system works as a tracking algorithm that uses speaker embeddings to compare speech signal segments with the speech utterances of the enrollment identities. These representations are created with a DNN which is fed with i-vectors and is trained with a multi-objective loss (Figure 2). This loss is based on a triplet margin loss and a regularization function which minimizes the variance of both positive and negative tuple distances.

Figure 2: Speaker front-end diagram.

I-vectors are low-rank vectors, typically between 400 and 600 dimensions, representing a speech utterance. Given a speech signal, acoustic features like Mel Frequency Cepstral Coefficients (MFCC) are extracted. These feature vectors are modeled by a set of Gaussian mixtures (GMM) adapted from a Universal Background Model (UBM). The mean vectors of the adapted GMM are stacked to build the supervector M, which can be written as:

    M = µ + Tω    (1)

where µ is the speaker- and session-independent mean supervector from the UBM, T is the total variability matrix, and ω is a hidden variable. The mean of the posterior distribution of ω is referred to as the i-vector. This posterior distribution is conditioned on the Baum-Welch statistics of the given speech utterance. The T matrix is trained using the Expectation-Maximization (EM) algorithm given the centralized Baum-Welch statistics from background speech utterances. More details can be found in [14].

Given an i-vector, a DNN is used to extract a more discriminative speaker vector. This DNN is composed of two hidden layers of 400 nodes, where the activations of the second layer are used as the speaker embedding. The network is fed with i-vectors, and an initial L2 normalization is applied to these inputs before the first hidden layer. After each hidden layer, a batch normalization layer is used as a regularizer. Initially, the DNN is pretrained as a speaker classifier: a softmax layer is added at the output of the network and the DNN is trained by minimizing the cross-entropy loss. Following this pretraining, the softmax layer is removed and the DNN is trained with the following multi-objective loss:

    Loss = (1/N) Σ_{i=1}^{N} [ TLoss_i + (λ/2) RLoss_i ]    (2)

    TLoss_i = max(0, d(A_i, P_i) − d(A_i, N_i) + margin)    (3)

2 Histogram of Oriented Gradients + Support Vector Machine
3 Kanade-Lucas-Tomasi tracker
4 https://github.com/MediaevalPersonDiscoveryTask/Baseline2015

200
    RLoss_i = (1/N) Σ_{j=1}^{N} |d(A_i, P_i) − d(A_j, P_j)| + (1/N) Σ_{j=1}^{N} |d(A_i, N_i) − d(A_j, N_j)|    (4)

where TLoss corresponds to the triplet margin loss [13] and RLoss corresponds to our proposed regularization function. TLoss is computed with a hinge loss, and d(x, y) refers to the second-order (Euclidean) distance between a pair of vectors. RLoss, on the other hand, is a function that forces the DNN to minimize the variance of both positive and negative tuple distances. Hence, in each batch we estimate the means of the positive and negative scores. These means are then used to minimize the distance between each positive or negative pair distance and its corresponding mean. We add a λ penalty term so as to balance the magnitude of the regularization function in the global loss.
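A minimal PyTorch-style sketch of this multi-objective criterion is given below. It is our own illustration, not the authors' code: tensor names and the margin value are assumptions, λ = 1 follows the setting reported in Section 3.1, and Eq. (4) is implemented as reconstructed above (mean absolute deviation of each pair distance from the other pair distances in the batch).

    import torch

    def multi_objective_loss(anchor, positive, negative, margin=0.2, lam=1.0):
        """anchor, positive, negative: (N, D) batches of speaker embeddings."""
        d_pos = torch.linalg.norm(anchor - positive, dim=1)   # d(A_i, P_i)
        d_neg = torch.linalg.norm(anchor - negative, dim=1)   # d(A_i, N_i)
        t_loss = torch.clamp(d_pos - d_neg + margin, min=0.0)               # Eq. (3)
        r_pos = (d_pos.unsqueeze(1) - d_pos.unsqueeze(0)).abs().mean(dim=1)
        r_neg = (d_neg.unsqueeze(1) - d_neg.unsqueeze(0)).abs().mean(dim=1)
        r_loss = r_pos + r_neg                                               # Eq. (4)
        return (t_loss + 0.5 * lam * r_loss).mean()                          # Eq. (2)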
The i-vector framework combined with the DNN is used as a front-end block in the speaker tracking system. This front-end allows features to be extracted from the speech signal and compared with the signals of the speaker targets. In our approach, a sliding-window strategy has been used to extract speaker embeddings from speech segments of 3 seconds with a shift of 0.25 seconds. For the enrollment identities, we have used the whole signal to extract an embedding for each target. The cosine distance metric is then used to evaluate the similarity of the speech segment embeddings to each target, and the target with the highest similarity is assigned to the corresponding speech segment. In order to handle non-interest or unknown speakers, a threshold is imposed on the assignment between the best candidate and the speech segment: if the score of the most similar target is below the threshold, the speech segment is automatically tagged as an unknown identity.
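A simplified sketch of this sliding-window assignment step is shown below (our own illustration, not the UPC code; the embedding extractor is abstracted as a function argument, and the window length, hop and threshold default to the values reported in this paper).

    import numpy as np

    def track_speakers(signal, sr, embed_fn, targets, win=3.0, hop=0.25, thr=0.08):
        """Slide a 3 s window with a 0.25 s shift over the signal, embed each window
        with embed_fn, and assign the enrollment target (dict name -> vector) with
        the highest cosine similarity, or 'unknown' below the threshold."""
        names = list(targets)
        refs = np.stack([targets[n] / np.linalg.norm(targets[n]) for n in names])
        labels = []
        step, size = int(hop * sr), int(win * sr)
        for start in range(0, len(signal) - size + 1, step):
            emb = embed_fn(signal[start:start + size])
            emb = emb / np.linalg.norm(emb)
            scores = refs @ emb                      # cosine similarities
            best = int(np.argmax(scores))
            labels.append((start / sr, names[best] if scores[best] >= thr else "unknown"))
        return labels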
2.3. Fusion System

A fusion system has been considered in order to combine the previous information sources. Speaker and video diarization are performed first in an individual manner. The results of both modalities are then fused so as to obtain a better speaker assignment. In order to combine both outputs properly, we made the following assumptions:

• Speaker segments and face tracks of the same person are temporally correlated. Hence, it is very likely that the person who appears in the video is the one who is speaking.

• Some speakers never come into view in the show, and there are other people who are shown on the screen but do not speak. These faces and speakers correspond for the most part to the unknown identities.

According to these assumptions, an algorithm has been designed based on weighting temporal overlaps between the tracks of the face system and the speech segments of the speaker system (Figure 3). As shown in the figure, the intersection between face tracks and speaker segments produces a new multimodal segmentation. The temporal segments where face and speech do not overlap are discarded. We use this new segmentation to combine the assignments of both modalities:

• The segments where the corresponding face/speaker segments have the same target assigned are automatically tagged with that identity.

• When the speaker and face assignments are not the same, we produce a new score combining the distances of both modalities between the segment and the enrollment targets. First we extract the scores of the multimodal segment for each modality. The ranges of these scores are different for each source, so they need to be normalized. This normalization is performed with a softmax activation which has a different temperature parameter τ for each modality. A new set of scores is then produced by averaging the scores of both modalities. Given these new multimodal scores, a new threshold is used to determine whether the segment corresponds to the most similar target or to an unknown identity.

Figure 3: Fusion scheme. Green boxes refer to segments where the id assignment has been directly propagated. Orange boxes refer to segments whose id has been assigned after the score combination.

3. Optimization and Experimental Results

In the following section we describe the setup of the proposed approaches and we present the results of these systems for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge.

3.1. Speaker System

The speaker front-end block has been trained on the VoxCeleb2 [15] database. Feature extraction is performed with 20 MFCCs plus delta features. The UBM is a GMM with 1024 mixtures and the T matrix rank is 400. For the whole i-vector framework we have used the Alize [16, 17] toolkit, and we have only used the first 1000 speakers of the VoxCeleb2 development partition. The DNN used is composed of two hidden layers of size 400. The pretraining has been performed using the same data used for the i-vector framework. For the triplet-based DNN training, the whole VoxCeleb2 development partition has been used. In order to obtain a good estimation of the positive and negative pair means, the batch size has been set to 1024. The λ for the RLoss has been set to 1. Both network trainings have been performed with the Adam optimizer. The learning rate has been set to 0.01 and the pretraining has been regularized with an additional 0.001 weight decay. For the target assignment, the decision threshold has been tuned to improve DER results on the RTVE2018 development set. A final threshold value of 0.08 over the cosine distance (in range [-1, 1]) has been obtained.

3.2. Video system

The method described in Section 2.1 has been used to obtain the results. We have filtered out short tracks (shorter than 1 s) because they are likely to belong to unimportant faces.

201
This also reduces the computational load of the system. For each track, a 128-dimensional feature vector has been generated. The final identity decision is determined by a k-NN classifier. As the number of enrollment vectors is low, a value of k = 1 has been used. By looking at the small Speaker Error Rate value in Table 1, this approach is effective, thanks to the discriminating power of the embeddings. The principal challenge in this task was the high number of tracks belonging to persons that are not in the enrollment set. To reject these tracks, a global threshold th has been used. This threshold has been determined as the value providing the best DER measure on the development set. A final value of th = 0.47 over the cosine distance (in range 0–1) to the nearest neighbor has been obtained.

3.3. Fusion system

Given the scores between the speaker/face signal segments and the target vectors, a softmax activation has been used to normalize the scores of each modality. In order to obtain similar scores, the softmax of each modality has been applied with a different temperature τ parameter. For speech, τ = 3 has been used, and for the face modality τ has been set to 2. For the fusion system, the target/non-target distance threshold has been set to 0.03.

3.4. Results

The proposed systems have been evaluated on the RTVE2018 database for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The development partition is composed of one video, with a duration of around 2 hours. Enrollment data (10 still images and a short video) is provided for a total of 34 identities. The test partition is composed of three test videos, with a total duration of around 4 hours, with enrollment data for 39 identities. The metric used to evaluate the systems is the Diarization Error Rate (DER), which is the sum of three different errors: Missed Speech (MISS), False Alarm (FA) and Speaker Error Rate (SER). In this challenge, the presented approaches are evaluated individually in each modality. Hence, a diarization result needs to be produced for both the speaker and face sources.

Table 1 shows the results of the presented approaches on the development partition. The first two rows correspond to the face and speaker systems evaluated with their corresponding face/speaker groundtruth. The fusion system corresponds to the combination approach described in Section 2.3. Therefore, the third and fourth rows of the table correspond to the fusion system evaluated with the speaker and the face groundtruth.

    System                 Miss     FA      SER      DER
    Monomodal Speaker      3.5%     5.7%    31.9%    41.13%
    Monomodal Face         37.9%    0.5%    1.9%     40.24%
    Fusion (Spk. Eval.)    26.6%    2.3%    38.2%    66.99%
    Fusion (Face Eval.)    51.7%    0.3%    26.9%    78.92%

Table 1: DER results on the development partition.

The speaker system shows a 41.13% DER, where the main source of error is the SER with 31.9%. The threshold used to decide whether a segment corresponds to a target or to an unknown identity produces a low MISS but leads to a higher FA and SER. We noticed that our system failed in segments where music was included in the background and with those targets whose enrollment signal was very different from the show in terms of channel variability. Adapting the model to the RTVE corpus could have improved the rate of error caused by these factors. On the other hand, using an initial speaker segmentation on the signal instead of a sliding-window strategy could also lead to better system performance.

For the face modality, the main source of error is the high amount of missed face time (37.9%); the FA and the SER, in contrast, are very low. The missed face time error could originate from two different causes. On the one hand, a threshold that is too low could cause many false rejections of valid tracks (i.e., tracks belonging to valid enrollment identities). On the other hand, this error could originate because the face detection/tracking failed to extract valid tracks. To determine which of these errors is predominant, we set the rejection threshold to its maximum value (th = 1), meaning that all tracks are accepted. After doing that, we found that the missed face error was still very high (37.2%). This indicates that the errors are mainly produced by the tracking step.

The fusion system presents worse results than the speaker and the face systems used individually. Both fusion systems, evaluated with the speaker and the video groundtruths, present higher MISS and SER in comparison with the monomodal systems. We have noticed that the multimodal segmentation does not improve the results because it automatically discards a lot of speaker segments and face tracks. On the one hand, it discards the segments where there is no overlap between speaker segments and face tracks. On the other hand, when more than one face appears in the video, the system automatically discards the face track of the person who is not speaking. Therefore, our assumptions would work better for finding who is shown and speaking at the same time, but not for this kind of multimodal evaluation.

4. Conclusions

We have presented two monomodal technologies and one multimodal technology to perform person identification in broadcast videos. A quantitative analysis has been performed on the RTVE 2018 dataset as provided in the Albayzin challenge. From the experiments it can be seen that the monomodal systems should be improved. For the speaker approach, it would be interesting to explore transfer learning methods to adapt our generic model to a smaller scenario and to include a speaker segmentation algorithm in the system. For the face modality, we plan to improve the face detection and tracking step, as it has been shown to be the main source of error for the face modality. There is also large room for improvement for the multimodal fusion system. Instead of fusing the systems from the outputs of the monomodal systems, an end-to-end multimodal system could work better if a large amount of data is available.

5. Acknowledgements

This work was supported in part by the Spanish project DeepVoice (TEC2015-69266-P) and the project MALEGRA (TEC2016-75976-R), financed by the Spanish Ministerio de Economía, Industria y Competitividad and the European Regional Development Fund (ERDF).

6. References

[1] O. G. G. Bernard and J. Kahn, “The first official repere evaluation,” in SLAM-Interspeech, 2013.
[2] J. Poignant, H. Bredin, and C. Barras, “Person discovery in broadcast tv at mediaeval 2015,” in Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.

202
[3] H. Bredin, C. Barras, and C. Guinaudeau, “Person discovery in
broadcast tv at mediaeval 2016,” in Working Notes Proceedings of
the MediaEval 2016 Workshop, 2016.
[4] N. Le, H. Bredin, G. Sargent, M. India, P. Lopez-Otero, C. Barras,
C. Guinaudeau, G. Gravier, G. Barbosa da Fonseca, I. Lyon Freire,
J. Z. Patrocinio, S. J. F. Guimares, M. Gerard, J. Morros, J. Her-
nando, L. Docio-Fernandez, C. Garcia-Mateo, S. Meignier, and
J. Odobez, “Towards large scale multimedia indexing: A case
study on person discovery in broadcast news,” in CBMI 2017.
[5] M. India, D. Varas, , V. Vilaplana, J. Morros, and J. Hernando,
“Upc system for the 2015 mediaeval multimodal person discov-
ery in broadcast tv task,” in Working Notes Proceedings of the
MediaEval 2015 Workshop, 2015.
[6] M. India, G. Marti, , E. Sayrol, J. Morros, J. Hernando, C. Cor-
tillas, and G. Bouritsas, “Upc system for the 2016 mediaeval mul-
timodal person discovery in broadcast tv task,” in Working Notes
Proceedings of the MediaEval 2016 Workshop, 2016, pp. 1–3.
[7] N. Dalal and B. Triggs, “Histograms of oriented gradients for hu-
man detection,” in CVPR’05, vol. 1, June 2005, pp. 886–893.
[8] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of
Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[9] B. D. Lucas and T. Kanade, “An iterative image registration tech-
nique with an application to stereo vision,” in IJCAI’81, 1981, pp.
674–679.
[10] C. Tomasi and T. Kanade, “Detection and tracking of point fea-
tures,” International Journal of Computer Vision, Tech. Rep.,
1991.
[11] J. Shi and Tomasi, “Good features to track,” in CVPR 1994, June
1994, pp. 593–600.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in CVPR 2016, June 2016, pp. 770–778.
[13] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified
embedding for face recognition and clustering,” in CVPR 2015,
06 2015, pp. 815–823.
[14] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet,
“Front-end factor analysis for speaker verification,” IEEE Trans.
on Audio, Speech, and Language Processing, vol. 19, no. 4, pp.
788–798, 2011.
[15] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep
speaker recognition,” in INTERSPEECH, 2018.
[16] J.-F. Bonastre, F. Wils, and S. Meignier, “Alize, a free toolkit for
speaker recognition,” in ICASSP’05, vol. 1, 2005, pp. I–737.
[17] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher,
A. Preti, G. Pouchoulin, N. W. Evans, B. G. Fauve, and J. S.
Mason, “Alize/spkdet: a state-of-the-art open source software for
speaker recognition.” in Odyssey, 2008, p. 20.

203
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

The GTM-UVIGO System for Audiovisual Diarization


Eduardo Ramos-Muguerza, Laura Docío-Fernández, José Luis Alba-Castro

AtlanTTic Research Center, University of Vigo


eramos, ldocio, jalba@gts.uvigo.es

Abstract 2. Video Processing


This paper explains in detail the Audiovisual system deployed Television programs such as news, debates, interviews, etc., are
by the Multimedia Technologies Group (GTM) of the atlanTTic characterized by frequent changes of shot and scene, the
research center at the University of Vigo, for the Albayzin appearance of multiple people and the mixture of different
Multimodal Diarization Challenge (MDC) organized in the scenes in the final configuration of frame that consumes the end
Iberspeech 2018 conference. This system is characterized by user. In this way, the final audiovisual content is very different
the use of state of the art face and speaker verification from the typical scenarios where biometric identification is
embeddings trained with publicly available Deep Neural used, such as restricted access, video security or mobile
Networks. Video and audio tracks are processed separately to scenarios.
obtain a matrix of confidence values of each time segment that
The solution we have adopted for this competition in the
are finally fused to make joint decisions on the speaker
video processing part is based on two fundamental ideas that
diarization result.
apply to this type of content. On the one hand, we know that a
Index Terms: speaker recognition, face recognition, deep change of shot implies, in general, a change in the person who
neural networks, image processing. appears on the scene, although it does not always happen and it
does not happen in the same way regarding the speaker. On the
1. Introduction other hand, the people who appear in a shot remain in it as long
as there is no movement of the camera or of the people
In recent years, the field of pattern recognition has witnessed a themselves. This way, detection of shot changes gives an
shift from the extraction of handmade features to machine- important clue for subsequent face processing.
learned features using complex neural network models.
Biometric verification is a clear example of an application 2.1. Detection of shot changes
scenario where Deep Neural Networks have produced a notable
increase in performance, providing transformations of space This subsection explains a simple approach to detect shot
where the face and voice of users are represented in clusters that changes that is designed to have more false positives than false
are more compact and separable than in the original sample negatives. Shot changes will be used to restart face trackers
space. This representation makes the problem of diarization and because we cannot rely on tracking a face through shot changes,
verification in multimedia content more tractable than with so losing a shot change could have a greater impact in the
previous approaches [1][2][3][4]. tracker than initializing the face tracker unnecessarily.
However, facial and speaker verification models are still Detection of movement is also an important feature to have
not perfect and make many mistakes in verifying the identity of a more complete understanding of the footage, but we haven’t
people in natural conditions. These situations are common included in this version of the system a specific movement
when analyzing audiovisual content with constant shot changes detection block. Instead, we have used the false positive rate of
and different types of scenarios, variability in the appearance of the shot detection block as an indication of movement.
faces (pose, expression and size), variability in the mix of The steps to detect a change of shot are the following:
voices, noise and background music. Also, the appearance of
1. Reduce the size of the frame to save computational load,
many other people who are not registered to be identified and
are considered "intruders" to the system, causes many false 2. Calculate the derivatives of the image to keep the edges of
identity assignments. the scene,
In this paper we explain the approach that the GTM 3. Divide the frame into blocks and calculate the mean of edge
research group has followed to tackle the person identification pixels per block,
problem in audiovisual content. We have prepared a system that 4. Subtract the mean of the same block in the previous frame,
works separately on the video and audio tracks and makes a 5. Set a threshold for considering that a block difference
final fusion to fine tune the speaker diarization result. The rest represents a change (threshold set with the development video
of the paper is organized as follows. Section 2 explains the footage),
video processing part, including the segmentation of the video
footage into different shots and the face detection, tracking, 6. Count the number of block changes and set a threshold
verification and post-processing at shot level. Section 3 defined for a change of shot (also using the development set).
explains the Speaker Diarization and Verification subsystem. This approach is very simple and fast. It also leaves a lot of
Section 4 deals with the fusion of modalities and Section 5 gives room for improvement by defining areas with different shot
the computational cost information. Finally, section 6 presents change thresholds depending on the type of scene or type of
the conclusions and details the on-going research lines. video realization. For example, in some of the videos of the
competition, the consumed scene have the frame divided into

204 10.21437/IberSPEECH.2018-41
different areas with different video content. It is quite common new detection should be more restrictive than assigning it to a
that the shots change at different pace in the different areas, so previous one located in an overlapped area. It is important to
a solution that can make local decisions on shot change is quite highlight that a shot change can make a face with different ID
useful. However, we left for future improvements of the system to appear in the same position that another face in the previous
the local detection of shot changes. For this version we just set frame before the shot. So, a change of shot restarts the tracking
a permissive global threshold that allows detection of total or process.
partial change of shot, movement and fading as a unique event.
2.2.3. Face clusters augmentation
2.2. Face processing
The unreliable robustness of the face recognizer with extreme
The face processing subsystem comprises several sequential poses and expressions lead us to define a strategy to cope with
operations that are briefly explained through the Figure 1 and potential erroneous ID assignments.
in the subsections below. A previous study of the capacity of dlib’s facial recognition
network, advised us to use a restrictive Th_newID, so
2.2.1. Face Detection and Geometric Normalization comparison of faces with extreme poses between the current
embedding and the embeddings in the enrollment dataset will
Face detection is a fundamental step in the sequential
not surpass it. This way, it is less likely that a wrong ID
processing. We have used the detector based on Multi-Task
assignment is propagated through a tracked BB. Only when the
Cascaded Convolutional neural Network [5], that jointly finds
head moves to a more frontal pose, or a similar extreme pose is
a Bounding Box for the face and five landmarking points useful
stored in the enrollment dataset, the ID is assigned to the BB
to normalize the face. This face detector is quite robust to pose,
and propagated. But, what happens if the face in the shot is
expression and illumination changes. False negatives are
always in an extreme pose? We need a way to enrich the
typical in extreme poses with yaw angles beyond +/-60º and
enrollment dataset with new samples of a specific ID that
pitch angles beyond +/- 40º, that are not so uncommon in
appear in the content but are different enough from the
interview and debate contents. This approach also brings a bit
enrollment dataset. Given that the Th_previousID is more
amount of false positives in areas where textured objects with
relaxed, during a shot where a previous match has been
skin colors appear, like hands, arms and other not human
assigned, several different face samples with almost any pose
objects.
and expression can enrich the enrollment dataset. The criterion
Once a face is detected (being true detection or not), its
to enrich the dataset is the increase of variance of the ID face
bounding box (BB) is saved with several parameters that will
cluster. Then, the enriched dataset is ready to use in the next
allow to do tracking and assign identities during the process. An
frame.
overlapping function between the current BB, and the BBs of
the previous frame allows linking the BBs belonging to the 2.2.4. Face ID backtracking
same person and do backtracking when the shot has finished.
The detected face is passed to a geometric normalization Once a shot change is detected, an online post-processing is run
that prepares the face to be plugged in in a standardized way to to reassign IDs or even delete potentially wrong assignments in
the face recognition block. the past shot. This block is based on heuristic rules defined after
observing the typical behavior of the previous processing
2.2.2. Face recognition blocks in the development scenarios. If the detector and
recognizer were ideal, a shot without rapid movements or with
We have used the face recognizer based on dlib’s
slow camera movement should contain just a number of BB
implementation [6] of the Microsoft ResNet DNN [1]. This
tracks that matches the number of persons in the shot; the first
DNN finds an embedded space where similar faces are grouped
BB should appear in the first shot frame and the last BB of its
together and far from different faces. This network is also
BB track should appear in the last frame of the shot. But things
trained to be quite robust to pose, expression and illumination
are not perfect, so the rationale of the backtracking post-
changes. In this case, the network makes quite many false
processing is based on the next observations:
identification assignments when poses are beyond +/-50º in
yaw and +/- 20º in pitch. Also extreme facial expressions
produce false assignments, being quite robust, though, for  False negatives of the detector break the paths of the BBs,
neutral and smiling faces (the great bulk of face images found so there will be more paths than in the ideal case.
in internet-based datasets for face recognition). However, TV  False initial matchings in every new track will yield tracks
contents offer more facial expressions of emotion than neutral with incorrect IDs.
and smiling, so false assignments are quite common also in  The number of persons in the shot can be roughly estimated
these cases. by the average number of detections. This assumption is
After a face is detected for the first time in a shot, meaning broken when extreme poses appear during a long period,
that no previous BB is linked to the current one, a candidate ID producing an underestimated number of persons in the shot.
is assigned to the BB if it surpasses a distance threshold for new  Every ID matched in the shot can appear in several broken
IDs (Th_newID) when comparing the embedded vector against paths. The accumulated confidence of that ID gives a rough
all the embedded vectors of the enrollment set. The closest ID approximation of its probability of being in that shot.
is kept for that BB. A confidence value is also assigned to that
ID in that specific BB. If the BB is linked to a previous BB, the Using these observations we defined a rule to keep tracks
embedded vector is compared just with the cluster of enrolled of a number of persons just above the estimated average and
vectors of the previous candidate ID. If it surpasses a second with the IDs with largest accumulated confidence. Those IDs
threshold (Th_previousID) then the same ID is assigned to the are finally assigned to the time intervals within and across shots
BB. The rationale of keeping two different thresholds and and written down in an rttm file.
having Th_newID > Th_previousID is that assigning an ID to a

205
Figure 1: Flow diagram of face processing.

3. Speaker Diarization and Recognition

The developed strategy for speaker diarization and verification uses a DNN trained to discriminate between speakers, which maps variable-length utterances or speech segments to fixed-dimensional embeddings that are also called x-vectors [2].

A pretrained deep neural network downloaded from http://kaldi-asr.org/models.html was used. The network was implemented using the nnet3 neural network library in the Kaldi Speech Recognition Toolkit [7] and trained on augmented VoxCeleb 1 and VoxCeleb 2 data [8].

Figure 2 represents the block diagram of the speaker diarization and recognition subsystem.

Figure 2: Block diagram of the speaker diarization and recognition subsystem.

3.1. Speaker enrollment

The audio signal provided for each person in the enrollment set is used to obtain DNN speech-based embeddings. A sliding window of at least 10 seconds with a half-second hop is used. Then, these embeddings are clustered using the Chinese Whispers algorithm [9]. The threshold of the clustering algorithm is adjusted so that the clusters are pure and at least as many as the number of identities in the enrollment set. In this way, an enrolled person can be represented by one or more clusters.

3.2. Off-line Speaker Diarization

First, the audio signal is divided into 3-second segments with a half-second hop. Short-term DNN audio embeddings are extracted for each of these segments and clustered using the Chinese Whispers algorithm, and their timestamps are kept. From the clustering result we obtain an audio segmentation. Next, each of these segments, of arbitrary duration, is processed in order to extract one or more long-term audio embeddings using the same DNN. To do this, a sliding window of at least 10 seconds with a half-second hop is used. Then, these embeddings are clustered again using the Chinese Whispers algorithm, with a threshold that minimizes the diarization error.
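For illustration, a compact sketch of clustering embeddings with a Chinese Whispers-style procedure over a cosine-similarity graph is given below (our own simplified version of the algorithm in [9], not the authors' implementation; the similarity threshold and iteration count are placeholders).

    import numpy as np

    def chinese_whispers(embeddings, sim_threshold=0.5, iterations=20, seed=0):
        """Cluster embeddings: connect vectors whose cosine similarity exceeds the
        threshold, then repeatedly relabel each node with the most frequent label
        among its neighbours until the labels stabilise."""
        rng = np.random.default_rng(seed)
        x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        adjacency = (x @ x.T) > sim_threshold
        np.fill_diagonal(adjacency, False)
        labels = np.arange(len(x))            # every node starts in its own cluster
        for _ in range(iterations):
            for i in rng.permutation(len(x)):
                neighbours = labels[adjacency[i]]
                if neighbours.size:
                    values, counts = np.unique(neighbours, return_counts=True)
                    labels[i] = values[np.argmax(counts)]
        return labels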

206
3.3. On-line Identity Assignment 2018. The system uses state of the art DNN algorithms for face
detection and verification and also for speaker diarization and
The clusters obtained in the previous step need to be assigned verification. The application scenario is studied to implement
to the enrollment identities. Keeping the timestamps of each ad-hoc post-processing strategies to fine-tune the ID
embedding in the clustering process, allows to design an online assignments made by the video and audio parts. Specifically,
ID assignment approach. Temporal segments are defined as the information on shot changes are exploited to avoid tracking
consecutive timestamps with embeddings associated to the faces across shots. Confidence matrices are used in a fusion
same cluster. The ID assigned to a time segment is the strategy that allows changing pre-assigned speaker identities.
enrollment ID of the best-matching enrollment cluster, as far as This framework leaves a lot of room for improvement in each
this distance is less than a threshold. This threshold is defined of the fundamental processing stages and also in the ad-hoc
after observing the typical behavior of the system in the rules for fine-tuning. One of the main future lines consist of
development scenarios. A confidence value for that ID in that increasing the robustness of face matchings for extreme poses
specific temporal segment is stored to be used jointly with the and expressions and the robustness of speaker ID assignments
face-based confidence value in the fusion process. in overlapped speech. From a video processing point of view, a
better characterization of the display montage would allow the
application of post-processing rules less prone to errors.
4. Fusion Finally, speeding up some DNN critical parts by using GPUs
and efficiently coding some other parts would allow real-time
Once a decision was made for both modalities, a multimodal processing of the system.
fusion approach was implemented in order to correct potentially
wrong speech-based ID assignments. Given a temporal segment
that has been assigned a speaker identity ID1, if a high-
Acknowledgements
confidence single face identity ID2 has been detected in more This work has received financial support from the Spanish
than 60% of the video frames in that speaker ID1 segment, the Ministerio de Economía y Competitividad through project
speaker ID1 is changed to the face identity ID2. ‘TraceThem’ TEC2015-65345-P, from the Xunta de Galicia
This rule doesn’t apply if ID1 and ID2 have different gender (Agrupación Estratéxica Consolidada de Galicia accreditation
(as given by the enrollment name). 2016-2019) and the European Union (European Regional
Much more elaborated fusion rules can be applied at this Development Fund ERDF).
stage, but only this one was tested for the competition.
The results over the Development video provided by the References
organizers of the competition are presented in Table 1.
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning
for Image Recognition,” in IEEE Conference on Computer
Table 1: DER results on the Development video
Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
Modality DER [2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur,
“Deep neural network embeddings for text-independent
Face 36.20% speaker verification,” in INTERSPEECH 2017 – 97th Annual
Speaker 14.25% Conference of the International Speech Communication
Speaker Fusion 7.51% Association, Proceedings, pp. 999–1003, 2017.
Average DER 21.85% [3] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end
text-dependent speaker verification,” in 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2016, pp. 5115–5119.
5. Computational Cost [4] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey and A. McCree,
The computational cost of the proposed audiovisual diarization "Speaker diarization using deep neural network embeddings,"
2017 IEEE International Conference on Acoustics, Speech and
system was measured in terms of the real-time factor (RT). This Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 4930-
measure represents the amount of time needed to process one 4934.
second of audiovisual content: xRT = P/I, where I is the [5] K. Zhang, Z. Zhang, Z. Li and Y. Qiao, “Joint Face Detection
duration of the processed video and P is the time required for and Alignment Using Multitask Cascaded Convolutional
processing it. The whole development video was processed to Networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp.
compute the RT, thus taking into account many different 1499-1503, Oct. 2016.
audiovisual situations. The duration of this video is I = 7410 s, [6] D. E. King. “Dlib-ml: A Machine Learning Toolkit, Journal of
and the time needed to process it was P = 33457 s, leading to Machine Learning Research,” 10:1755-1758, 2009.
[7] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N.
RT = 4.51. These computation time was obtained by running Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al.,
this experiment on an Intel(R) Core(™) i5 CPU 670@3.47 GHz “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop
with 12 GB RAM. Even though the process is running more on Automatic Speech Recognition and Understanding (ASRU),
than 4 times slower than real-time, the code is not optimized at pp. 1-42011.
all (some parts are coded in Matlab) and the machine is just [8] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large
using 1 CPU and no GPU. We are working to speed up the scale speaker identification dataset,” INTERSPEECH 2017 –
process and expect to have it running in real-time in the next 97th Annual Conference of the International Speech
months. Communication Association, 2017.
[9] Chris Biemann, “Chinese whispers: an efficient graph clustering
algorithm and its application to natural language processing
6. Conclusions and future work problems,” in First Workshop on Graph Based Methods for
Natural Language Processing (TextGraphs-1), pp. 73-80, 2006.
We have presented the GTM-UVIGO System deployed for the
Albayzin Multimodal Diarization Competition at Iberspeech

207
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

The SRI International STAR-LAB System Description for


IberSPEECH-RTVE 2018 Speaker Diarization Challenge
Diego Castan, Mitchell McLaren, Mahesh Kumar Nandwana

Speech Technology and Research Laboratory, SRI International, California, USA


{diego.castan, mitchell.mclaren, maheshkumar.nandwana}@sri.com

Abstract 16k or higher data (74,447 files) to 8k before up-sampling to


16k, which we have found to allow the embeddings DNN to
This document describes the submissions of STAR-LAB (the generalize across the 8-16 kHz bandwidth range and better ac-
Speech Technology and Research Laboratory at SRI Interna- commodate processing of telephone signals.
tional) to the open-set condition of the IberSPEECH-RTVE
We augmented the raw speaker embeddings training data
2018 Speaker Diarization Challenge. The core components
(counting the 16k re-sampled to 8k as raw data) to produce 2
of the submissions included noise-robust speech activity detec-
copies per file per degradation type (random selection of spe-
tion, speaker embeddings for initializing diarization with do-
cific degradation) such that the data available for training was
main adaptation, and Variational Bayes (VB) diarization using
9-fold the original amount. In total, this was 2,778,615 files
a DNN bottleneck i-vector subspaces.
for training the speaker embedding DNNs. For PLDA train-
ing used for clustering embeddings [6], the same degraded data
1. Introduction (excluding the 8k simulated data) was subsampled by a random
SRI International has long focused on the task of speaker recog- factor of 6 in order to make PLDA training data manageable
nition, but has only recently branched into the field of speaker and resulted in 343,535 files from 11,461 speakers. Finally,
diarization. For the STAR-LAB submissions, we leveraged the the databases used to train the UBM and the total variability
embeddings and diarization systems that we recently developed subspaces were Fisher, Switchboard and AMI PRISM (NIST
for the NIST 2018 Speaker Recognition Evaluation [1]. In SRE04-08).
addition, we attempt to leverage recent work in speech activ-
ity detection and in speaker embeddings for speaker recogni- 3. The STAR-LAB System Submissions
tion [2, 3, 4], the well-known Variational Bayes (VB) approach
to diarization [5], and the use of a DNN bottleneck based i- We start with a general overview of the submissions prior to
vector subspaces internal to the VB process. We describe the breaking down into details of each module used in the system.
three systems submitted for the open-set condition based on a Figure 1 shows a block diagram of the different parts of the
hybrid embeddings-VB approach with different parameters for system. All our submissions use embeddings clustering as seed
the speech activity detection system. to VB diarization. The differences between the primary and the
contrastive submissions are the thresholds applied for the SAD
2. System Training Data decisions as detailed in Section 3.2

System training data included 234,288 signals from 14,630


3.1. Acoustic Features and Bottleneck i-vector extraction
speakers. This data was compiled from NIST SRE 2004-2008,
NIST SRE 2012, Mixer6, Voxceleb1, and Voxceleb2 (train set) To train the DNN we used Power-Normalized Cepstral Coeffi-
data. Voxceleb1 data had 60 speakers removed that overlapped cients (PNCC) [7] with 30 dimensions extracted from a band-
with Speakers in the Wild (SITW), according to the publicly width of 100-7600 Hz using 40 filters. Root compression of
available list1 . 1/15 was applied. All audio was sampled (up or down sam-
Augmentation of data was applied using four categories pled where needed) to 16 kHz prior to processing. The features
of degradations as in [4], including music, and noise at 5 dB were first processed with mean and variance normalization over
signal-to-noise ratio, compression, and low levels of reverb. We a sliding window of 3 seconds prior to having SAD applied.
used 412 noises of at least 15 seconds in length compiled from
We used Mel Frequency Cepstral Coefficients for speech
both freesound.org and the MUSAN corpus. Music degrada-
activity detection and as input to a DNN to extract 80-
tions were sourced from 645 files from MUSAN, and 99 instru-
dimensional bottleneck (BN) features. This BN DNN was
mental pieces purchased from Amazon music. For reverbera-
trained for the task of automatic speech recognition. The
tion, examples were collected from 47 real impulse responses
DNN was trained to predict 1933 English tied tri-phone states
available on echothief.com, and 400 low-level reverb signals
(senones). MFCCs were used for input to the DNN after trans-
sourced from MUSAN. The random selection of reverb signals
forming them with a pcaDCT transform [8] trained on Fisher
gave almost 10x weight to the echothief.com examples in order
data, and restricting the output dimension to 90. The DNN con-
to balance data sources. Compression was applied using 32 dif-
sisted of 5 hidden layers of 600 nodes, except the last hidden
ferent codec-bitrate combinations with open source tools such
layer which was 80 nodes and formed the bottleneck layer from
as FFmpeg, codec2, Speex, GSM, and opus trans-coding pack-
with activations were extracted as features.
ages. In addition to these augmentations, we down-sampled any
BN features were used to train an i-vector extractor [9]
1 http://www.openslr.org/resources/49/ consisting of a 2048-component universal background model
voxceleb1_sitw_overlap.txt (UBM) with diagonal covariance and a subspace of rank 400.

208 10.21437/IberSPEECH.2018-42
Figure 1: Flow diagram of components used in the STAR-LAB team submissions to the IberSPEECH-RTVE 2018 Speaker Diarization Challenge.

3.2. Speech Activity Detection

We used a DNN-based Speech Activity Detection (SAD) model leveraging short-term normalization. The SAD model was trained on clean telephone and microphone data from a random selection of files from the provided Mixer datasets (2004-2008), Fisher, Switchboard, Mixer6, SRE'18 unlabeled and SRE'16 unlabeled data. A 5 minute DTMF tone (acquired from YouTube), and a selection of noise and music samples with and without speech added, were added to the pool of data. In all, 11,668 files were used to train the SAD model.

The system uses 19-dimensional MFCC features, which excluded C0 and used 24 filters over a bandwidth of 200-3300 Hz. These features were mean and variance normalized using a sliding window of 3 seconds, and concatenated over a window of 31 frames. The resulting 620-dimensional feature vector formed the input to a DNN which consisted of two hidden layers of sizes 500 and 100. The output layer of the DNN consisted of two nodes trained to predict the posteriors for the speech and non-speech classes. These posteriors are converted into likelihood ratios using Bayes rule (assuming a prior of 0.5), and thresholded at a value of -1.5, -2.0 and -3.0 for the primary and both contrastive systems, respectively. A padding of 0.5 seconds was applied over the final segmentation to smooth the transitions between speech/non-speech.

We applied cross-talk removal on all interview data from the NIST SRE corpora to suppress the interviewer speech that bled through to the target speaker channel. This was especially important for distant microphone channels in which each speaker had similar energy. Cross-talk removal involved using the SAD log-likelihood ratios (LLRs) from the target microphone as well as the close-talking interviewer microphone, and removing any detected speech from the target channel that was detected in the interviewer channel with an LLR more than 3.5 above that of the target channel.

For all augmented system training data, the SAD alignments from the raw audio were used rather than running SAD on the degraded signals directly, as done in [4].

3.3. Embeddings VB Initialization

Recent work in [2, 3] has shown significant advances in the related field of speaker recognition by replacing the well-known i-vector extraction process with speaker embeddings extracted from a DNN trained to directly discriminate speakers. We decided to apply our findings on what makes a good speaker embeddings extractor [4] to the task of speaker clustering.

We used a multi-bandwidth speaker embedding DNN in our submission. Speaker embedding DNNs were trained following the protocol of [4]. Specifically, Kaldi was used to generate examples for training the DNN with a duration ranging between 2.0 and 3.5 seconds of speech. DNNs were trained using Tensorflow over 6 epochs using a mini batch size of 128 examples, and a dropout probability linearly increasing to 10% then back to 0% in the final 2 epochs. The embeddings network starts with five frame-level hidden layers, all using rectified linear unit (ReLU) activation and batch normalization. The first three layers incrementally add time context with stacking of [-2,-1,0,1,2], [-2,0,2], and [-3,0,3] instances of the input frame. A statistics pooling layer then stacks the mean and standard deviation of the frames per audio segment, resulting in a 3000-dimensional segment-level representation. The final two hidden layers of 512 nodes operate at the segment level and use ReLU activation and batch normalization prior to the output layer, which targets speaker labels for each audio segment using log softmax as the output. The embeddings are extracted from the first segment-level hidden layer of 512 nodes. This system used PLDA classification for clustering after applying an LDA dimensionality reduction. We applied length and mean normalization to embeddings prior to use in PLDA. As a simple method of domain adaptation, we mean normalized the chunked embeddings from an audio file using the mean of all chunks.

The embeddings VB initialization process was performed as follows. The audio was first segmented into 1.5 second segments with a 0.2 second shift. Following a similar strategy to VB diarization, we initialized a speaker cluster posterior matrix, q, for 13 speakers. The number of speakers was selected from previous experiments over the development data. We calculated, for each speaker cluster, a weighted-average embedding based on q and the 1.5 s embedding segments. These per-cluster embeddings were compared using PLDA against each individual embedding segment. We scaled the log-likelihood ratios (LLRs) that resulted from PLDA by 0.05 and performed Viterbi decoding of the LLRs to produce a new q and speaker priors. This process was iterated 10 times before using the resulting q and speaker priors in the subsequent VB diarization based on BN+MFCC features.

3.4. Variational Bayes diarization

Our diarization approach was based on the work of [10]. This approach uses an i-vector subspace to produce a frame-level diarization output. Our i-vector subspace was trained using concatenated BN and MFCC features [11], resulting in a feature with 140 dimensions. With VB diarization, we used a left-to-right HMM structure of three states per speaker in order to smooth the transitions between speakers, as proposed in [12].

The initialization of the VB diarization approach is done with the speaker posteriors estimated from the speaker embeddings initialization. We performed a maximum of 20 iterations of VB diarization.

4. Results

This section compares the development performance of the STAR-LAB submissions.

Table 1 shows the system performances for the primary and the two contrastive systems on the development set.
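The embeddings-based initialization of Section 3.3 can be summarised with the following hedged numpy sketch. To keep it self-contained, PLDA scoring is replaced by cosine similarity and Viterbi decoding by a per-segment softmax; only the segment granularity, the cluster count of 13, the LLR scale of 0.05 and the 10 iterations come from the text, everything else is an assumption.

    import numpy as np

    def init_clusters_from_embeddings(emb, n_spk=13, n_iter=10, llr_scale=0.05, rng=None):
        # emb: one row per 1.5 s chunk embedding; returns posteriors q and priors.
        rng = np.random.default_rng(0) if rng is None else rng
        n_seg = emb.shape[0]
        q = rng.dirichlet(np.ones(n_spk), size=n_seg)      # random initial posteriors
        unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        for _ in range(n_iter):
            # Weighted-average embedding per cluster, weights taken from q.
            centers = (q.T @ unit) / q.sum(axis=0)[:, None]
            centers /= np.linalg.norm(centers, axis=1, keepdims=True)
            scores = unit @ centers.T                      # stand-in for PLDA LLRs
            scaled = llr_scale * scores                    # the 0.05 scaling of the text
            # Per-segment posterior update (the submitted system ran Viterbi here).
            scaled -= scaled.max(axis=1, keepdims=True)
            q = np.exp(scaled)
            q /= q.sum(axis=1, keepdims=True)
        return q, q.mean(axis=0)

    if __name__ == "__main__":
        fake_emb = np.random.default_rng(1).normal(size=(200, 512))
        q, priors = init_clusters_from_embeddings(fake_emb)
        print(q.shape, priors.round(3))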

Table 1: Development results for each system submission.

    System Name      Miss Sp   FA Sp   SpkErr   DER
    Primary           2.1%     1.9%    13.4%    17.38%
    Contrastive 1     2.0%     2.3%    13.4%    17.81%
    Contrastive 2     2.0%     2.9%    12.2%    17.08%

Table 2: Computational requirements of STAR-LAB submissions based on RT factor (higher than 1.0 is slower than real time) and maximum resident memory needed to diarize 10 minutes of the development file millennium-20170522.

    System           xRT    Max. Res. RAM
    Primary          1.19   3.6G
    Contrastive 1    1.52   3.6G
    Contrastive 2    1.28   3.6G

5. Computation

We benchmarked the computational requirements of the STAR-LAB system on a single core. The machine was an Intel Xeon E5630 processor operating at 2.53 GHz. The approximate processing speed and resource requirements are listed in Table 2. These calculations are based on total CPU time divided by the total duration of the audio.

6. Acknowledgments

We would like to thank Lukas Burget and the BUT team for their Python implementation of Variational Bayes diarization, which was leveraged in this work [10].

7. References

[1] M. McLaren, L. Ferrer, D. Castan, M. Nandwana, and R. Travadi, "The SRI-CON-USC NIST 2018 SRE system description," in NIST 2018 Speaker Recognition Evaluation, 2018.
[2] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
[3] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech, 2017, pp. 999–1003.
[4] M. McLaren, D. Castan, M. K. Nandwana, L. Ferrer, and E. Yilmaz, "How to train your speaker embeddings extractor," in Speaker Odyssey, 2018, pp. 327–334.
[5] P. Kenny, "Bayesian analysis of speaker diarization with eigenvoice priors," Tech. Rep., CRIM, 2008.
[6] S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Proc. IEEE International Conference on Computer Vision. IEEE, 2007, pp. 1–8.
[7] C. Kim and R. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in Proc. ICASSP, 2012, pp. 4101–4104.
[8] M. McLaren and Y. Lei, "Improved speaker recognition using DCT coefficients as features," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4430–4434.
[9] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. on Speech and Audio Processing, vol. 19, pp. 788–798, 2011.
[10] "VB diarization with eigenvoice and HMM priors," 2013, http://speech.fit.vutbr.cz/software/vb-diarization-eigenvoice-and-hmm-priors.
[11] M. McLaren, Y. Lei, and L. Ferrer, "Advances in deep neural network approaches to speaker recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4814–4818.
[12] M. Diez, L. Burget, and P. Matejka, "Speaker diarization based on Bayesian HMM with eigenvoice priors," in Proc. Odyssey 2018: The Speaker and Language Recognition Workshop, 2018, pp. 147–154.
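The numbers in Table 2 are obtained as total CPU time divided by audio duration, together with the peak resident memory of the process. A small, generic sketch of such a measurement (not the STAR-LAB benchmarking code) is shown below; diarize_fn is a hypothetical placeholder for the full pipeline, and the resource module assumes a Unix system reporting ru_maxrss in kilobytes.

    import resource
    import time

    def xrt_and_peak_ram(diarize_fn, audio_duration_s):
        # Return (real-time factor, peak resident memory in GB) for one run.
        t0 = time.process_time()            # CPU time, matching CPU time / audio duration
        diarize_fn()
        cpu_s = time.process_time() - t0
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
        return cpu_s / audio_duration_s, peak_kb / (1024 ** 2)

    if __name__ == "__main__":
        xrt, ram_gb = xrt_and_peak_ram(lambda: sum(i * i for i in range(10 ** 6)),
                                       audio_duration_s=600.0)
        print("xRT=%.3f, peak RAM=%.2f GB" % (xrt, ram_gb))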

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

ODESSA at Albayzin Speaker Diarization Challenge 2018


Jose Patino¹, Héctor Delgado¹, Ruiqing Yin², Hervé Bredin², Claude Barras³ and Nicholas Evans¹

¹ EURECOM, Sophia Antipolis, France
² LIMSI, CNRS, Université Paris-Saclay, Orsay, France
³ LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, Orsay, France
¹ firstname.lastname@eurecom.fr, ²,³ firstname.lastname@limsi.fr

Abstract

This paper describes the ODESSA submissions to the Albayzin Speaker Diarization Challenge 2018. The challenge addresses the diarization of TV shows. This work explores three different techniques to represent speech segments, namely binary key, x-vector and triplet-loss based embeddings. While training-free methods such as the binary key technique can be applied easily to a scenario where training data is limited, the training of robust neural-embedding extractors is considerably more challenging. However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better diarization performance. The paper also reports our efforts to improve speaker diarization performance through system combination. For systems with a common temporal resolution, fusion is performed at segment level during clustering. When the systems under fusion produce segmentations with an arbitrary resolution, they are combined at diarization hypothesis level. Both approaches to fusion are shown to improve diarization performance.

Index Terms: speaker diarization, diarization fusion, neural embeddings, binary key

1. Introduction

Albayzin evaluations cover a range of speech processing related tasks that include search on speech, audio segmentation, speech-to-text transcription, or speaker diarization. The latter motivates the work reported in this paper. Speaker diarization is the task of processing an audio stream into speaker-homogeneous clusters. Traditionally considered as an enabling technology, a number of potential applications can benefit from speaker diarization as a pre-processing step, such as automatic speech recognition [1], speaker recognition and identification [2], or spoken document retrieval. The increasing maturity of these technologies calls for continuous improvement in speaker diarization, which has accordingly been the objective of numerous evaluations and campaigns, be it in the context of the NIST RT evaluations [3], the more recent multi-domain DIHARD challenge [4], or the well-established Albayzin Speaker Diarization evaluations [5]. The current edition of the Albayzin Speaker Diarization Challenge [6] includes audio content from the recently released RTVE2018 database [7], composed of TV shows on a range of topics broadcast on the Spanish public TV network. Further details can be found in [6, 7].

This work has been produced in the context of the ODESSA¹ project, which is focused on improving speaker diarization performance by leveraging recent developments in the task of text-independent speaker recognition. Efforts were made by the different members of the consortium to improve the reliability of their own speaker diarization systems. In doing so, different speaker modelling techniques were used, on both the closed- and open-set conditions of the evaluation. For the closed-set condition, binary key speaker modelling [8] offers a training-free option that has produced competitive performance in previous editions of the challenge [9]. Triplet-loss neural embeddings [10] trained on the provided data were also explored. Experiments in the open-set condition include embeddings in the form of the state-of-the-art text-independent speaker recognition x-vector [11]. Different clustering techniques were also explored across the training conditions and speaker modelling techniques.

¹ This work was partly supported by ANR through the ODESSA project (ANR-15-CE39-0010).

Finally, and motivated by the access to significantly different approaches to speaker modelling and diarization that could potentially offer complementary solutions, the main contribution of this work lies in the exploration of different fusion techniques for speaker diarization. Whereas fusion, for example at score level, is often applied as a means of increasing robustness in closely related tasks such as speaker recognition, the problem of merging clustering solutions for speaker diarization scenarios remains challenging. Diarization hypothesis level fusion is applied following a label merging approach [12]. A segment-level fusion technique similar to that employed in [13] is also tested.

The remainder of this paper is structured as follows. Section 2 details the processing blocks which compose the different diarization solutions. Section 3 describes the two fusion approaches. Sections 4 and 5 report the experimental setup and the submitted systems with results, respectively. Finally, Sections 6 and 7 provide discussions and conclusions.

2. Processing modules

This section reviews the different processing modules composing our diarization systems. These include feature extraction, speech activity detection, segmentation, segment and cluster representation, clustering and re-segmentation. As shown in Figure 1, one or more techniques are proposed for each processing module.

2.1. Feature extraction

Two different acoustic frontends were used. They include (i) a standard Mel-frequency cepstral coefficient (MFCC) [14] frontend and (ii) an infinite impulse response - constant Q, Mel-frequency cepstral coefficient (ICMC) [15] frontend. The latter has been applied successfully to tasks including speaker recognition, utterance verification [15] and speaker diarization [9, 16]. These features are similar to MFCC, but they replace the short-time Fourier transform by an infinite-impulse response, constant Q transform (IIR-CQT) [17].

Figure 1: Diarization pipeline adopted by the proposed individual systems (feature extraction: MFCC or ICMC; speech activity detection: Bi-LSTM; segmentation: 1-second or Bi-LSTM; segment/cluster representation: binary key, triplet-loss embeddings or x-vectors; clustering: AHC or AP; GMM resegmentation; diarization hypothesis).

This is a richer, multi-resolution time-frequency representation for audio signals, which provides a greater frequency resolution at lower frequencies and a higher time resolution at higher frequencies.

2.2. Speech activity detection and segmentation

All submissions share a common speech activity detection (SAD) module [18], where SAD is modelled as a supervised binary classification task (speech vs. non-speech), and addressed as a frame-wise sequence labelling task using a bi-directional long short-term memory (LSTM) network operating on MFCC features. As for segmentation, two systems were explored: (i) a straightforward uniform segmentation which splits speech content into 1 second segments and (ii) segmentation via the detection of speaker change points. The speaker change detection (SCD) module is that proposed in [19]. Similarly to the SAD module, SCD is also modelled here as a supervised binary sequence labelling task (change vs. non-change).

2.3. Segment/cluster representation

Binary key. This technique was initially proposed for speaker recognition [8, 20] and applied to speaker diarization [21, 9, 16]. It represents speech segments as low-dimensional, speaker-discriminative binary or integer vectors, which can be clustered using some sort of similarity measure. The core model to perform this mapping is a binary key background model (KBM), which is trained on the test segment before diarization. The KBM is actually a collection of diagonal-covariance Gaussian models selected from a pool of Gaussians learned on a sliding window over the test data. The window rate is adjusted dynamically to assure a minimum number of Gaussians. Then, a selection process is performed to keep a percentage p of the Gaussians in the pool to ensure sufficient coverage of all the speakers in the test audio stream. The KBM is then used to binarise an input sequence of acoustic features, which are then accumulated to obtain a cumulative vector, which is the final representation. Refer to [21] for more details.

Triplet-loss neural embedding. The embedding architecture used is the one introduced in [10] and further improved in [22]. In the embedding space, using the triplet loss paradigm, two sequences xi and xj of the same speaker (resp. two different speakers) are expected to be close to (resp. far from) each other according to their angular distance.

x-vector. This method [11] uses a deep neural network (DNN) which maps variable length utterances to fixed-dimensional embeddings. The network consists of three main blocks. The first is a set of layers which implements a time-delay neural network (TDNN) [23] which operates at the frame level. The second is a statistics pooling layer that collects statistics (mean and variance) at the utterance level. Finally, a number of fully connected layers are followed by the output layer with as many outputs as speakers in the training data. Neurons of all layers use ReLU activations except the output layer neurons, which use soft-max. The network is trained to discriminate between speakers in the training set. Once trained, the network is used to extract utterance-level embeddings for utterances from unseen speakers. The embedding is just the output of one of the fully connected layers after the statistics pooling layer.

2.4. Clustering

Agglomerative hierarchical clustering. The AHC clustering uses a bottom-up agglomerative clustering algorithm as follows. First, and assuming that the input audio stream is represented as a matrix of segment-level embeddings, a number of clusters M_init are initialised by a uniform splitting of the segment-level embedding matrix. Cluster embeddings are estimated as the mean of the segment embeddings. An iterative process including (i) segment-to-cluster assignment, (ii) closest cluster pair merging and (iii) cluster embedding re-estimation by averaging embeddings of cluster members is then applied. All comparisons are performed using the cosine similarity between embeddings. The clustering solutions generated after (i) are stored at every iteration. The output solution is selected by finding a trade-off between the number of clusters and the within-class sum of squares (WCSS) among all solutions. This is accomplished through an elbow criterion, as described in [21].

Affinity propagation. As proposed in [24], an affinity propagation (AP) algorithm [25] is our second clustering method. In contrast to other clustering approaches, AP does not require a prior choice of the number of clusters. All speech segments are potential cluster centres (exemplars). Taking as input the pair-wise similarities between all pairs of speech segments, AP will select the exemplars and associate all other speech segments to an exemplar. In our case, the similarity between the i-th and j-th speech segments is the negative angular distance between their embeddings.

2.5. Re-segmentation

A resegmentation process is performed to refine the time boundaries of the segments generated in the clustering step. It uses Gaussian mixture models (GMM) to model the clusters, and maximum likelihood scoring at feature level. Since the log-likelihoods at frame level are noisy, an average smoothing within a sliding window is applied to the log-likelihood curves obtained with each cluster GMM. Then, each frame is assigned to the cluster which provides the highest smoothed log-likelihood.

3. System fusion

Two approaches to fusion were explored. The first operates at the similarity matrix level and is suited to combining speaker diarization systems that are aligned at the segment level. The second operates at the hypothesis level and can be applied to systems with arbitrary segment resolutions.
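Both fusion operations can be sketched in a few lines of Python. The snippet below only illustrates the two ideas (weighted similarity-matrix combination and frame-level label concatenation with a minimum-duration filter); the weighting inside the AHC iterations and the final resegmentation of the real systems are omitted, and all names and the example frame rate are hypothetical.

    import numpy as np

    def fuse_similarity_matrices(sim_a, sim_b, alpha=0.5):
        # Weighted sum of two segment-to-cluster (or cluster-to-cluster)
        # similarity matrices of identical shape, as used inside AHC.
        return alpha * sim_a + (1.0 - alpha) * sim_b

    def fuse_hypotheses(labels_a, labels_b, frame_rate=100, min_dur_s=15.0):
        # Frame-level label concatenation of two diarization hypotheses.
        # Joint clusters shorter than min_dur_s are dropped (None) and would be
        # recovered by the final resegmentation pass in the full system.
        joint = [a + "|" + b for a, b in zip(labels_a, labels_b)]
        counts = {}
        for lab in joint:
            counts[lab] = counts.get(lab, 0) + 1
        min_frames = int(min_dur_s * frame_rate)
        return [lab if counts[lab] >= min_frames else None for lab in joint]

    if __name__ == "__main__":
        a = ["A1"] * 2000 + ["A2"] * 2000          # 40 s at 100 frames/s
        b = ["B1"] * 1500 + ["B2"] * 2500
        kept = {lab for lab in fuse_hypotheses(a, b) if lab is not None}
        print(sorted(kept))                        # ['A1|B1', 'A2|B2']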

Figure 2: Illustration of the segment-to-cluster similarity matrix fusion (cosine similarities between segment embeddings and the cluster embeddings of systems A and B yield M-by-N similarity matrices that are combined with weights α and 1-α before segment-to-cluster assignment).

Figure 3: Illustration of the fusion of two diarization hypotheses (frame-level labels of systems A and B are concatenated into joint labels such as A1B1, A2B1 and A2B2).

Fusion at similarity matrix level. Systems sharing the same segmentation can be combined at the similarity level. In [13], fusion is performed by the weighted sum of the similarity matrices of two segment-aligned systems before a linkage agglomerative clustering. This approach was adapted to our AHC algorithm by combining segment-to-cluster and cluster-to-cluster similarity matrices at every iteration. Similarities are also combined in the WCSS computation for the best clustering selection procedure. In this way, the full process takes into account the influence of the systems being fused. An example of the combination of two M-cluster to N-segment similarity matrices using weights α and 1-α is depicted in Figure 2.

Fusion at hypothesis level. The combination of systems with totally different diarization pipelines is generally only possible at hypothesis level. In this work we explored hypothesis-level combination using the approach described in [12]. Given a set of diarization hypotheses, every frame-level decision can be merged to assign a new frame-level cluster label which is the concatenation of all labels of the individual hypotheses. An example of this strategy is illustrated in Figure 3. This process will result in a large set of potential speaker clusters. Clusters shorter than 15 seconds are excluded and a final resegmentation is applied on the merged diarization to obtain the final diarization hypothesis.

4. Experimental setup

This section gives details of the training data and the configuration of the different modules.

4.1. Training data

For the closed-set condition, the 3/24 channel database of around 87 hours of TV broadcast programmes in Catalan provided by the organisers was used. For the open-set condition, two popular datasets were used:

SRE-data. It includes several datasets released over the years in the context of the NIST speaker recognition evaluations (SRE), namely SRE 2004, 2005, 2006, 2008 and 2010, Switchboard, and Mixer 6. This dataset contains mostly telephone speech sampled at 8 kHz.

VoxCeleb. The VoxCeleb1 dataset [26] consists of videos containing more than 100,000 utterances for 1,251 celebrities extracted from YouTube videos. The speakers represent a wide range of different ethnicities, accents, professions and ages, and a large range of acoustic environments. This dataset is sampled at 16 kHz.

4.2. System configuration

Feature extraction. MFCCs are extracted with different numbers of coefficients depending on the subsequent segment representation: 23 static coefficients for x-vector, and 19 plus energy augmented with their first and second derivatives for triplet-loss embeddings. The binary key system uses 19 static ICMC features. Finally, the re-segmentation stage uses 19 static MFCC features.

Segment representation. For BK, the cumulative vector dimension is set to p = 40% of the size of the initial pool of Gaussian components in the KBM, leading to different representation dimensions which depend on the length of the test audio file. Gaussians are learned on a sliding window of 2 seconds to conform a pool with a minimum size set to 1024. The x-vector system uses the configuration employed in the Kaldi recipe for the SRE 2016 task (https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2). Data augmentation by means of additive and convolutive noise is performed for training. The dimension of the embeddings is 512, which was later reduced to 170 using LDA. For triplet-loss embeddings, and because of the lack of global identities in the Albayzin dataset, triplets are only sampled from intra-files for the closed-set condition. Thanks to the given speaker names in Voxceleb, triplets are also sampled from inter-files for the open-set condition.

Clustering. AHC is initialised to a number N_init of clusters higher than the number of expected clusters in the test sessions. We set N_init = 30. The parameters of AP clustering, such as preference and damping factor, are tuned on the development set with the chocolate toolkit (https://chocolate.readthedocs.io/).

Resegmentation. It is performed with GMMs with 128 diagonal-covariance components. Likelihoods are smoothed by a sliding window of 1 s.

4.3. Evaluation

Performance is assessed and optimised using the diarization error rate (DER), a standard metric for this task. It is defined as DER = E_spkr + E_FA + E_MS, where E_FA and E_MS are factors mainly associated with the SAD module, namely the false alarm and missed speech error rates. E_MS is also incremented by the percentage of overlapping speech present in the audio sessions that is insufficiently assigned to a single speaker. Our systems do not contain any purpose-specific module to detect such segments. Finally, E_spkr is the error related to speech segments that are associated to the wrong speaker identities.
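As a rough illustration of the metric, the toy sketch below computes a frame-level DER with an optimal speaker mapping. It deliberately ignores the 0.25 s forgiveness collar and the overlapped-speech handling of the official md-eval scoring, so the numbers it produces are only indicative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def frame_level_der(ref, hyp):
        # ref and hyp are equal-length frame label sequences; None marks non-speech.
        n = len(ref)
        speech_ref = [x is not None for x in ref]
        speech_hyp = [x is not None for x in hyp]
        miss = sum(1 for i in range(n) if speech_ref[i] and not speech_hyp[i])
        fa = sum(1 for i in range(n) if speech_hyp[i] and not speech_ref[i])
        both = [i for i in range(n) if speech_ref[i] and speech_hyp[i]]
        r_ids = sorted({ref[i] for i in both})
        h_ids = sorted({hyp[i] for i in both})
        overlap = np.zeros((len(r_ids), len(h_ids)))
        for i in both:
            overlap[r_ids.index(ref[i]), h_ids.index(hyp[i])] += 1
        rows, cols = linear_sum_assignment(-overlap)       # best speaker mapping
        confusion = len(both) - overlap[rows, cols].sum()
        return (miss + fa + confusion) / max(sum(speech_ref), 1)

    if __name__ == "__main__":
        ref = ["s1"] * 50 + ["s2"] * 50 + [None] * 20
        hyp = ["a"] * 40 + ["b"] * 70 + [None] * 10
        print(round(frame_level_der(ref, hyp), 3))         # 0.2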

Table 1: Summary of ODESSA primary (P) and contrastive (C1/C2) submissions for the closed- and open-set conditions (denoted by the c and o subscripts, respectively), including feature extraction, segmentation and training data used, segment representation, clustering and fusion. Performance (DER, %) is shown in the last column.

    Condition   Sys.   Features   Segmentation   Segment rep. / train data   Clustering   Fusion                       DER
    Closed      Pc     -          -              -                           -            C1c, C2c, hyp-level          10.17
                C1c    ICMC       1-second       BK / -                      AHC          -                            12.33
                C2c    MFCC       BiLSTM         EMB / 3/24 data             AP           -                            14.10
    Open        Po     -          1-second       -                           AHC          C1c, C1o, C2o, sim-level⁵    7.21
                C1o    MFCC       1-second       x-vector / SRE-data         AHC          -                            9.29
                C2o    MFCC       BiLSTM         EMB / Voxceleb              AP           -                            11.46

It is generally the main source of error in a speaker diarization system. To compute DER, a standard forgiveness collar of 0.25 s is applied around all reference segment boundaries.

Systems were tuned on approximately 15 hours of audio available in the 12 sessions that compose the partition dev2 of the RTVE2018 database. These sessions belong to the Spanish TV shows "La noche en 24h" and "Millenium", of roughly 1 h of duration and an average of 14 speakers per session. For further details please refer to [7].

5. Submitted systems and results

Table 1 summarises the ODESSA submissions and reports their performance on the RTVE2018 development set⁴. All systems share the same speech activity detection module, implying that the speech/non-speech segmentation is identical for each system. The segmentation error rate was 1.9%, composed of a missed speech rate of 0.3% and a 1.6% false alarm speech rate. ODESSA submitted systems to both closed- and open-set conditions.

Closed-set condition. C1c uses ICMC features, 1-second uniform segmentation, BK representation and AHC clustering, while C2c uses MFCC features, BiLSTM-based SCD, triplet-loss neural embedding representation (EMB) and AP clustering. The DERs on the development set were 12.33% and 14.10% for systems C1c and C2c, respectively. Our primary system Pc is the fusion at diarization hypothesis level of C1c and C2c. Since these two systems use different segmentations, this combination method was found to be the most convenient. The combined hypothesis decreases the DER to 10.17%.

Open-set condition. In the open-set condition, ODESSA aimed to analyse how systems can benefit from greater amounts of training data. The SRE and Voxceleb databases were used for this purpose. Contrastive system C1o used MFCC features, 1-second uniform segmentation, x-vector representation trained on SRE data and AHC clustering. System C2o aligns with the closed-set C2c system, but with the training data replaced by the Voxceleb data. The DERs on the development set are 9.29% and 11.46%, respectively. The primary submission Po is the combination at similarity matrix level of three systems, namely C1c (closed-set condition), C1o and C2o⁵. To speed up the tuning of optimal weights for three systems, we first tuned α for the fusion of C1o and C1c, and then β for the fusion of the previously fused C1o-C1c and C2o. Weights were set to α = β = 0.98. This combined system led to the best performance on the development set, with a DER of 7.21%.

⁴ At the time of submission, neither results on the evaluation set nor the ground-truth labels were available.
⁵ In the fused system, C2o was modified to use the 1-second segmentation and AHC clustering approaches to enable the alignment with C1o and C1c.

6. Discussion

Open- vs. closed-set conditions. As shown in Table 1, open-set systems outperform closed-set ones on the development set. This is somewhat expected since neural approaches usually benefit from large amounts of data. Apart from differences in the amount of training data, the absence of global speaker IDs in the closed-set training data is also likely to influence performance. Our attempt to train an x-vector extractor for the closed-set condition was unfruitful, and performance of the triplet-loss embedding extractor was significantly worse than when using external training data (system C2c vs. C2o). In this situation, the simpler, training-free binary key approach turned out to be the single best performing system (C1c), showing the potential of such techniques when training data is sparse or unavailable.

Segmentation. It is not clear if a dedicated module for speaker turn detection brings significant benefits compared to simple, straightforward uniform segmentation of the input stream. A volume of work reports the relative success of both explicit approaches to SCD [27, 28, 24] and uniform segmentation approaches [13, 29, 16]. In this work we found that some of the speaker representations work better with one of the two strategies, with the uniform approach being better suited to BK and x-vector, and SCD to the triplet-loss embeddings.

System combination. In the closed-set condition, and because of the segmentation mismatch of our best performing single systems, they could not be combined at the similarity matrix level. Hence, the most obvious combination was at hypothesis level. The label combination procedure followed by a re-segmentation resulted in a lower DER than those of the two individual systems. In the open-set condition three subsystems used the same segmentation, and hence they could be fused at the similarity matrix level. This enabled the clustering to be performed jointly by considering the contributions of the three subsystems in parallel along the complete diarization process.

7. Conclusions

This paper reports the participation of the ODESSA team in the Albayzin Speaker Diarization Challenge 2018. As a consortium, our main interest was in the combination of multiple approaches to diarization. We assessed the effectiveness of two fusion strategies, namely at similarity matrix level and at diarization hypothesis level, which allowed the combination of segment-time-aligned and arbitrarily time-aligned diarization algorithms, respectively. We also found the use of appropriate and abundant training data to be critical to the learning of robust embeddings, while training-free approaches are demonstrated to be adequate in the absence of suitable training data. For future work, and together with other classical challenges such as the problem of overlapping speakers, the impact of speaker change detection should be further investigated.

8. References

[1] A. Stolcke, G. Friedland, and D. Imseng, "Leveraging speaker diarization for meeting recognition from distant microphones," in Proc. ICASSP. IEEE, 2010, pp. 4390–4393.
[2] J. Patino, R. Yin, H. Delgado, H. Bredin, A. Komaty, G. Wisniewski, C. Barras, N. Evans, and S. Marcel, "Low-latency speaker spotting with online diarization and detection," in Proc. Odyssey 2018, 2018, pp. 140–146.
[3] "NIST Rich Transcription Evaluation," 2009. [Online]. Available: https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
[4] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First DIHARD Challenge Evaluation Plan," 2018, https://zenodo.org/record/1199638.
[5] A. Ortega, I. Viñals, A. Miguel, and E. Lleida, "The Albayzin 2016 Speaker Diarization Evaluation," in Proc. IberSPEECH, 2016.
[6] A. Ortega, I. Viñals, A. Miguel, E. Lleida, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, "Albayzin Evaluation: IberSPEECH-RTVE 2018 Speaker Diarization Challenge," 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/EvalPlan-SpeakerDiarization-v1.3.pdf
[7] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, and A. de Prada, "RTVE2018 Database Description," 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf
[8] X. Anguera and J.-F. Bonastre, "A novel speaker binary key derived from anchor models," in Proc. INTERSPEECH, 2010, pp. 2118–2121.
[9] J. Patino, H. Delgado, N. Evans, and X. Anguera, "EURECOM submission to the Albayzin 2016 Speaker Diarization Evaluation," in Proc. IberSPEECH, Nov 2016.
[10] H. Bredin, "TristouNet: Triplet Loss for Speaker Turn Embedding," in Proc. ICASSP, New Orleans, USA, March 2017.
[11] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in Proc. ICASSP, April 2018, pp. 5329–5333.
[12] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, and L. Besacier, "Step-by-step and integrated approaches in broadcast news speaker diarization," Computer Speech & Language, vol. 20, no. 2-3, pp. 303–330, 2006.
[13] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge," in Proc. INTERSPEECH, 2018, pp. 2808–2812.
[14] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug 1980.
[15] H. Delgado, M. Todisco, M. Sahidullah, A. K. Sarkar, N. Evans, T. Kinnunen, and Z. H. Tan, "Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification," in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 179–185.
[16] J. Patino, H. Delgado, and N. Evans, "The EURECOM Submission to the First DIHARD Challenge," in Proc. INTERSPEECH 2018, 2018, pp. 2813–2817. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2172
[17] P. Cancela, M. Rocamora, and E. López, "An efficient multi-resolution spectral transform for music analysis," in Proc. ISMIR, 2009, pp. 309–314.
[18] G. Gelly and J.-L. Gauvain, "Minimum Word Error Training of RNN-based Voice Activity Detection," in Proc. INTERSPEECH, 2015, pp. 2650–2654.
[19] R. Yin, H. Bredin, and C. Barras, "Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks," in Proc. INTERSPEECH, Stockholm, Sweden, August 2017. [Online]. Available: https://github.com/yinruiqing/change_detection
[20] G. Hernandez-Sierra, J. R. Calvo, J.-F. Bonastre, and P.-M. Bousquet, "Session compensation using binary speech representation for speaker recognition," Pattern Recognition Letters, vol. 49, pp. 17–23, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865514001779
[21] H. Delgado, X. Anguera, C. Fredouille, and J. Serrano, "Fast single- and cross-show speaker diarization using binary key speaker modeling," IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 12, pp. 2286–2297, 2015.
[22] G. Gelly and J.-L. Gauvain, "Spoken Language Identification using LSTM-based Angular Proximity," in Proc. INTERSPEECH, August 2017.
[23] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. INTERSPEECH, September 2015, pp. 3214–3218.
[24] R. Yin, H. Bredin, and C. Barras, "Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization," in Proc. INTERSPEECH, Hyderabad, India, September 2018.
[25] B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science, vol. 315, no. 5814, pp. 972–976, 2007.
[26] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Proc. INTERSPEECH, August 2017.
[27] Z. Zajíc, M. Kunešová, J. Zelinka, and M. Hrúz, "ZCU-NTIS Speaker Diarization System for the DIHARD 2018 Challenge," in Proc. INTERSPEECH, 2018, pp. 2788–2792.
[28] I. Viñals, P. Gimeno, A. Ortega, A. Miguel, and E. Lleida, "Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge," in Proc. INTERSPEECH, 2018, pp. 2803–2807.
[29] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Žmolíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mošner, and P. Matějka, "BUT System for DIHARD Speech Diarization Challenge 2018," in Proc. INTERSPEECH, 2018, pp. 2798–2802.

215
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

EML Submission to Albayzin 2018 Speaker Diarization Challenge


Omid Ghahabi, Volker Fischer

EML European Media Laboratory GmbH, Berliner Straße 45, 69120 Heidelberg, Germany
omid.ghahabi@eml.org; volker.fischer@eml.org

Abstract

Speaker diarization, who is speaking when, is one of the most challenging tasks in speaker recognition, as usually no prior information is available about the identity and the number of the speakers in an audio recording. The task becomes more challenging when there is noise or music in the background and the speakers change more frequently. This usually happens in broadcast news conversations. In this paper, we use the EML speaker diarization system as a participation in the recent Albayzin Evaluation challenge. The EML system uses a real-time robust algorithm to make decisions about the identity of the speakers approximately every 2 sec. The experimental results on about 16 hours of the development data provided in the challenge show a reasonable accuracy of the system with a very low computational cost.

Index Terms: Speaker Diarization, Albayzin Evaluation, Online.

1. Introduction

Speaker diarization is the process of identifying who is speaking when during an audio recording. Compared to other speaker recognition tasks, speaker diarization is usually more difficult as there is no prior knowledge about the number and the identity of the speakers speaking in the audio recording. The task has received more attention from the speech community with the increasing number of broadcast, meeting, and call center recordings collected every year.

Albayzin evaluations are organized every two years, aim to challenge the current unsolved problems in the speech processing area, and are supported by the Spanish Thematic Network on Speech Technology (RTTH). One of the challenges this year has been speaker diarization of broadcast news recordings. The task itself is not new, but it gets really challenging when the data is noisy, there is background music while the speakers talk, and when speakers speak at the same time. Unlike the last Albayzin speaker diarization evaluation, the speech/non-speech labels of the audio recordings are not available, which makes the task more difficult.

In addition to the data of the previous evaluations, new data is provided by the public Spanish National Television (RTVE) this year. The database comprises different programs broadcast by RTVE from 2015 to 2018. The programs cover a great variety of scenarios from studio to live broadcast, from read speech to spontaneous speech, different Spanish accents, including Latin-American accents, and a great variety of contents [1].

Like in previous evaluations, two participating conditions are possible, closed-set and open-set. In the closed-set condition, only the data provided in the challenge is allowed to be used for the system development, while for the open-set condition, the participating sites can also use other publicly available datasets in addition to the data provided in the challenge. We have participated only in the closed-set condition in this paper.

We have used the EML Online speaker diarization system in this challenge. The system decides about the identity of the speakers once every approximately 2 sec without any knowledge about the upcoming segments. The system uses the robust Voice Activity Detection (VAD) proposed in [2]. The algorithm used in the system is robust with a very low computational cost. The whole speaker diarization process including feature extraction is performed in approximately 0.01xRT.

The rest of the paper is organized as follows. Section 2 describes the databases used to build the system. Section 3 briefly describes the algorithm and every part of the EML speaker diarization system. Section 4 explains the performance measurement metric and the software used for the evaluation. The experimental results are summarized in Section 5. Section 6 concludes the paper.

2. Database

Three sets of data are provided in the challenge for training and development of speaker diarization (SD) systems. The first set is about 440 hours of unlabeled broadcast news recordings. The second one is about 75 hours automatically labeled with another SD system. The last dataset is about 16 hours of human-revised labeled data. The final evaluation data set is also about 16 hours of recordings from other channels. These data are collected from the RTVE2018 [1], Aragon Radio, and 3/24 TV channel databases. The details are described as follows.

2.1. Unlabeled Data

The train partition of RTVE2018 [1] is a collection of TV shows drawn from diverse genres and broadcast by RTVE from 2015 to 2018. This partition is unlabeled and can be used for any evaluation task in Albayzin 2018 [1]. The titles, duration and content of the shows included in the RTVE2018 database can be found in [1].

2.2. Automatically Labeled Data

This partition consists of the Aragon Radio and 3/24 TV channel databases. The Aragon Radio database, which was donated by the Corporación Aragonesa de Radio y Televisión (CARTV), consists of around 20 hours of the Aragon Radio broadcast. About 35% of the audio recordings contain music along with speech, 13% is noise along with speech and 22% is speech alone.

The 3/24 TV channel is a Catalan broadcast news database proposed for the Albayzin 2010 Audio Segmentation Evaluation [3]. The owner of the multimedia content allows its use for technology research and development. The database consists of around 87 hours of recordings in which about 40% of the time speech can be found along with noise and 15% of the time speech along with music.

The Rich Transcription Time Marked (RTTM) files containing the segment information are generated automatically by another SD system and are provided in the challenge along with the audio recordings.

2.3. Human-Revised Labeled Data

The dev2 partition of RTVE2018 [1] contains 12 audio recordings along with their human-revised RTTM files. This partition corresponds to two different debate shows: four programs (7:26 hours) of La noche en 24H, where a group of political analysts comments on what has happened throughout the day, and eight programs (7:42 hours) of Millenium, where a group of experts debates a current issue. The audio recordings consist of speech, music, noise, and combinations of them.

3. EML Speaker Diarization System

We have used the EML Online speaker diarization system, in which the audio recording is processed every 0.1 sec and the decision for the speaker ID is made approximately every 2 sec. In summary, every 0.1 sec the Voice Activity Detection (VAD) algorithm decides if the current segment is speech or non-speech. If it is non-speech, it is discarded; otherwise the Baum-Welch statistics are computed and accumulated over speech segments until the predefined maximum speech duration is reached. Then the accumulated statistics are converted to a supervector which is further mean normalized by the UBM mean supervector. The resulting supervector is then converted to a lower dimensional speaker vector given a transformation matrix. The process of speaker vector extraction is shown in Fig. 1. Although the same feature vectors can be used for both VAD and supervectors, we have used two different ones, referred to as VAD and SPK features in Fig. 1.

Figure 1: Speaker vector extraction in the EML Online speaker diarization system (VAD features and zero-order UBM statistics drive the speech/non-speech decision; for speech, SPK features and zero- and first-order UBM statistics are accumulated until the maximum speech duration is reached, after which the speaker vector is extracted).

The speaker vector is compared with the current speaker models using cosine similarity. If the speaker vector does not belong to any speaker, based on a speaker-dependent threshold, a new model will be created. Before a new model is created, the speech segment is divided into two halves. For each half, a speaker vector is created and compared with the other using cosine similarity. If the two halves are similar enough in terms of speaker identity, the statistics are merged and the new model is created. Otherwise, each half is assigned to one of the current speaker models. In other words, a new speaker model is created only if the two halves are similar enough. In the algorithm, every speaker has its own threshold, which is updated over time. All newly created speaker models are assigned a fixed starting threshold. This threshold is then updated based on the average scores of the assigned speaker vectors over time. More details about every part of the algorithm are given as follows.

3.1. Feature Extraction and Universal Background Models

Feature vectors are extracted every 10 msec with a 25 msec window. For VAD, they are 16-dimensional MFCCs along with their deltas; for speaker vectors they are 30-dimensional static MFCCs. Features are mean normalized with a 3 sec sliding window. The UBMs for both VAD and speaker vectors are GMMs with 64 Gaussian mixtures each. They are trained using the unlabeled dataset described in Sec. 2.1.

3.2. Voice Activity Detection

We have used the hybrid supervised/unsupervised VAD proposed in [2]. In summary, the VAD model is based on zero-order Baum-Welch statistics obtained from the UBM. Given the speech/non-speech labels from the automatically labeled data (Sec. 2.2), a single 64-dimensional vector of zero-order statistics is obtained for each speech and non-speech class. In the test phase, the zero-order statistics of a given audio segment are computed and compared with the speech and non-speech vectors based on the cosine similarity.

3.3. Speaker Vectors

Speaker vectors in this algorithm are obtained by the transformation of GMM supervectors. Supervectors are first transformed by a Within-Class Covariance Normalization (WCCN) matrix and then by a Linear Discriminant Analysis (LDA) matrix to lower dimensional vectors referred to as speaker vectors in this paper. The automatically labeled data (Sec. 2.2) is used to train WCCN and LDA. The whole data is divided into two non-overlapping datasets and chopped into 0.5, 1, 1.5, and 2 sec segments. For each segment one supervector is created. One dataset is used for training the WCCN and the other for the LDA. We have used 300-dimensional speaker vectors in this challenge.

3.4. Scoring

The speaker vectors are compared using cosine similarity. However, the cosine score is not stable enough to make a robust decision. Therefore, we have created a set of background speaker vectors to normalize these scores effectively before decision. The background speaker vectors are obtained on the unlabeled dataset (Sec. 2.1). All the speech segments are chopped into 2 sec segments and converted to 300-dimensional speaker vectors. Afterwards, the resulting speaker vectors are clustered using a two-stage unsupervised clustering technique which was used to estimate the speaker labels of the background data for training the Probabilistic Linear Discriminant Analysis (PLDA) in [4].
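The online decision logic of Section 3 can be approximated by the following sketch. It is not the EML implementation: the starting threshold, the threshold update rule and the handling of disagreeing halves are assumptions, and the extraction of speaker vectors (supervector, WCCN, LDA) is assumed to happen upstream.

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    class OnlineDiarizer:
        # Cosine scoring against current speaker models with per-speaker
        # thresholds, and a two-halves check before opening a new model.
        def __init__(self, start_threshold=0.3):
            self.start_threshold = start_threshold
            self.models = []            # list of (centroid, threshold, count)

        def _best(self, vec):
            if not self.models:
                return None, -1.0
            scores = [cos(vec, m[0]) for m in self.models]
            idx = int(np.argmax(scores))
            return idx, scores[idx]

        def step(self, vec, first_half, second_half):
            idx, score = self._best(vec)
            if idx is not None and score >= self.models[idx][1]:
                centroid, thr, n = self.models[idx]
                new_centroid = (centroid * n + vec) / (n + 1)
                new_thr = 0.9 * thr + 0.1 * score   # slow threshold update (illustrative)
                self.models[idx] = (new_centroid, new_thr, n + 1)
                return idx
            if cos(first_half, second_half) >= self.start_threshold:
                self.models.append((vec, self.start_threshold, 1))   # new speaker
                return len(self.models) - 1
            # Halves disagree: fall back to the closest existing model (may be None).
            return self._best(first_half)[0]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        d = OnlineDiarizer()
        v = rng.normal(size=300)
        print(d.step(v, v + 0.01 * rng.normal(size=300), v + 0.01 * rng.normal(size=300)))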

Table 1: The performance of the EML diarization system on the dev2 partition of the RTVE2018 database.

    Response Time   DER (%): La noche 24H   DER (%): Millenium   DER (%): Total   Total Processing Time   xRT
    2 sec           32.65                   12.24                22.12            00:12:37                0.014
    4 sec           24.28                   10.32                17.02            00:11:12                0.012

The first stage of the clustering algorithm is similar to the Mean Shift based algorithm proposed in [5] and used successfully in [6]. In the second stage, the closer clusters obtained in the first stage are combined. The second stage can be iterated for a few iterations or until no further merge is possible. In both stages, speaker vectors are joined based on the cosine similarity considering a threshold, which is set to 0.350 and 0.300 for stages 1 and 2, respectively. After clustering, the centroids of the top 2048 clusters with the highest number of members are considered as the final background speaker vectors.

Given the background speaker vectors, we perform a semi S-normalization on the scores before decision. The test speaker vector and the speaker models are both compared with the background speaker vectors using cosine similarity. The cosine score between the test speaker vector and the speaker model is normalized once with the mean score of the top 10 closest background speaker vectors to the test speaker vector, and another time with the mean score of the top 10 closest background speaker vectors to the speaker model. The final score is the average of these two normalized scores.

4. Performance Measurement

As in the NIST RT diarization evaluations, the Diarization Error Rate (DER) is used for the performance measurement in the challenge. The DER includes the time that is assigned to the wrong speaker, missed speech time and false alarm speech time.

The speaker error time is the amount of time that has been assigned to an incorrect speaker. This error can occur in segments where the number of system speakers is greater than the number of reference speakers, but also in segments where the number of system speakers is lower than the number of reference speakers, whenever the number of system speakers and the number of reference speakers are both greater than zero.

The missed speech time refers to the amount of time that speech is present but not labeled by the diarization system, in segments where the number of system speakers is lower than the number of reference speakers.

The false alarm time is the amount of time that a speaker has been labeled by the diarization system but is not present, in segments where the number of system speakers is greater than the number of reference speakers.

As defined in the challenge [7], consecutive segments of the same speaker with a silence of less than 2 sec are merged and considered as a single segment. A forgiveness collar of 0.25 sec, before and after each reference boundary, is applied in order to take into account both inconsistent human annotations and the uncertainty about when a speaker begins or ends. Overlap regions where more than one speaker is present are also taken into account for the evaluation.

The tool used for evaluating the diarization systems is the one developed for the RT diarization evaluations by NIST, md-eval-v22.pl, available on the web site of the evaluation: http://catedrartve.unizar.es/reto2018. The command line for the evaluation is as follows:

    md-eval-v22.pl -b -c 0.25 -r reference.rttm -s system.rttm    (1)

5. Experimental Results

We have used the human-revised labeled data (Sec. 2.3) for the evaluation of the diarization system. It contains 12 audio recordings from two different channels with a total duration of approximately 16 hours. The audio recordings include speech, music, silence, noise, cross talk (sometimes more than two speakers at the same time), and speech over music.

Two different participating conditions are proposed in this challenge: a closed-set condition, in which only data provided within the Albayzin evaluation can be used for training, and an open-set condition, in which external data can also be used for training as long as they are publicly accessible to everyone. We have participated only in the closed-set condition.

The EML speaker diarization system is primarily designed for an online application, for which robustness, computational cost, and response time are important. In the primary submitted system, the decision about the identity of the speakers is made approximately every 2 sec without looking at the future of the audio recording. As the algorithm divides the speaker vectors into two halves before creating a new speaker model, the resolution for speaker change point detection is about 1 sec. However, as the response time is not important in the challenge, we can increase it to a longer time, but this would correspond to losing fast speaker turns in the audio recording.

The development set (Sec. 2.3) includes audio recordings from two Spanish programs, La noche en 24H and Millenium. The experimental results showed higher average DER on La noche en 24H recordings than on Millenium recordings. This could be due to the longer duration of the audio signals (2 hours each compared to 1 hour each for Millenium), faster speaker turns, more cross talk, more background music, or something else which needs more investigation.

Table 1 summarizes the average DER on each program, the total DER obtained on the entire dev2 set of the RTVE2018 dataset considering the duration of the audio recordings, and the total computational time used for processing all the recordings from scratch including the feature extraction. The processing is done using a single core of an Intel(R) Xeon(R) CPU @2.10GHz.

6. Conclusions

We used the EML Online speaker diarization system as a participation in the recent Albayzin speaker diarization evaluation. We tried to take advantage of all the unlabeled and labeled data provided in the challenge in the closed-set condition. The system showed a reasonable performance on the development data with a very low computational cost.
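The adaptive score normalization of Section 3.4 can be sketched as follows. The text does not specify the exact normalization applied with each cohort mean, so this hedged example simply subtracts the mean of the top-10 cohort scores on each side and averages the two results; the cohort size and vector dimension follow the paper, the rest is assumed.

    import numpy as np

    def _cos_to_cohort(x, cohort):
        x = x / np.linalg.norm(x)
        c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
        return c @ x

    def snorm_score(test_vec, model_vec, background, top_n=10):
        # Cosine score between a test speaker vector and a speaker model,
        # normalized with the mean of the top-N closest background scores
        # on each side, then averaged.
        raw = float(np.dot(test_vec, model_vec) /
                    (np.linalg.norm(test_vec) * np.linalg.norm(model_vec)))
        cohort_test = np.sort(_cos_to_cohort(test_vec, background))[-top_n:]
        cohort_model = np.sort(_cos_to_cohort(model_vec, background))[-top_n:]
        return 0.5 * ((raw - cohort_test.mean()) + (raw - cohort_model.mean()))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        bg = rng.normal(size=(2048, 300))          # background speaker vectors
        t, m = rng.normal(size=300), rng.normal(size=300)
        print(round(snorm_score(t, m, bg), 4))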

7. Acknowledgement

We would like to thank Wei Zhou for the efficient implementation of the Online algorithm.

8. References

[1] E. Lleida, A. Ortega, A. Miguel, V. Bazan, C. Perez, M. Zotano, and A. Prada, "RTVE2018 database description," 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf
[2] O. Ghahabi, W. Zhou, and V. Fischer, "A robust voice activity detection for real-time automatic speech recognition," in Proc. ESSV, 2018.
[3] M. Zelenak, H. Schulz, and F. J. Hernando Pericas, "Albayzin 2010 evaluation campaign: speaker diarization," in VI Jornadas en Tecnología del Habla, 2010, pp. 301–304.
[4] O. Ghahabi and J. Hernando, "Deep learning backend for single and multisession i-vector speaker recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 807–817, 2017.
[5] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, "A study of the cosine distance-based mean shift for telephone speech diarization," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 1, pp. 217–227, 2014.
[6] S. Novoselov, T. Pekhovsky, and K. Simonchik, "STC speaker recognition system for the NIST i-vector challenge," in Odyssey: The Speaker and Language Recognition Workshop, 2014, pp. 231–240.
[7] A. Ortega, I. Vinals, A. Miguel, E. Lleida, V. Bazan, C. Perez, M. Zotano, and A. Prada, "Albayzin evaluation: IberSpeech-RTVE 2018 speaker diarization challenge," 2018. [Online]. Available: http://catedrartve.unizar.es/reto2018/EvalPlan-SpeakerDiarization-v1.3.pdf


In-domain Adaptation Solutions for the RTVE 2018 Diarization Challenge


Ignacio Viñals, Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{ivinalsb, pablogj, ortega, amiguel, lleida}@unizar.es

Abstract

This paper addresses domain mismatch scenarios in the diarization task. This research has been carried out in the context of the Radio Televisión Española (RTVE) 2018 Challenge at IberSpeech 2018. The evaluation seeks to improve diarization on broadcast corpora, known to contain multiple unknown speakers. These speakers contribute in different scenarios, genres, media and languages. The evaluation offers two different conditions: a closed one with restrictions on the resources allowed to train and develop diarization systems, and an open condition without restrictions to check the latest improvements in the state of the art.

Our proposal is centered on the closed condition, especially dealing with two important mismatches: media and language. The ViVoLab system for the challenge is based on the i-vector PLDA framework: i-vectors are extracted from the input audio according to a given segmentation, assuming that each segment represents one speaker intervention. The diarization hypotheses are obtained by clustering the estimated i-vectors with a Fully Bayesian PLDA, a generative model whose latent variables act as speaker labels. The number of speakers is decided by comparing multiple hypotheses according to the Evidence Lower Bound (ELBO) provided by the PLDA, penalized in terms of the number of hypothesized speakers to compensate for different modeling capabilities.

Index Terms: adaptation, diarization, broadcast, i-vector, PLDA, Variational Bayes

1. Introduction

The production of broadcast content has grown steadily over the last years, making the tools to process and label all these new data more and more necessary. One of the required tasks is diarization, the indexation of an audio stream according to the active speaker. The goal of diarization is therefore to differentiate among speakers by means of generic labels, leaving the identification of each speaker for further work, if necessary. Originally developed for telephone conversations, the technique is also of interest in new domains such as broadcast audio and meetings, which add challenging drawbacks not present in the original scenario.

Multiple approaches have been proposed for the diarization problem since its origins, most of them following two main strategies: the top-down philosophy, which obtains the final labels by splitting an initial hypothesis with only one speaker, and the bottom-up strategy, which initially divides the input audio into acoustic segments containing only one speaker each and combines them afterwards. Further information is available in [1][2]. Both philosophies need to characterize the speakers in the different parts of the audio to make any decision. For this reason diarization applies many methods developed for speaker recognition. Successful diarization systems based on these technologies include Agglomerative Hierarchical Clustering (AHC) [3] with ∆BIC [4], streams of eigenvoices [5] re-segmented with HMMs [6], and i-vectors [7] clustered with PLDA [8], [9] in [10]. Neural networks are also contributing, first by providing more reliable acoustic information [11] and more recently a new representation, embeddings such as x-vectors [12].

When moving from telephone data to other scenarios, such as broadcast or meetings, new difficulties arise. Especially relevant are the estimation of the number of speakers and the domain mismatch. The first problem is caused by the presence of an unknown number of speakers in the audio, and becomes harder if the contributions per speaker are significantly unbalanced. Our proposed solution to this problem is [13], which uses a penalized version of the Evidence Lower Bound (ELBO) from a Variational Bayes solution as a reliability metric. Another important problem is the domain mismatch. Broadcast data consists of several shows, belonging to multiple genres and differing in terms of locations, audio quality or postprocessing details. Given this large variability in the audio, a single system is likely to lack the precision to cover the whole range of possibilities; however, show-specific systems are also unfeasible for practical reasons. Our approach [14] combines both strategies: a single system is adapted in an unsupervised manner to the different shows to diarize, using the same data to be evaluated, and the diarization labels are obtained afterwards.

The paper is organized as follows: Section 2 describes the evaluation and the available data. The ViVoLab system is presented in Section 3. Section 4 presents the obtained results. Finally, some conclusions are included in Section 5.

This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the 2015 FPI fellowship, the project TIN2017-85854-C4-1-R and Gobierno de Aragón/FEDER (research group T36 17R).

2. RTVE 2018 Challenge

The RTVE 2018 Challenge is part of the 2018 edition of the Albayzin evaluations [15], [16], [17], [18], [19]. These evaluations are designed to promote the evolution of speech technologies in Iberian languages. In particular, the RTVE 2018 Challenge is focused on the extraction of relevant information from broadcast data in Spanish. This information, such as the identity of the person on screen and his or her speech, is intended to help describe and label the multimedia data for further work. To accomplish these goals, the evaluation provides around 500 hours of shows from the Spanish public TV corporation Radio Televisión Española (RTVE). The considered audio tries to cover the widest possible range of Spanish variability, including varieties of Spanish from Spain and Latin America. In addition to the provided audio, some metadata is provided, with different levels of reliability.

The database is divided into 4 subsets, with different functionality:

• Train: 423 episodes from 16 shows. The included metadata consists of the broadcasted subtitles.
• Dev1: 47 episodes from 5 shows, manually transcribed.
• Dev2: 12 episodes from 2 shows, manually transcribed and with speaker time references.
• Test: 61 episodes from 9 shows. Regarding the diarization task, only 40 episodes from 4 shows are involved.

The RTVE dataset is complemented with data collected for previous Albayzin evaluations. These data are:

• 3/24 TV channel [20]. Approximately 84 hours of broadcast TV from the 3/24 channel in Catalan. Ground truth speaker marks are provided as metadata.
• CARTV. 20 hours of radio broadcast, obtained from the Corporación Aragonesa de Radio y Televisión broadcasted signal. The corresponding metadata consists of manually transcribed speaker time marks.

3. System description

The ViVoLab submission is based on the diarization system first developed in [10]. This system, based on a bottom-up strategy, first divides the input audio into acoustically homogeneous segments, which are clustered afterwards by means of a Fully Bayesian PLDA. Its schematic is shown in Fig. 1. In the following lines we describe in detail each part of the schematic.

Figure 1: Schematic of the ViVoLab diarization system (input signal sn(t), VAD, MFCC feature extraction, segmentation, i-vector extraction and clustering).

3.1. Feature Extraction

From the input audio sn(t), 20 MFCC feature vectors (C0–C19, including C0) are extracted over a 25 ms Hamming window every 10 ms (15 ms overlap). No derivatives are considered. The obtained features are normalized according to a Cepstral Mean and Variance Normalization (CMVN). The mean and variance are estimated taking into account the whole episode to diarize.

3.2. Voice Activity Detection

Voice Activity Detection (VAD) is performed by means of a 2-layer BLSTM [21] network of 128 neurons, trained on the 3/24 database. This network estimates the speech detection in 3-second duration sequences, providing one label every 10 ms of audio. The input parametrization is an 80-element Mel filter bank plus the log energy, computed in 25 ms windows with a 10 ms advance shift. The obtained features are normalized in mean and variance according to the whole audio.

3.3. Segment Representation

The speaker change detection is carried out making use of a ∆BIC [4] analysis, modeling the hypotheses with full-covariance Gaussian distributions. A sliding window strategy has been considered for this purpose. Each one of the obtained acoustic segments is represented by an i-vector [7], with a 256-Gaussian 100-dimension i-vector extractor exclusively trained with the 3/24 TV dataset. The obtained i-vectors have centering, whitening and length normalization [22] applied.

3.4. Clustering Method

The i-vector clustering is performed by a 100-dimension Fully Bayesian PLDA [9] [10] trained with the 3/24 and CARTV datasets. Its Bayesian network is shown in Fig. 2.

Figure 2: Fully Bayesian PLDA (Bayesian network with i-vectors Φn, assignment variables θ, speaker variables Y, mixing weights πθ, and model parameters µ, V, W, α).

The high complexity of the model makes the Maximum Likelihood solution difficult to obtain. Therefore, the original paper suggests performing a Variational Bayes decomposition instead. This decomposition approximates the likelihood of the model by a product of factors q, each one depending on fewer hidden variables than the original distribution. In our case, the proposed decomposition is:

P(Φ, Y, θ, πθ, µ, V, W, α) ≈ q(Y) q(θ) q(πθ) q(µ) q(V) q(W) q(α)    (1)

From all these factors, there is a set (q(µ), q(V), q(W), q(α)) related to the Bayesian interpretation of the PLDA model parameters, while another set (q(θ), q(πθ), q(Y)) is related to the assignment of i-vectors in terms of soft speaker labels. The latter group is the base of our diarization system.

The whole clustering step consists of two differentiated steps. First we generate an initialization assignment for the i-vectors, which becomes a seed for the corresponding hidden variable θ. Then all the related factors (q(θ), q(πθ), q(Y)) are iteratively reevaluated, reassigning the speaker labels in the process.

3.5. Unsupervised Adaptation

Due to the fact that known mismatches are present in the evaluation (language regarding the 3/24 dataset and media regarding the CARTV dataset), some losses in performance could be expected. We include an unsupervised adaptation [14] in the system. Prior to any evaluation, the out-of-domain PLDA is adapted to the evaluation domain considering the episode to diarize itself. Performed in the PLDA model, the adaptation requires speaker labels, which are obtained by unsupervised clustering. In this evaluation, the unsupervised labels are generated considering the diarization labels without adaptation.

3.6. Speaker number estimation

The considered PLDA and its Variational Bayes solution have the ability to recombine and eliminate speakers at will, only (and strongly) depending on the initial clustering. This initial clustering determines the maximum possible clusters to create as well as the first assignment of the i-vectors to these clusters. Our solution to this problem is the consideration of multiple initializations, with a different number of speakers. In order to

best compare the results from each hypothesis, we make use of a penalized version of the Evidence Lower Bound (ELBO) given by the PLDA model [13]. This metric, related to the likelihood, indicates how well the speaker labels represent the given data, but penalized in terms of the hypothesized number of speakers to avoid unnecessary subclustering, i.e., estimating more clusters than the real number of speakers.

4. Results

The ViVoLab submission to the RTVE 2018 Diarization Challenge consists of two systems, a primary and a contrastive. Both follow the pipeline described previously, with the only difference that our primary system performs unsupervised adaptation while the contrastive does not. All the models were trained with 3/24 and CARTV data and no extra knowledge was considered, not even that provided by the RTVE data.

The results obtained by the previously described configurations are shown in Table 1.

Table 1: DER (%) results for the primary and contrastive 1 systems on the development set. Results are presented with ground truth VAD and with our BLSTM VAD for comparison. Results include the DER as well as its contributions: missed speech (MISS), false alarm speech (F.A.) and speaker error (SPK). Overlap is considered for evaluation purposes.

SYSTEM        MISS (%)   F.A. (%)   SPK (%)   DER (%)
ORACLE VAD
Primary       1.78       0.00       9.30      11.08
Contrastive   1.78       0.00       8.05      9.83
BLSTM VAD
Primary       2.51       1.74       6.69      10.94
Contrastive   2.51       1.74       9.99      14.24

An analysis of the obtained results in terms of the estimation of the number of speakers can be done. A simple analysis is available in Table 2.

Table 2: Absolute and relative error in the estimation of the number of speakers for both the primary and contrastive systems. Error defined as #diar − #oracle. Ground truth (ORACLE) and BLSTM VAD conditions are studied. Analysis carried out on the development set.

System        Absolute Error   Relative Error (%)
ORACLE VAD
Primary       12.91            97.87
Contrastive   1.83             16.40
BLSTM VAD
Primary       18.16            132.83
Contrastive   7.5              56.12

5. Conclusions

The ViVoLab submission to the RTVE 2018 Challenge includes two systems that obtain satisfactory results on the development set.

According to the development set, both systems are robust enough to deal with the subset mismatches, especially those known during the training phase: language and media. The trained models perform well while evaluating unknown audios from a new domain. Besides, when moving from ground truth VAD to a noisy one, a new sort of audio mismatch is introduced: i-vectors containing non-speech. In this situation the unsupervised adaptation contributes by learning from the evaluation data and adapting the model to the new conditions. This adaptation prevents the system from being degraded by the new scenario. In real conditions with noisy VAD, the adaptive solution obtains a 24% relative improvement with respect to the non-adaptive contrastive system.

Regarding the estimation of the number of speakers, the performed analysis indicates that our systems are likely to subcluster, i.e., hypothesize a larger number of clusters than the real value. If some of these extra clusters are dedicated to collecting strange segments, the main clusters remain pure and clean and compensate any loss in performance. Therefore some subclustering could help avoid relevant mistakes (primary system vs contrastive system using ground truth VAD). However, our results also indicate that an excessive subclustering causes a strong degradation of the performance (contrastive system with BLSTM VAD). Further research should be done to provide a deeper understanding.

6. References

[1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, "Speaker Diarization: A Review of Recent Research," IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 2, pp. 356–370, 2012.
[2] S. E. Tranter and D. Reynolds, "An Overview of Automatic Speaker Diarization Systems," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1557–1565, 2006.
[3] D. Reynolds and P. Torres-Carrasquillo, "Approaches and Applications of Audio Diarization," ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. V, pp. 953–956, 2005.
[4] S. Chen and P. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering Via the Bayesian Information Criterion," Proc. DARPA Broadcast News Transcription and Understanding Workshop, vol. 6, pp. 127–132, 1998.
[5] P. Kenny, "Bayesian Speaker Verification with Heavy-Tailed Priors," Keynote presentation, Odyssey Speaker and Language Recognition Workshop, 2010.
[6] C. Vaquero, Robust diarization for speaker characterization. PhD thesis, 2011.
[7] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[8] S. J. D. Prince and J. H. Elder, "Probabilistic Linear Discriminant Analysis for Inferences About Identity," in Proceedings of the IEEE International Conference on Computer Vision, 2007.
[9] J. Villalba and E. Lleida, "Unsupervised Adaptation of PLDA By Using Variational Bayes Methods," in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 744–748, 2014.
[10] J. Villalba, A. Ortega, A. Miguel, and E. Lleida, "Variational Bayesian PLDA for Speaker Diarization in the MGB Challenge," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 667–674, 2015.
[11] Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, vol. 73, pp. 1–13, 2015.
[12] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep Neural Network-based Speaker Embeddings for End-to-end Speaker Verification," IEEE

Spoken Language Technology Workshop (SLT), pp. 165–170,
2016.
[13] I. Viñals, P. Gimeno, A. Ortega, A. Miguel, and E. Lleida, “Es-
timation of the Number of Speakers with Variational Bayesian
PLDA in the DIHARD Diarization Challenge,” Interspeech,
no. September, pp. 2803–2807, 2018.
[14] I. Viñals, A. Ortega, J. Villalba, A. Miguel, and E. Lleida,
“Domain Adaptation of PLDA models in Broadcast Diariza-
tion by means of Unsupervised Speaker Clustering,” Interspeech,
pp. 2829–2833, 2017.
[15] A. Ortega, D. Castan, A. Miguel, and E. Lleida, “The Albayzin
2012 Audio Segmentation Evaluation,” 2012.
[16] J. Tejedor and D. T. Toledano, “The ALBAYZIN 2014 Search on
Speech Evaluation,” no. November, 2014.
[17] A. Ortega, D. Castan, A. Miguel, and E. Lleida, “The Albayzin
2014 Audio Segmentation Evaluation,” 2014.
[18] J. Tejedor and D. T. Toledano, “The ALBAYZIN 2016 Search on
Speech Evaluation,” 2016.
[19] A. Ortega, I. Viñals, A. Miguel, and E. Lleida, “The Albayzin
2016 Speaker Diarization Evaluation,” 2016.
[20] M. Zelenák, H. Schulz, and J. Hernando, “Speaker diarization of
broadcast news in Albayzin 2010 evaluation campaign,” Eurasip
Journal on Audio, Speech, and Music Processing, vol. 2012, no. 1,
pp. 1–9, 2012.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1–32, 1997.
[22] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of I-
vector Length Normalization in Speaker Recognition Systems,”
in Proceedings of the Annual Conference of the International
Speech Communication Association, INTERSPEECH, pp. 249–
252, 2011.
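The speaker change detection of Section 3.3 above compares a one-speaker hypothesis against a two-speaker hypothesis with a ∆BIC analysis over a sliding window, using full-covariance Gaussians. A minimal sketch of that comparison is given below; it assumes the standard ∆BIC formulation with a tunable penalty weight, and the window length, hop and threshold are illustrative placeholders rather than the authors' exact configuration.

```python
import numpy as np

def delta_bic(x_left, x_right, lam=1.0):
    """Delta-BIC between the 'one Gaussian' and 'two Gaussians' hypotheses.

    x_left, x_right: (n_frames, n_dims) feature matrices for the two halves
    of the analysis window, each modeled by a full-covariance Gaussian.
    Positive values favour a speaker change at the split point.
    """
    x = np.vstack([x_left, x_right])
    n, d = x.shape
    n1, n2 = len(x_left), len(x_right)

    def logdet_cov(m):
        cov = np.cov(m, rowvar=False) + 1e-6 * np.eye(d)  # light regularization
        return np.linalg.slogdet(cov)[1]

    gain = 0.5 * (n * logdet_cov(x) - n1 * logdet_cov(x_left) - n2 * logdet_cov(x_right))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty

def change_points(features, win=300, hop=10, lam=1.0):
    """Slide a window over the feature matrix and flag splits with positive Delta-BIC."""
    candidates = []
    for start in range(0, len(features) - win, hop):
        mid = start + win // 2
        score = delta_bic(features[start:mid], features[mid:start + win], lam)
        if score > 0:
            candidates.append(mid)
    return candidates
```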


DNN-based Embeddings for Speaker Diarization in the AuDIaS-UAM System


for the Albayzin 2018 IberSPEECH-RTVE Evaluation
Alicia Lozano-Diez, Beltran Labrador, Diego de Benito, Pablo Ramirez, Doroteo T. Toledano

AuDIaS - Audio, Data Intelligence and Speech Research Group,


Universidad Autonoma de Madrid (UAM), Madrid, Spain
alicia.lozano@uam.es

Abstract

This document describes the three systems submitted by the AuDIaS-UAM team for the Albayzin 2018 IberSPEECH-RTVE speaker diarization evaluation. Two of our systems (primary and contrastive 1 submissions) are based on embeddings, which are a fixed-length representation of a given audio segment obtained from a deep neural network (DNN) trained for speaker classification. The third system (contrastive 2) uses the classical i-vector as representation of the audio segments. The resulting embeddings or i-vectors are then grouped using Agglomerative Hierarchical Clustering (AHC) in order to obtain the diarization labels. The new DNN-embedding approach for speaker diarization has obtained a remarkable performance over the Albayzin development dataset, similar to the performance achieved with the well-known i-vector approach.

Index Terms: speaker diarization, embeddings, i-vectors, AHC

1. Introduction

The AuDIaS-UAM submission for the Speaker Diarization (SD) evaluation consisted of three different systems, two of them based on embeddings (also known as x-vectors) [1] extracted from a Deep Neural Network (DNN) trained for speaker classification, and a third one based on the classical total variability i-vector model [2].

Our systems are submitted for the closed-set condition since they are trained using the training and development datasets made available for this evaluation, briefly described in Section 2.

For all our systems, we extract frame-level features as described in Section 3. Then, using those features, we train either a DNN or an i-vector extractor in order to obtain a fixed-length representation of an audio segment (regardless of its duration), as presented in Sections 4 and 5, respectively. These models are trained using a segmentation based on the reference labels (RTTM files) provided by the organizers for training and development purposes.

We kept three development files from the RTVE dataset as a held-out set for diarization performance evaluation. For these recordings, in order to discard fragments where just music (without speech) is present, we developed a DNN-based music detector described in Section 6. Then, diarization labels are obtained by means of Agglomerative Hierarchical Clustering (AHC) performed over non-music segments. This last step is summarized in Section 7. Finally, in Section 8 we show results of the AuDIaS-UAM submitted systems over our development dataset.

2. Training and Development Datasets

We used the three datasets provided by the organizers for this evaluation: Aragón Radio, 3/24 TV channel and RTVE 2018 (dev2) [3]. For training, we used all data from the first two databases and 8 files (out of 12) from RTVE 2018. This set was segmented according to the time alignments specified by the RTTM files.

In order to evaluate the speaker diarization performance of our systems, we used 3 files from the RTVE 2018 development set (approximately 4 hours) not used for training.

All the audio files were down-sampled to 16 kHz.

3. Feature Extraction

All our systems are based on MFCC features extracted using Kaldi [4]. Each feature vector consists of 20 MFCCs (including C0), computed every 10 ms with a 25 ms "Povey" window (the default in Kaldi, similar to a Hamming window).

For the i-vector system, these 20-dimensional features are normalized using cepstral mean normalization over a 3 s sliding window, and augmented with their first and second derivatives (∆ and ∆∆), providing a final 60-dimensional feature vector.

For the DNN-embedding systems, the raw 20-dimensional MFCC feature vectors are used to feed the network without applying channel compensation or adding temporal information. However, global zero-mean and unit-variance normalization is performed over the whole training set.

4. DNN-based Embedding Systems

Two of our submitted systems are based on DNN-based embeddings [1]. An embedding is a fixed-length representation of a given utterance or audio segment learned directly by a DNN. Typically, this DNN is trained for speaker classification.

In our case, we used an architecture based on Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks similar to the one used in [5] for language recognition, whose configuration was adjusted to the available data and the speaker diarization task. The architecture (a sequence-summarizing DNN) consists of a frame-level part composed of two BLSTM layers (with 128 cells each) and a fully-connected layer of 500 hidden units. Then, a pooling layer computes the mean and standard deviation over time of the outputs of the previous layer, followed by two fully connected layers (embeddings a and b, respectively) of 50 hidden units each and a softmax output layer with 3124 output units, working on an utterance-level basis. All the layers (except the output layer) use sigmoid non-linear activation. A graphical representation of the architecture is depicted in Figure 1.

The size of the output layer (3124) corresponds to the number of speakers considered in our training dataset. However, it should be pointed out that, due to the lack of actual speaker identification labels and the segmentation based on the RTTM labels for diarization, each recording was considered to have

different speakers than the rest (which is usually not the case). Then, even though two segments might be labeled as spoken by different speakers, they could belong to the same person.

Figure 1: Architecture used for the DNN-based embedding systems used as primary and contrastive 1 submissions to the speaker diarization evaluation (input sequence of 20-dimensional MFCCs, two BLSTM layers of 128 cells, a 500-unit fully connected layer, pooling of mean and standard deviation, embedding layers Emb_a and Emb_b of 50 units, and a 3124-speaker softmax output producing posterior probabilities; frame-by-frame processing followed by an utterance-level representation).

The DNN was trained using stochastic gradient descent to minimize the multi-class cross-entropy criterion for speaker classification. For training purposes, the network was fed with 3 s long sequences of 20-dimensional MFCC feature vectors.

After training, embeddings were extracted for each 3 s fragment of the development and test recordings, with a shift of 0.5 s. Each segment was forwarded through the network up to the first embedding layer (embedding a), providing a 50-dimensional embedding every 0.5 s (corresponding to 3 s sequences).

This system was implemented using Keras [6].

Embeddings obtained from this system were used for the primary system and the contrastive system 1, which differ in the clustering stage as described in Section 7.

5. I-vector System

As contrastive system 2, we used the classical total variability i-vector [2] modeling.

To develop this system, a UBM of 1024 Gaussian components was trained using the 60-dimensional MFCC+∆+∆∆ features described in Section 3, and a 50-dimensional total variability subspace was derived from the Baum-Welch statistics of the training segments (obtained according to the RTTM timestamps). The configuration was taken from previous speaker diarization systems developed in our research group.

After training, each development and test recording was processed in order to obtain a stream of i-vectors every 0.5 s (as with embeddings) with a sliding window of length 3 s.

This system was implemented using Kaldi [4].

The speaker diarization was performed using clustering on top of the resulting streams of i-vectors (see Section 7).

6. Music Detection

In order to discard segments where just music was present, we developed a music/speech classifier based on DNNs [7].

This system is trained using 150 h of audio from Google Audio Set [8], a dataset consisting of 10 second audio segments extracted from YouTube videos. Our architecture is composed of six bidimensional convolutional neural network (CNN) layers which operate on the Mel-spectrogram of the audio signals, followed by one LSTM layer and a final fully-connected layer prior to the output layer.

The output layer has 4 output units which correspond to the classification into music, speech, speech + music and none of them. For this evaluation we used just the probabilities of belonging to the music class in order to filter out segments that contain only music. This way, just the embeddings or i-vectors corresponding to speech (or speech with music) segments were used to perform the clustering stage.

In order to extract the probability of a segment containing music, Mel-spectrograms corresponding to test recordings have been computed and split into a stream of 10 second segments to fit the input size of the music/speech classifier. The separation between consecutive segments is 0.5 s in order to identify each Mel-spectrogram with an embedding or i-vector in the stream.

7. Agglomerative Hierarchical Clustering

In order to obtain the speaker diarization labels, we used Agglomerative Hierarchical Clustering (AHC) over the resulting stream of either embeddings or i-vectors (depending on the system) for a given development or test recording. Embeddings or i-vectors corresponding to music segments according to our music detector were discarded prior to the clustering step. This stage was implemented in Python using the scikit-learn toolbox [9].

Thus, AHC is applied to the resulting sequence of vectors corresponding to a specific audio file. We used cosine distance for i-vectors and Euclidean distance for embeddings. The number of clusters is controlled by the threshold of the distance to merge clusters, whose value was optimized on the development set. The linkage method used was the average of the vectors. This clustering was applied once for the contrastive systems. However, for the primary system we first applied AHC over the whole set of vectors with a lower threshold to allow a bigger number of clusters, and then a second AHC stage was applied to group the centroids of the previous clusters. This was done in order to help the clustering group speaker identities instead of vectors that are similar due to their closeness in time. The centroids were computed as the mean vector over all the points labeled as belonging to the same cluster in the first AHC stage.

Finally, for all the systems, we post-processed the clustering labels by filtering out clusters that grouped less than 10 s of audio. This was done in order to reduce false alarms in terms of clusters that do not group a different speaker but segments farther apart than the chosen threshold.

8. Development Results

Table 1 shows the results obtained by our systems on our development set (three files from the RTVE 2018 dev2 partition).

In our development set, both systems based on embeddings obtained similar results, especially in terms of missed and false alarm speaker time. The difference in these two metrics with respect to the third system (based on i-vectors) is also not significant, while the performance differs mainly due to the speaker error time. This might be related to the system settings and thresholds selected for the clustering, which would merge different speakers into one cluster or vice-versa.

Even though the i-vector system obtained better performance in our development dataset than the embedding-based systems, we submitted as primary system one of these systems due to the novelty of this technique with respect to the well-known i-vectors for speaker diarization and related tasks.
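The clustering described in Section 7 (AHC with a distance threshold, average linkage, music filtering and a 10 s minimum-duration filter) can be approximated with scikit-learn as sketched below. The threshold value, the music mask and the array names are placeholders, the double-AHC variant of the primary system is reduced to its second pass over cluster centroids, and, depending on the scikit-learn version, the distance argument is called affinity or metric. This is an illustrative sketch, not the authors' code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_embeddings(vectors, is_music, threshold, min_dur=10.0, step=0.5):
    """AHC over a stream of embeddings/i-vectors (one vector every `step` seconds).

    vectors:   (n_segments, dim) embeddings or i-vectors.
    is_music:  boolean array from the music detector; music segments are dropped.
    threshold: distance threshold that controls the number of clusters.
    """
    keep = ~np.asarray(is_music)
    kept = vectors[keep]

    # First AHC pass (use affinity="cosine" for i-vectors instead).
    labels = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold,
                                     linkage="average",
                                     affinity="euclidean").fit_predict(kept)

    # Optional second pass (primary system): re-cluster the centroids.
    centroids = np.stack([kept[labels == c].mean(axis=0) for c in np.unique(labels)])
    if len(centroids) > 1:
        merged = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold,
                                         linkage="average",
                                         affinity="euclidean").fit_predict(centroids)
        labels = np.array([merged[c] for c in labels])

    # Discard clusters covering less than min_dur seconds of audio.
    for c in np.unique(labels):
        if (labels == c).sum() * step < min_dur:
            labels[labels == c] = -1  # -1 marks filtered-out segments
    return keep, labels
```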

Table 1: Performance of the AuDIaS-UAM submission for the Albayzin IberSPEECH-RTVE speaker diarization evaluation over the development dataset (approximately 4 hours, 3 different recordings from 2 different shows). The performance is shown in % of scored speaker time.

System                                    Missed   False Alarm   Speaker Error   DER
Embeddings + double AHC (primary)         2.2      1.2           14.6            18.02
Embeddings + simple AHC (contrastive 1)   2.1      1.2           15.3            18.65
i-vectors + simple AHC (contrastive 2)    2.9      1.3           12.9            17.22
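The embedding extractor behind the first two rows of Table 1 follows the sequence-summarizing architecture of Section 4 (Figure 1). The sketch below is an illustrative reconstruction of that topology using the tf.keras API, not the authors' original code; the training loop, data pipeline and exact optimizer settings are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_embedding_model(n_speakers=3124, feat_dim=20):
    """BLSTM sequence-summarizing network: frame-level layers, mean/std pooling,
    two 50-unit embedding layers (emb_a is the one used for diarization) and a
    speaker softmax output."""
    inp = layers.Input(shape=(None, feat_dim))            # variable-length MFCC sequences
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inp)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.TimeDistributed(layers.Dense(500, activation="sigmoid"))(x)

    mean = layers.GlobalAveragePooling1D()(x)
    std = layers.Lambda(lambda t: tf.math.reduce_std(t, axis=1))(x)
    stats = layers.Concatenate()([mean, std])             # utterance-level statistics

    emb_a = layers.Dense(50, activation="sigmoid", name="emb_a")(stats)
    emb_b = layers.Dense(50, activation="sigmoid", name="emb_b")(emb_a)
    out = layers.Dense(n_speakers, activation="softmax")(emb_b)
    return Model(inp, out)

model = build_embedding_model()
model.compile(optimizer="sgd", loss="categorical_crossentropy")
# After training, embeddings are read from the "emb_a" layer:
extractor = Model(model.input, model.get_layer("emb_a").output)
```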

9. Acknowledgements
This work was supported by project DSSL: Redes Profundas y
Modelos de Subespacios para Detección y Seguimiento de Lo-
cutor Idioma y Enfermedades Degenerativas a partir de la Voz
(TEC2015-68172-C2-1-P), funded by Ministerio de Economı́a
y Competitividad, Spain and FEDER.

10. References
[1] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep
neural network embeddings for text-independent speaker verifica-
tion,” in Proceedings of Interspeech 2017, 2017.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet,
“Front-end factor analysis for speaker verification,” IEEE
Transactions on Audio, Speech & Language Processing,
vol. 19, no. 4, pp. 788–798, 2011. [Online]. Available:
http://dx.doi.org/10.1109/TASL.2010.2064307
[3] “Albayzin evaluation: Iberspeech-rtve 2018 speaker diariza-
tion challenge,” http://catedrartve.unizar.es/reto2018/EvalPlan-
SpeakerDiarization-v1.3.pdf.
[4] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz,
J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recogni-
tion toolkit,” in IEEE 2011 Workshop on Automatic Speech Recog-
nition and Understanding. IEEE Signal Processing Society, Dec.
2011, iEEE Catalog No.: CFP11SRW-USB.
[5] A. Lozano-Diez, O. Plchot, P. Matějka, and J. Gonzalez-Rodriguez,
“Dnn based embeddings for language recognition,” in Proceedings
of ICASSP, April 2018.
[6] “Keras: The python deep learning library,” https://keras.io/.
[7] D. de Benito Gorrón, “Detección de voz y música
en un corpus a gran escala de eventos de audio,”
https://repositorio.uam.es/handle/10486/684843, June 2018.
[8] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen,
W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set:
An ontology and human-labeled dataset for audio events,” in Proc.
IEEE ICASSP 2017, New Orleans, LA, 2017.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Per-
rot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,”
Journal of Machine Learning Research, vol. 12, pp. 2825–2830,
2011.


CENATAV Voice-Group Systems for Albayzin 2018 Speaker Diarization


Evaluation Campaign
Edward L. Campbell Hernández, Gabriel Hernández Sierra , José R. Calvo de Lara

Voice Group, Advanced Technologies Application Center, CENATAV, Havana, Cuba


ecampbell@cenatav.co.cu, gsierra@cenatav.co.cu, jcalvo@cenatav.co.cu

Abstract

Usually, the environment in which a voice signal is recorded is not ideal and, in order to improve the representation of the speaker characteristic space, it is necessary to use a robust algorithm, thus making the representation more stable in the presence of noise. A diarization system that focuses on the use of robust feature extraction techniques is proposed in this paper. The presented features (such as Mean Hilbert Envelope Coefficients, Medium Duration Modulation Coefficients and Power Normalization Cepstral Coefficients) were not used in other Albayzin Challenges. These robust techniques have a common characteristic, which is the use of a gammatone filter-bank for dividing the voice signal into sub-bands as an alternative to the classical triangular filter-bank used in Mel Frequency Cepstral Coefficients. The experimental results show a more stable Diarization Error Rate for robust features than for classic features.

Index Terms: Speaker Diarization, Robust feature extraction, Mean Hilbert Envelope Coefficients, Albayzin 2018 SDC

1. Introduction

This is the first participation of the CENATAV Voice Group in the Albayzin Challenges, participating in the Speaker Diarization Challenge (SDC) task and developing a diarization system focused on robust feature extraction. A Speaker Diarization System allows identifying "Who spoke when?" on an audio stream, which has been of interest for the scientific community since the last century, with the emergence of the first works on speaker segmentation and clustering [1][2]. Diarization can be used as a stage that enriches and improves the results of other systems, for example: a Rich Transcription System uses diarization to add the information about who is speaking to the speech transcription, and a Speaker Recognition System uses it when the test signal has several speakers, so diarization allows finding the segments of the test signal with only one speaker [3].

An issue of diarization is the environment where the speech is recorded, because noise is a natural condition in real applications. The proposed system is focused on robust feature extraction techniques for improving the results in a real application. Robust techniques such as Mean Hilbert Envelope Coefficients (MHEC), Medium Duration Modulation Coefficients (MDMC) and Power Normalization Cepstral Coefficients (PNCC) are analysed.

The system was mainly developed on the S4D tool [4], with the following structure: robust feature extraction, segmentation (Gaussian divergence and Bayesian Information Criterion), speech activity detection (Support Vector Machine), clustering (Hierarchical Agglomerative Clustering) and, as the last stage, re-segmentation (Viterbi algorithm). A system description is given in the next sections.

2. Robust Feature Extraction

A feature is robust when it has a stable effectiveness in both controlled and uncontrolled environments (noise, reverberation, etc.), the second condition being the most common in practice [5], so the use of a feature with this characteristic is relevant in real applications. A brief description of several robust feature extraction techniques is provided in this section. These techniques use a gammatone filter-bank (see Fig. 1), the design of which was based on Patterson's ear model [6], defining the impulse response at channel i by equation (1).

Figure 1: Gammatone filter-bank of 40 dimensions.

h_i(t) = γ · t^(τ−1) · exp(−2π · erb_i · t) · cos(2π · fc_i · t + θ)    (1)

where:
• γ: amplitude.
• τ: filter order.
• erb: equivalent rectangular bandwidth.
• fc_i: center frequency at channel i.
• θ: phase.

The following gammatone filter parameters were set in the proposed system, following Glasberg and Moore's recommendation [6], where:
• fc_i = −(EarQ · minB) + (fmax + EarQ · minB) / exp((i · 0.5) / EarQ)
• erb_i = ((fc_i / EarQ)^τ + minB^τ)^(1/τ)
• EarQ = 9.26449
• minB = 24.7
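Following the reconstruction of Eq. (1) and the centre-frequency and bandwidth expressions above, one channel of the filter-bank can be generated numerically as sketched below. The sampling rate, impulse-response length and fmax value are illustrative assumptions (the paper only fixes the 0-7000 Hz bandwidth in Table 1), and the exponent of the ERB expression is taken to be the filter order τ, as written above.

```python
import numpy as np

EARQ, MINB = 9.26449, 24.7        # Glasberg & Moore constants used above
GAMMA, TAU, THETA = 1.0, 4, 0.0   # amplitude, filter order, phase

def center_frequency(i, fmax=7000.0):
    """Centre frequency of channel i (Hz), per the expression above."""
    return -(EARQ * MINB) + (fmax + EARQ * MINB) / np.exp((i * 0.5) / EARQ)

def erb(fc):
    """Equivalent rectangular bandwidth for centre frequency fc (Hz)."""
    return ((fc / EARQ) ** TAU + MINB ** TAU) ** (1.0 / TAU)

def gammatone_ir(i, fs=16000, duration=0.05):
    """Impulse response h_i(t) of Eq. (1), sampled at fs for `duration` seconds."""
    t = np.arange(1, int(duration * fs)) / fs   # start at one sample to avoid t = 0
    fc = center_frequency(i)
    return (GAMMA * t ** (TAU - 1) * np.exp(-2 * np.pi * erb(fc) * t)
            * np.cos(2 * np.pi * fc * t + THETA))

# Impulse responses of a 40-channel filter-bank, as in Fig. 1
bank = np.stack([gammatone_ir(i) for i in range(40)])
```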

EarQ is the asymptotic filter quality at high frequencies and minB is the minimum bandwidth for the low frequency channels. The parameters θ, γ and τ were set to 0, 1 and 4, respectively.

2.1. Mean Hilbert Envelope Coefficients

A gammatone filter modulates a signal in amplitude and frequency [7], and demodulating the output signal is a way of recovering the transmitted information. Mean Hilbert Envelope Coefficients (MHEC) extract this information by applying the Hilbert Transform to estimate the analytic signal, separating the AM component from the modulated signal and assuming that the FM component does not exist [7]. The extraction process is shown in Figure 2.

Figure 2: Extraction process of MHEC.

A low-pass filtering is done on the estimated envelope in order to eliminate rapid changes of the signal, which are usually attributed to noise [8].

2.2. Medium Duration Modulation Coefficients

Medium Duration Modulation Coefficients (MDMC) are called "medium duration" because they employ a window of 52 ms, a larger length than the traditional windowing of 20-30 ms. However, a length of 25 ms is used in this proposal for efficiency purposes. This feature takes the same approach as MHEC: it demodulates the gammatone output signal [9]. However, in MDMC it is assumed that the FM component exists, applying the Teager Energy Operator (TEO) for estimating the AM component [9]. TEO is a non-linear operator that tracks the energy of a signal, which is a function of the amplitude and frequency [10]. Figure 3 shows the process for extracting the MDMC.

Figure 3: Extraction process of MDMC.

2.3. Power Normalization Cepstral Coefficients

Power Normalization Cepstral Coefficients (PNCC) do not apply any technique for demodulating or separating the AM-FM component at the gammatone filter-bank output; rather, the power at each sub-band is computed and transformed into the cepstral domain. There are two main approaches to PNCC, the short-time and medium-time approaches, the first being the approach used in this paper. For more details about this technique, see [11]. Figure 4 shows the process for computing PNCC.

Figure 4: Extraction process of PNCC.

3. Proposed Diarization System

Figure 5: Proposed Diarization System.

The proposed diarization system (see Figure 5) has six stages:

1. Robust feature extraction: based on MHEC, MDMC and PNCC as test features.
2. Gaussian Divergence: a sliding window of 2.5 seconds is applied along the signal; two Gaussians with diagonal covariance are computed from the left and right half of the window at each shift. A change point of acoustic classes (different speakers, music and noise) is detected at the middle point of the window when the Gaussian divergence score reaches a local maximum.
3. Bayesian Information Criterion (BIC): segmentation stage for decreasing the over-segmentation originated in the previous stage. A BIC approach is used for comparing consecutive segments and merging the pairs of segments that belong to the same acoustic class.
4. Support Vector Machine (SVM): an SVM [12] is applied as a speech/non-speech classifier to the resulting segments from the previous stage. The non-speech segments are deleted.
5. Hierarchical Agglomerative Clustering (HAC): the segments that belong to the same speaker are clustered by a traditional HAC, using BIC as the comparative measure.

6. Re-segmentation: an HMM is trained on the whole signal and a Viterbi re-segmentation is done for redefining the change points.

4. Experiment

The tested robust feature extraction, LFCC and LPCC algorithms are self-implementations, while the MFCC implementation is from the SIDEKIT tool [13] and the SVM from the pyAudioAnalysis tool [12]. The Gaussian Divergence, BIC, HAC and re-segmentation algorithms are from the S4D tool [4]. The proposed systems were submitted for the closed condition and they do not use training data, with the exception of the SVM, for which a portion of the Albayzin SDC 2016 Database was used. The experiment was developed on the RTVE-2018 SDC Development Database, tuning the default thresholds of BIC and HAC in S4D, and comparing robust (MHEC, MDMC, PNCC) and classic (MFCC, LFCC, LPCC) feature extraction techniques. Tables 1 and 2 show the best configuration for each feature. The pre-emphasis (0.97), window length (0.025 sec), window shift (0.01 sec), compression (logarithmic) and normalization (Cepstral Mean Normalization) are equal for each feature.

Table 1: Robust features configuration

Parameters              MHEC        MDMC        PNCC
Filter-bank dimension   40          40          40
Bandwidth (Hz)          0 - 7000    0 - 7000    0 - 7000
Cepstral coefficients   12 +∆+∆∆    12 +∆+∆∆    15 +∆+∆∆

Table 2: Classic features configuration

Parameters              MFCC        LFCC        LPCC
Filter-bank dimension   40          40          -
Bandwidth (Hz)          0 - 7500    0 - 7500    -
Prediction order        -           -           60
Cepstral coefficients   19 +∆       19 +∆       19 +∆+∆∆

5. Results

Figure 6 presents the results of the proposed diarization system, using several feature extraction techniques and separating these results for each TV show of the SDC Development Database, "La noche en 24H" and "Millennium".

The objective is to find a stable feature, where the Diarization Error Rate (DER) changes little between different signals. Figure 6 shows that the robust features are more stable than the classic features and that MHEC is the most stable among these robust features. The general system proposed is shown in Figure 5, and the primary and contrastive systems are based on the general system, using the following features:

• Primary system: MHEC. The most stable feature and the best performance on the "La noche en 24H" group.
• Contrastive system-1: MDMC. The best general performance.
• Contrastive system-2: LFCC. The best performance on the "Millennium" group.

Despite MFCC being the most used feature extraction in speech processing, this algorithm does not present a better performance than the robust features proposed in this paper.

Figure 6: Diarization Error Rate of the proposed system by feature.

The computational cost was computed in terms of the real-time factor. This measure represents the time necessary for processing one second of signal. The experiment was done on an Intel Core i3-3110M CPU 2.40 GHz x 4 with 3.7 GB of memory. The computational cost of each system submitted is shown in Table 3, the system with LFCC being the most efficient.

Table 3: Real-time factor of each system submitted

Primary   Contrastive-1   Contrastive-2
0.30      0.23            0.04

6. Conclusion

This paper was reported by the CENATAV Voice-Group, and submitted for the Albayzin 2018 Speaker Diarization Challenge. The main proposal was based on robust feature extraction, using MHEC as the primary feature extraction algorithm, with the objective of providing a stable system. The Diarization Error Rate of the primary system on the development data was 15.23 %, with a real-time factor of 0.3, being suitable for a real application.

7. Acknowledgments

Finally, we would like to thank the University of Zaragoza, in the person of professor Eduardo Lleida Solano, for his support during the experiment development. And to thank Wilma Recio for her help during the redaction of this paper.

8. References

[1] M. Diez, L. Burget, and P. Matejka, "Speaker diarization based on bayesian hmm with eigenvoice priors," in Odyssey 2018 The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables dOlonne, France, 2018.
[2] X. A. Miró, S. Bozonnet, N. W. D. Evans, C. Fredouille, G. Friedland, and O. Vinyals, "Speaker diarization: A review of recent research," IEEE Trans. Audio, Speech & Language Processing, vol. 20, no. 2, pp. 356–370, 2012. [Online]. Available: https://doi.org/10.1109/TASL.2011.2125954
[3] E. L. Campbell, G. Hernandez, and J. R. Calvo, "Diarization system proposal," in IV Conferencia Internacional en Ciencias Com-

putacionales e Informáticas (CICCI 2018), La Habana, Cuba,
2018.
[4] S. Meignier. (2015) S4D. [Online]. Available: http://www-
lium.univ-lemans.fr/s4d/
[5] N. T. Hieu, “Speaker diarization in meetings domain,” Ph.D. dis-
sertation, School of Computer Engineering of the Nanyang Tech-
nological University, 2014.
[6] M. Slaney, “An efficient implementation of the patterson-
holdsworth auditory filter bank,” Apple Computer, Tech. Rep.,
1993.
[7] S. O. Sadjadi and J. H. L. Hansen, “Mean hilbert envelope coef-
ficients (MHEC) for robust speaker and language identification,”
Speech Communication, vol. 72, pp. 138–148, 2015.
[8] E. L. Campbell, G. Hernandez, and J. R. Calvo, “Feature extrac-
tion of automatic speaker recognition, analysis and evaluation in
real environment,” in International Workshop of Artificial Intel-
ligent and Pattern Recognition 2018,Lecture Note of Computer
Science, 2018.
[9] V. Mitra, H. Franco, M. Graciarena, and D. Vergyri, “Medium-
duration modulation cepstral feature for robust speech recogni-
tion,” in IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014,
2014, pp. 1749–1753.
[10] A. Potamianos and P. Maragos, “A comparison of the energy op-
erator and the hilbert transform approach to signal and speech de-
modulation,” 1995.
[11] C. Kim and R. M. Stern, “Power-normalized cepstral coefficients
(PNCC) for robust speech recognition,” IEEE/ACM Trans. Audio,
Speech & Language Processing, vol. 24, no. 7, pp. 1315–1329,
2016.
[12] T. Giannakopoulos. (2018) pyaudioanalysis. [Online]. Available:
https://github.com/tyiannak/pyAudioAnalysis
[13] A. Larcher, S. Meignier, and K. A. LEE. (2017) Sidekit. [Online].
Available: http://www-lium.univ-lemans.fr/sidekit/
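The MHEC extraction of Section 2.1 (gammatone analysis, Hilbert envelope, low-pass smoothing, frame averaging and transformation to the cepstral domain) can be sketched as follows. The low-pass cutoff and the small flooring constant are assumptions; the paper only fixes the window length (25 ms), the shift (10 ms), logarithmic compression and 12 cepstral coefficients (Table 1). This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt
from scipy.fft import dct

def mhec(subband_signals, fs=16000, frame_len=0.025, frame_shift=0.010,
         lp_cutoff=20.0, n_ceps=12):
    """Mean Hilbert Envelope Coefficients from gammatone sub-band signals.

    subband_signals: (n_channels, n_samples) output of the gammatone filter-bank.
    Returns an (n_frames, n_ceps) cepstral matrix.
    """
    # 1) Hilbert envelope per channel, smoothed by a low-pass filter
    env = np.abs(hilbert(subband_signals, axis=1))
    b, a = butter(4, lp_cutoff / (fs / 2), btype="low")
    env = filtfilt(b, a, env, axis=1)

    # 2) Frame-wise mean of the envelope in each channel
    flen, fshift = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + (env.shape[1] - flen) // fshift
    frames = np.stack([env[:, k * fshift:k * fshift + flen].mean(axis=1)
                       for k in range(n_frames)])          # (n_frames, n_channels)

    # 3) Logarithmic compression and DCT to the cepstral domain
    return dct(np.log(frames + 1e-10), type=2, norm="ortho", axis=1)[:, :n_ceps]
```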


The Intelligent Voice System for the IberSPEECH-RTVE 2018


Speaker Diarization Challenge
Abbas Khosravani, Cornelius Glackin, Nazim Dugan, Gérard Chollet, Nigel Cannings

Intelligent Voice Limited, St Clare House, 30-33 Minories, EC3N 1BP, London, UK

Abstract

This paper describes the Intelligent Voice (IV) speaker diarization system for the IberSPEECH-RTVE 2018 speaker diarization challenge. We developed a new speaker diarization system built on the success of deep neural network based speaker embeddings in speaker verification systems. In contrast to acoustic features such as MFCCs, deep neural network embeddings are much better at discerning speaker identities, especially for speech acquired without constraints on recording equipment and environment. We perform spectral clustering on our proposed CNN-LSTM-based speaker embeddings to find homogeneous segments and generate speaker log-likelihoods for each frame. An HMM is then used to refine the speaker posterior probabilities by limiting the probability of switching between speakers when changing frames. We present results obtained on the development set (dev2) as well as the evaluation set provided by the challenge.

Index Terms: speaker diarization, CNN, LSTM, IberSPEECH-RTVE, speaker embedding

1. Introduction

Speaker diarization is the task of marking speaker change points in a speech audio and categorizing the segments of speech bounded by these change points according to the speaker identity. Speaker diarization, also referred to as determining who speaks when, is an important front-end processing step for a wide variety of applications including information retrieval and automatic speech recognition (ASR).

The speaker diarization performance largely varies by the application scenario, including broadcast news audio, interview speech, meeting audio (with distant microphones), pathological speech, child speech, and conversational telephone speech. Each domain presents a different quality of the recordings (bandwidth, microphones, noise), types of background sources, number of speakers, duration of speaker turns, as well as style of the speech. Apart from its similarity to speaker identification in classifying homogeneous segments as spoken by the same or different speakers, each domain presents unique challenges to speaker diarization. Since there is no prior knowledge regarding the speakers involved in an audio recording, the number of speakers as well as the approximate speaker change points also need to be addressed. Moreover, speech segments may be of very short duration, making i-vectors, the most common speaker representation, not an appropriate option [1].

The most widely used approach in speaker diarization involves: (1) speech segmentation, where speech activity detection (SAD) or speaker change point detection is used to find rough boundaries around each speaker's regions of speech; (2) segment clustering, where same-speaker segments as well as the number of speakers are determined; and finally (3) re-segmentation, where the boundaries are further refined to produce the diarization results. The first stage intends to divide the speech into short segments of a few seconds with a single dominant speaker. Thus, the quality of the speech activity detection plays an important role in the performance [2, 3]. Using word boundaries generated by an Automatic Speech Recognition (ASR) system to produce homogeneous short segments has been proposed in [4]. Neural-network based approaches have also been investigated in [5]. The basic principle of detecting speaker turns is to use a short duration window of a few seconds and a similarity measure to decide whether a speaker change has occurred or not. This window then slides frame by frame over the entire audio to mark all the potential speaker change points. This window needs to be long enough to include speaker identifying information (usually 1-2 seconds long). The most common measures for change detection include the Bayesian Information Criterion (BIC) [6], local Gaussian Divergence [7] and more recently deep neural networks [8, 9, 10].

In the clustering stage, features extracted from same-speaker segments should ideally go to one cluster. These features could be the basic MFCCs, speaker factors [11], i-vectors [12, 13], or more recently d-vectors [14]. For many years, i-vector based systems have been the dominating approach in speaker verification [15]. However, thanks to the recent progress of deep neural networks, neural network based audio embeddings (d-vectors) can significantly outperform previously state-of-the-art techniques based on i-vectors, especially on short segments [16, 17, 18, 19, 20]. The LSTM recurrent neural network [21] has been incorporated to produce speaker embeddings for the task of speaker verification [19] and later for speaker diarization [14]. In this work, we explore a similar text-independent d-vector based approach to speaker diarization. The proposed speaker embeddings obtained using deep neural networks at the frame level are used for the task of speaker diarization.

The predominant approach for clustering is Agglomerative Hierarchical Clustering (AHC) [6], where a stopping criterion can be used to determine the number of clusters. Other popular clustering approaches include Gaussian Mixture Models (GMMs) [22], mean-shift clustering [23] and spectral clustering [13, 22, 14]. We found the spectral clustering algorithm, which relies on analyzing the eigen-structure of an affinity matrix, to be effective in our proposed diarization framework.

In the final stage, the initial rough boundaries are refined. This is usually done using the Viterbi algorithm at the acoustic feature level. For each cluster a Gaussian Mixture Model (GMM) is estimated to calculate posterior probabilities of speakers at the frame level, and the process usually iterates until convergence. The Viterbi re-segmentation on raw MFCC features was found to be effective [11]. Re-segmentation in a factor analysis subspace has also been investigated in [24, 11]

Figure 1: System diagram for the proposed diarization system.

where the Variational Bayes (VB) system [25] proved to be the most effective. This approach implicitly performs a soft speaker clustering in a way which avoids making premature hard decisions. An extension to this was proposed at the 2013 Center for Language and Speech Processing (CLSP) Summer Workshop at Johns Hopkins University, where temporal continuity was modeled by an HMM. This extension constrains speaker transitions and defines the speaker posterior probabilities [24]. In this paper, we intend to perform diarization with the help of deep neural network speaker embeddings. The log likelihoods for the frame-level speaker embeddings are estimated using a spectral clustering algorithm; in other words, the HMM serves as a re-segmentation stage for the spectral clustering algorithm.

The remainder of this paper is organized as follows. We outline the specific architecture of our proposed diarization system in Section 2, which includes CNN-LSTM based speaker embedding extraction, clustering and re-segmentation. Experimental results on the development set of the IberSPEECH-RTVE 2018 speaker diarization challenge are presented in Section 3. We conclude the paper in Section 5.

2. Diarization Framework

In this section, we review the key sub-tasks used to build the current speaker diarization system based on DNN embeddings. The flowchart of our diarization system is shown in Fig. 1.

2.1. Acoustic Features

For speech parameterization we used 20-dimensional Mel-Frequency Cepstral Coefficients (MFCCs). These features are extracted at 8 kHz sample frequency using the Kaldi toolkit with a 25 ms frame length and 10 ms overlap. For each utterance, the features are centered using a short-term (3 s window) cepstral mean and variance normalization (ST-CMVN).

2.2. Speaker Embeddings

i-vector based systems have been the dominating approach for both speaker verification and diarization applications. However, with the recent success of deep neural networks, a lot of effort has been put into learning fixed-dimensional speaker embeddings (d-vectors) using end-to-end network architectures that can be more effective than i-vectors on short segments [26, 20, 19, 27]. We employed a generalized end-to-end model using convolutional neural networks (CNNs) and an LSTM recurrent neural network. The network architecture is shown in Fig. 2. It consists of two CNN layers and two dense layers at frame level, followed by a bi-directional LSTM and two dense layers at utterance level. The LSTM layer maps a sequence of input feature vectors into an embedding vector. The output of the LSTM layer is then followed by two more dense layers and a length-normalization layer to produce a fixed-dimensional representation for the input segment. Training is based on processing a large number of utterances in the form of a batch that contains N speakers and M utterances each. Each utterance could be of arbitrary duration, but to train the network in batches they need to be of the same duration. We used variable-length speech segments ranging from 10 to 20 seconds and constructed batches with 30 speakers, each having 10 different segments. In the loss layer, a generalized end-to-end (GE2E) loss [19] builds a similarity matrix defined by the cosine similarity between each pair of input utterances. During training, we want the embedding of each utterance to be similar to the centroid of all that speaker's embeddings and, at the same time, far from other speakers' embeddings. A detailed description of GE2E training can be found in [19].

During evaluation, for every test utterance we apply a sliding window of 5 seconds (500 frames) with 80 percent overlap, so we obtain a speaker embedding every 100 frames. We compute the d-vector for each window. No speech activity detection has been used for this processing. Finally, principal component analysis (PCA) is applied to reduce the dimension of the resulting length-normalized embeddings (we used 8 dimensions in our experiments) so that they are ready for clustering.

2.3. Clustering

We employed spectral clustering, which is able to handle unknown cluster shapes. It is based on analyzing the eigenstructure of an affinity matrix; a more detailed analysis of the algorithm is presented in [28]. We used a Euclidean distance measure to form a nearest-neighbor affinity matrix on the frame-level embeddings. To estimate the number of clusters, a simple heuristic based on the eigenvalues of the affinity matrix is used [13]. To mitigate the computational complexity of the spectral clustering, especially when the number of frames is very large, we can subsample the embeddings at a specific rate.

To estimate the number of speakers, a simple heuristic based on the Calinski & Harabasz criterion [29] has been incorporated: we cluster the d-vectors with the K-means algorithm for different values of the cluster number and choose the one that maximizes the Calinski & Harabasz score.

2.4. Re-segmentation

The clustering algorithms are typically followed by a re-segmentation algorithm that refines the speaker transition boundaries.
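As a minimal illustration of the evaluation-time embedding extraction described in Section 2.2 (a sketch, not the authors' code: the embed() function standing in for the trained CNN-LSTM network is assumed; only the window and hop sizes follow the text):

import numpy as np
from sklearn.decomposition import PCA

def extract_dvectors(features, embed, win=500, hop=100):
    """Slide a 5 s (500-frame) window with 80% overlap (100-frame hop) over the
    utterance features and embed each window; 'embed' is assumed to be the
    trained CNN-LSTM network returning a length-normalized d-vector."""
    dvecs = [embed(features[s:s + win]) for s in range(0, len(features) - win + 1, hop)]
    return np.stack(dvecs)

def reduce_dim(dvectors, dim=8):
    """PCA to 8 dimensions before clustering, as described in Section 2.2."""
    return PCA(n_components=dim).fit_transform(dvectors)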

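The speaker-count heuristic of Section 2.3 can be sketched with scikit-learn; the candidate range of cluster numbers is illustrative and the actual implementation may differ:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def estimate_num_speakers(embeddings, min_k=2, max_k=10):
    """Pick the K-means cluster count that maximizes the Calinski & Harabasz
    score, as described in Section 2.3 (illustrative sketch)."""
    best_k, best_score = min_k, -np.inf
    for k in range(min_k, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = calinski_harabasz_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Example with hypothetical 8-dimensional PCA-reduced d-vectors, one per 100 frames:
# n_speakers = estimate_num_speakers(np.random.randn(500, 8))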
Figure 2: Deep neural network architecture used to extract speaker embeddings.

This could be either in the feature space, like MFCCs, or in the factor analysis subspace [24]. Speaker diarization in the factor analysis space allows us to take advantage of speaker-specific information. By contrast, lower-level acoustic features such as MFCCs are not quite as good for discerning speaker identities, but they can provide sufficient temporal resolution to witness local speaker changes. The proposed framework for diarization provides a stronger speaker representation at the frame level. As a result, when combined with an HMM that refines the speaker posterior probabilities by limiting the speaker transitions [24], the system is able to detect speaker change points. The speaker log likelihoods for the HMM are computed by the spectral clustering algorithm as described in Section 2.3.

Table 1: DER (%) on the development data (dev2) as well as evaluation data of the IberSPEECH-RTVE 2018 speaker diarization challenge.

        DER(%)   Err(%)   FA(%)   Miss(%)
dev2    15.96    10.5     3.6     1.8
eval    30.96    25.2     4.8     0.9

3. Experiments

3.1. Training Data

Switchboard corpora (LDC2001S13, LDC2002S06, LDC2004S07, LDC98S75, LDC99S79) and NIST SRE 2004-2010, which consist of conversational telephone and microphone speech data at 8 kHz sample frequency from around 5k speakers, were used for training the system. Augmentation increases the amount and diversity of the existing training data. Our strategy employs additive noises and reverberation. Reverberation involves convolving room impulse responses (RIR) with audio. We use the simulated RIRs described in [30], and the reverberation itself is performed with the multi-condition training tools in the Kaldi ASpIRE recipe [31]. For additive noise, we use the MUSAN dataset, which consists of over 900 noises, 42 hours of music from various genres and 60 hours of speech from twelve languages [32]. Both MUSAN and the RIR datasets are freely available from http://www.openslr.org. We use a 3-fold augmentation that combines the original "clean" training list with two augmented copies [33]. To augment a recording, we choose one of the following at random:

• babble: Three to seven speakers are randomly picked from MUSAN speech, summed together, then added to the original signal (13-20dB SNR).
• music: A single music file is randomly selected from MUSAN, trimmed or repeated as necessary to match duration, and added to the original signal (5-15dB SNR).
• noise: MUSAN noises are added at one-second intervals throughout the recording (0-15dB SNR).
• reverb: The training recording is artificially reverberated via convolution with simulated RIRs.

3.2. Performance Metrics

We measured performance with the Diarization Error Rate (DER), the standard metric for diarization. It is measured as the total percentage of reference speaker time that is not correctly attributed to a speaker. More concretely, DER is defined as:

  DER = (FA + MISS + ERROR) / TOTAL    (1)

where FA is the total system speaker time not attributed to a reference speaker, MISS is the total reference speaker time not attributed to a system speaker, and ERROR is the total reference speaker time attributed to the wrong speaker. Following the traditional conventions used in evaluating diarization performance [11], a forgiveness collar of 0.25 seconds is applied before and after each reference boundary prior to scoring. The DER is reported based on the NIST RT Diarization evaluations [34].

4. Results

The IberSPEECH-RTVE 2018 Speaker Diarization is a new challenge in the ALBAYZIN evaluation series. This evaluation consists of segmenting broadcast audio documents according to different speakers and linking those segments which originate from the same speaker. We used two Intel Xeon CPUs (E5-2670 @ 2.60GHz, 8 cores each), 64 GB of DDR3 memory, 400 GB of disk storage and an NVIDIA TITAN X GPU (12 GB of memory) to train the network. The Keras API with a TensorFlow backend has been used for system development. Training takes almost a week to process around half a million segments of 10-20 seconds. Processing a single 20-minute recording takes around 7 seconds. We report the performance of our proposed diarization framework on the development set (dev2), using the provided speaker marks, and the result of the submitted system on the evaluation set in Table 1. Our system was trained on publicly available data which is completely different from both the development and evaluation data (open-set condition). The results indicate the effectiveness of the proposed approach on challenging domains.
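As a worked example of Eq. (1) (not part of the original system), the DER can be computed from accumulated durations; here it is fed with the percentage breakdown of the dev2 row of Table 1:

def diarization_error_rate(fa, miss, spk_error, total):
    """DER as in Eq. (1). All arguments are durations (or percentages of the
    total reference speaker time): fa = false alarm speaker time, miss = missed
    speaker time, spk_error = time attributed to the wrong speaker."""
    return 100.0 * (fa + miss + spk_error) / total

# dev2 row of Table 1: 3.6% FA + 1.8% Miss + 10.5% speaker error
print(diarization_error_rate(3.6, 1.8, 10.5, 100.0))  # 15.9, matching the reported 15.96 up to rounding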

5. Conclusion

The IberSPEECH-RTVE 2018 Speaker Diarization has proven to be a highly challenging contest, especially in the detection of the number of speakers and in dealing with background noise. We have presented our system and reported the results on the development set as well as the evaluation set of the challenge. We found deep neural network embeddings much better at discerning speaker identities, especially for speech acquired without constraints on recording equipment and environment. Our strategy of employing additive noises and reverberation for data augmentation plays an important role in the success of our system on this challenging domain. We will perform research on the evaluation set once the labels are released to gain insights on the real effects of the approaches presented in the paper.

6. Acknowledgement

The research leading to the results presented in this paper has been (partially) funded by the EU H2020 research and innovation program under grant number 769872.

7. References

[1] A. Khosravani and M. M. Homayounpour, "Nonparametrically trained plda for short duration i-vector speaker verification," Computer Speech & Language, vol. 52, pp. 105–122, 2018.
[2] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First dihard challenge evaluation plan," 2018. [Online]. Available: https://zenodo.org/record/1199638
[3] X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, and J. Hernando, "Hybrid speech/non-speech detector applied to speaker diarization of meetings," in Speaker and Language Recognition Workshop, 2006. IEEE Odyssey 2006. IEEE, 2006, pp. 1–6.
[4] D. Dimitriadis and P. Fousek, "Developing on-line speaker diarization system," in Proc. Interspeech, 2017, pp. 2739–2743.
[5] S. Thomas, G. Saon, M. Van Segbroeck, and S. S. Narayanan, "Improvements to the ibm speech activity detection system for the darpa rats program," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4500–4504.
[6] S. Chen, P. Gopalakrishnan et al., "Speaker, environment and channel change detection and clustering via the bayesian information criterion," in Proc. DARPA broadcast news transcription and understanding workshop, vol. 8. Virginia, USA, 1998, pp. 127–132.
[7] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic segmentation, classification and clustering of broadcast news audio," in Proc. DARPA speech recognition workshop, vol. 1997, 1997.
[8] V. Gupta, "Speaker change point detection using deep neural nets," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4420–4424.
[9] R. Yin, H. Bredin, and C. Barras, "Speaker change detection in broadcast tv using bidirectional long short-term memory networks," in Proc. Interspeech 2017, 2017, pp. 3827–3831.
[10] M. Hrúz and Z. Zajíc, "Convolutional neural network for speaker change detection in telephone speaker diarization system," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4945–4949.
[11] P. Kenny, D. Reynolds, and F. Castaldo, "Diarization of telephone conversations using factor analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, pp. 1059–1070, 2010.
[12] G. Sell and D. Garcia-Romero, "Speaker diarization with plda i-vector scoring and unsupervised calibration," in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 413–417.
[13] S. Shum, N. Dehak, and J. Glass, "On the use of spectral and iterative methods for speaker diarization," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[14] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with lstm," arXiv preprint arXiv:1710.10468, 2017.
[15] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[16] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proc. Interspeech 2017, pp. 999–1003, 2017.
[17] C. Zhang and K. Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. of Interspeech, 2017.
[18] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5115–5119.
[19] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467, 2017.
[20] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matejka, and L. Burget, "End-to-end dnn based speaker recognition inspired by i-vector and plda," arXiv preprint arXiv:1710.02369, 2017.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, "Unsupervised methods for speaker diarization: An integrated and iterative approach," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.
[23] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, "A study of the cosine distance-based mean shift for telephone speech diarization," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 1, pp. 217–227, 2014.
[24] G. Sell and D. Garcia-Romero, "Diarization resegmentation in the factor analysis subspace," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4794–4798.
[25] F. Valente and C. Wellekens, "Variational bayesian methods for audio indexing," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 307–319.
[26] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4930–4934.
[27] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[28] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in neural information processing systems, 2002, pp. 849–856.
[29] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics-theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.
[30] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5220–5224.
[31] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The kaldi speech recognition toolkit," in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.

[32] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and
noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[33] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan-
pur, “X-vectors: Robust dnn embeddings for speaker recognition,”
Submitted to ICASSP, 2018.
[34] “The 2009 (rt-09) rich transcription meeting recognition eval-
uation plan,” http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/
rt09-meeting-eval-plan-v2.pdf, accessed on June 2, 2016.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

JHU Diarization System Description


Zili Huang, Paola Garcı́a, Jesús Villalba, Daniel Povey, Najim Dehak

Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
{hzili1,pgarci27,jvillal7}@jhu.edu,dpovey@gmail.com,ndehak3@jhu.edu

Abstract

We present the JHU system for the Iberspeech-RTVE Speaker Diarization Evaluation. This assessment combines Spanish language and broadcast audio in the same recordings, conditions in which our system has not been tested before. To tackle this problem, the pipeline of our general system, developed entirely in Kaldi, includes acoustic feature extraction, a SAD, an embedding extractor, a PLDA and a clustering stage. This pipeline was used for both the open and the closed conditions (described in the evaluation plan). All the proposed solutions use wide-band data (16 kHz) and MFCCs as their input. For the closed condition, the system trains a DNN SAD using the Albayzin2016 data. Due to the small amount of data available, i-vector embedding extraction was the only approach explored for this task. The PLDA training utilizes Albayzin data, followed by Agglomerative Hierarchical Clustering (AHC) to obtain the speaker segmentation. The open condition employs the DNN SAD obtained in the closed condition. Four types of embeddings were extracted: x-vector-basic, x-vector-factored, i-vector-basic and BNF-i-vector. The x-vector-basic is a TDNN trained on augmented Voxceleb1 and Voxceleb2. The x-vector-factored is a factored TDNN (TDNN-F) trained on SRE12-micphn, MX6-micphn, VoxCeleb and SITW-dev-core. The i-vector-basic was trained on Voxceleb1 and Voxceleb2 data (no augmentation). The BNF-i-vector is a BNF-posterior i-vector trained with the same data as x-vector-factored. The PLDA training for the new scenario uses the Albayzin2016 data. The four systems were fused at the score level. Once again, AHC computed the final speaker segmentation.

We tested our systems on the Albayzin2018 dev2 data and observed that the SAD is important to improve the results. Moreover, we noticed that x-vectors were better than i-vectors, as already observed in previous experiments.

Index Terms: speaker diarization, DNN, SAD, i-vectors, x-vectors, PLDA

1. Introduction

We present JHU's speaker diarization system for the Iberspeech Diarization Evaluation. Our main goal is to test our current diarization system on other databases and possible scenarios. We provide a system solution for the open and closed condition scenarios, using Kaldi. We follow a basic pipeline with specific characteristics for each scenario. The pipeline for our system is described as follows:

• Audio feature extraction
• Speech activity detection (SAD)
• Embedding extraction
• PLDA
• Score fusion (if possible)
• Clustering

Each of these items will be discussed in detail in the following sections.

2. Datasets

For the closed condition we only used Albayzin2016 to train the SAD, the Universal Background Model (UBM), the i-vector extractor and the PLDA models.

The datasets used for training in the open condition are:

• Data set 1: Voxceleb1 and Voxceleb2 data (no augmentation) [1][2].
• Data set 2: SRE12-micphn, MX6-micphn, VoxCeleb and SITW-dev-core.
• Data set 3: Voxceleb1 and Voxceleb2 with augmentation.
• Data set 4: Fisher database (the only database not in 16 kHz).
• Albayzin2016: In-domain data set used in previous evaluations, which includes the Aragon Radio database (20 hours) and the 3/24 TV channel database (87 hours).
• Albayzin2018: RTVE broadcast data, which includes train, dev1, dev2 and test. The speaker diarization labels are only offered for dev2, so the dev2 part is used for development purposes and for tuning our system.

3. Feature Extraction and Speech Activity Detection

All the systems employ wide-band data (16 kHz), except data set 4, which is in 8 kHz and is used to train the bottleneck DNN. MFCCs were extracted with a 25 ms window and 10 ms frame rate. For the open condition, the MFCC feature dimension of x-vector-basic, x-vector-factored, i-vector-basic and BNF-i-vector is 30, 40, 24 and 40, respectively. For the closed condition, the MFCC configuration is the same as for i-vector-basic.

3.1. Speaker Activity Detection

After computing the MFCCs for each case, the system trained a TDNN SAD model on the Albayzin2016 labeled data following the ASpIRE recipe in Kaldi [3]. The network consists of 5 TDNN layers and 2 layers of statistics pooling [4]. The overall context of the neural network is around 1 s, with around 0.8 s of left context and 0.2 s of right context. This approach is suitable for our purposes since it can include a wider context without affecting the number of parameters. For this special case, we trained the DNN with two classes: speech and non-speech. The speech segments include both clean voice and voice with noises. Other parts of the audio are considered as non-speech, which may include music, noise and silence. A simple Viterbi decoding on an HMM with duration constraints of 0.3 s for speech and 0.1 s for silence is used to get speech activity labels for the test data recordings.
• Clustering 1 except set 4 that is in 8 KHz to train bottleneck DNN

236 10.21437/IberSPEECH.2018-49
The energy-based SAD was also tried in our experiments, but the results were worse overall.

4. Embeddings

We computed two different sets of embeddings depending on the condition. For the closed condition we focused on i-vectors. This approach computes i-vectors in the traditional way; it trains a T-matrix with Albayzin2016-only data. Afterwards, we obtained the i-vectors for the Albayzin2018 dev2 and test sets. We tried other DNN possibilities, but due to the small amount of data available the results were not promising.

For the open condition we examined four types of embeddings. The i-vector-basic, trained on data set 3, obtained baseline results for Albayzin2018 dev2 and test. These i-vectors are of dimension 400.

The BNF-i-vectors (of dimension 600) use the bottleneck features computed from data set 2 to refine the GMM alignments. The rest of the i-vector pipeline remains the same; the T-matrix was also trained on data set 2.

We explored two types of DNN-based embedding architectures. The first one, the default Kaldi recipe for Voxceleb, is a TDNN for x-vector-basic [5, 6]. In this approach, each MFCC frame is passed through a sequence of TDNN layers. Then, a pooling layer accounts for the utterance-level processing and computes the mean and standard deviation of the TDNN output over time. This intermediate representation, known as the embedding, is projected to a lower dimension (512 in this case). The DNN outputs are the posterior probabilities of the training speakers. The objective function is cross entropy. We employed data set 3 for training the TDNN. The augmentation is performed as described in [7] using MUSAN noises (http://www.openslr.org/resources/17).

For the second x-vector approach, the pre-pooling layers are changed to factorized TDNNs (TDNN-F) with skip connections [8]. This new architecture reduces the number of parameters in the network by factorizing the weight matrix of each TDNN layer into the product of two low-rank matrices. The first factor is forced to be semi-orthogonal, which prevents the loss of information when projecting from high to low dimension. As in other architectures, skip connections are an option for this TDNN-F: some layers receive as input the output of the previous layer and of other prior layers. The best solution so far is to have skip connections between low-rank interior layers in the TDNN-F. These x-vectors are of dimension 600.

5. PLDA and Score Fusion

For the closed condition we observed that the number of speakers estimated by the current approach was very high for the Albayzin2018 dev2 set. We decided to use PCA as in [9]. With this tuning strategy the system was able to take into account every recording for the PCA rotation, instead of only the global PCA. This strategy also maintained the number of speakers in a desirable range.

For the open condition, we used the traditional PLDA workflow, and the PLDA was trained on the Albayzin2016 data. We obtained 4 different types of scores that addressed the four types of embeddings. We fused the four systems with equal weights.

6. Clustering

The system performed Agglomerative Hierarchical Clustering (AHC) to obtain a segmentation of the recordings following the Callhome diarization recipe [3]. To obtain an accurate estimation of the number of speakers and a better speaker segmentation, the system scans several thresholds until it finds an optimum on a hold-out dataset. We evaluated this approach using the Albayzin2018 dev2 dataset.

7. Experiments

In this section, we describe some experiments that give us a clue of the overall performance of our system. We evaluated our systems with the Diarization Error Rate (DER), which is the most common metric for speaker diarization. The diarization error can be decomposed into speaker error, false alarm speech, missed speech and overlap speaker. Our DER tolerated errors within 250 ms of a speaker transition and only scored the non-overlapping parts of the segments because our model outputs a single label for each frame. Our systems were evaluated on Albayzin2018 dev2.

We employed the Albayzin2018 dev2 set as the initial part of our experiments. This set was divided into two parts, and we tuned the parameters on one part and computed the DER performance on the other, similarly to the Kaldi Callhome diarization recipe [3].

The DER results of the different systems for the open and closed conditions are shown in Table 1 and Table 2, respectively. We compared three different calibration strategies: supervised calibration, oracle calibration and "more than 10s". In the supervised calibration we chose the optimal thresholds on the held-out cross-validation set. For oracle calibration, we used the oracle number of speakers for AHC. However, unlike traditional speaker diarization datasets such as CALLHOME, the utterances in the Albayzin2018 dev2 set were long and contained more speakers. The speech segments of some speakers were so limited that we did not want to create a new cluster for these speakers. The third column shows the DER results when we clustered with the oracle number of speakers that have more than 10 seconds of speech. It should be noted that, since the number of speakers is unknown for the test set, our final submission only used supervised calibration, and we report the DER results of oracle calibration on the development set just for reference.

As shown in Table 1, x-vector based systems outperform the i-vector based ones, which is consistent with previous studies. Among the four systems, the TDNN-F based x-vector performs the best. It outperforms the basic x-vector and the i-vector by 1.27% and 3.90% absolute. Equal-weighted score fusion further reduces the DER to 9.39%. It is interesting that the DER performance of x-vector based systems degrades when clustering with the actual number of speakers, while it improves for the i-vector based systems. This indicates that i-vector based systems require more prior knowledge of the number of speakers.

Table 2 shows the DER results for the closed condition. The i-vector system achieves a DER of 24.03%, which is further improved to 22.26% if clustering with the oracle number of speakers. However, the performance of the x-vector is not as good as the i-vector. We believe the reason is that we could not obtain enough data to train a discriminative neural network. Even after data augmentation with the music, noise and speech we extracted from the Albayzin2016 dataset, the training set only contained 332 hours of speech, which is much smaller than the usual amount of data used to train an x-vector system. Besides, since the recordings were from TV programs, a large number of speakers did not have enough material. The score fusion did not improve the system performance for the closed condition.
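The AHC stage with threshold scanning described in Section 6 can be approximated as follows. This sketch uses SciPy's hierarchical clustering as a stand-in for the Kaldi/Callhome recipe, and score_fn (e.g. a DER computation on a held-out part of dev2) is assumed rather than provided:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ahc_labels(similarity, threshold):
    """Cluster segment embeddings by AHC from a pairwise similarity matrix
    (e.g. PLDA scores, higher = same speaker). Illustrative sketch only."""
    distance = -similarity                 # turn similarities into distances
    distance -= distance.min()             # make them non-negative
    condensed = distance[np.triu_indices_from(distance, k=1)]
    link = linkage(condensed, method="average")
    return fcluster(link, t=threshold, criterion="distance")

def tune_threshold(similarity, thresholds, score_fn):
    """Scan candidate stopping thresholds and keep the one that minimizes
    score_fn on a hold-out recording, as described in Section 6."""
    return min(thresholds, key=lambda t: score_fn(ahc_labels(similarity, t)))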

Table 1: DER (%) comparison of different systems for the open condition

system               supervised calibration    oracle calibration    more than 10s

x-vector-basic       13.39                     13.46                 13.68
x-vector-factored    12.12                     13.48                 13.12
i-vector-basic       16.02                     15.69                 15.08
BNF-i-vector         16.40                     15.88                 14.83
fusion                9.39                     10.83                 11.87

Table 2: DER (%) comparison of different systems for the closed condition

system               supervised calibration    oracle calibration    more than 10s

i-vector             24.03                     22.26                 23.46
x-vector             34.69                     35.58                 34.85
fusion               25.34                     22.39                 22.76
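The "fusion" rows in Tables 1 and 2 correspond to combining the four per-system PLDA score matrices with equal weights before AHC (Section 5). A minimal sketch of such a fusion follows; the per-system z-normalization is an added assumption here to keep the score scales comparable, and the actual JHU implementation may differ:

import numpy as np

def fuse_scores(score_matrices):
    """Equal-weight fusion of per-system pairwise PLDA score matrices."""
    normed = [(s - s.mean()) / (s.std() + 1e-8) for s in score_matrices]
    return np.mean(normed, axis=0)

# Hypothetical usage with four score matrices of identical shape:
# fused = fuse_scores([xvec_basic, xvec_factored, ivec_basic, bnf_ivec])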

From our experiments, we also observe that the SAD is of vital importance. Since we do not know the oracle SAD marks, the quality of the SAD is directly associated with the DER performance. Three different SAD models were evaluated, among which the 5-layer TDNN model trained on in-domain data performs the best. It outperforms the TDNN model trained on Librispeech with the same network architecture by 2.27% absolute. The energy-based SAD is simple, but its performance is worse than the TDNN models by a large margin. In our final system, we use the TDNN SAD trained on Albayzin2016 for both the open and closed conditions.

Table 3: DER (%) of the basic x-vector system with different SAD

SAD                                  DER
energy based SAD                     21.52
TDNN SAD trained on Librispeech      15.66
TDNN SAD trained on Albayzin2016     13.39

8. Future Work

Although we largely reduce the diarization error with in-domain SAD and system fusion, there are still many problems to investigate. The first is the overlap problem. Our current system cannot handle overlapping speech, since it predicts a single speaker label for each frame. However, solving this problem is not easy; the whole diarization procedure might have to change to predict multiple labels per frame. Second, as discussed above, the number of speakers estimated by the supervised calibration is not very close to the actual number. Besides, clustering with the oracle number of speakers sometimes even degrades the system. Whether there exist better methods to control the clustering process, especially for conditions with many speakers, requires further study. Third, due to the time limit, we did not include a re-segmentation process in our system. We will add this part later to see if it can further boost the system performance.

9. Conclusions

This is the submission of the JHU Diarization system. We tried our out-of-the-box system in this new scenario, which contained broadcast news in a new language. Two main solutions were proposed for the closed and the open conditions: i-vector and x-vector. The i-vector was shown to be best suited for the closed condition, due to the small amount of training data. For the open condition, the best results were obtained by the x-vector based system. However, applying score fusion before the clustering gave noticeable improvements. We are still planning to do some re-segmentation in future versions.

10. Acknowledgements

The authors would like to thank David Snyder for his help in this project.

11. References

[1] A. Nagrani, J. S. Chung, and A. Zisserman, "Voxceleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[2] J. S. Chung, A. Nagrani, and A. Zisserman, "Voxceleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[3] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011. Waikoloa, HI, USA: IEEE, Dec. 2011, pp. 1–4.
[4] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur, "Acoustic modelling from the signal domain using CNNs," in INTERSPEECH, 2016, pp. 3434–3438.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. Alberta, Canada: IEEE, Apr. 2018, pp. 5329–5333.
[6] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge," in Proceedings of the 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, Hyderabad, India, Sep. 2018, pp. 2808–2812. [Online]. Available: http://www.danielpovey.com/files/2018_interspeech_dihard.pdf http://dx.doi.org/10.21437/Interspeech.2018-1893
[7] D. Snyder, G. Chen, and D. Povey, "Musan: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[8] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohamadi, and S. Khudanpur, "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks," in Proceedings of the 19th Annual Conference of the International Speech Communication Association, INTERSPEECH

2018, Hyderabad, India, sep 2018. [Online]. Available:
http://danielpovey.com/files/2018_interspeech_tdnnf.pdf
[9] C. Vaquero, A. Ortega, J. Villalba, A. Miguel, and E. Lleida, “Con-
fidence measures for speaker segmentation and their relation to
speaker verification,” in Eleventh Annual Conference of the Inter-
national Speech Communication Association, 2010.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

GTM-IRLab Systems for Albayzin 2018 Search on Speech Evaluation


Paula Lopez-Otero1 , Laura Docio-Fernandez2

1 Universidade da Coruña - CITIC, Information Retrieval Lab
2 Universidade de Vigo - atlanTTic Research Center, Multimedia Technology Group
paula.lopez.otero@udc.gal, ldocio@gts.uvigo.es

Abstract

This paper describes the systems developed by the GTM-IRLab team for the Albayzin 2018 Search on Speech evaluation. The system for the spoken term detection task consists in the fusion of two subsystems: a large vocabulary continuous speech recognition strategy that uses the proxy words approach for out-of-vocabulary terms, and a phonetic search system based on the probabilistic retrieval model for information retrieval. The query-by-example spoken term detection system is the result of fusing four subsystems: three of them are based on dynamic time warping search using different representations of the waveforms, namely Gaussian posteriorgrams, phoneme posteriorgrams and a large set of low-level descriptors; and the other one is the phonetic search system used for spoken term detection with some modifications to manage spoken queries.

Index Terms: Spoken term detection, query-by-example spoken term detection, large vocabulary continuous speech recognition, out-of-vocabulary terms, phoneme posteriorgrams, Gaussian posteriorgrams, probabilistic information retrieval, phonetic search

1. Introduction

In this paper, the systems developed by the GTM-IRLab team for the Albayzin 2018 Search on Speech evaluation are described. Systems were submitted to the spoken term detection (STD) and the query-by-example spoken term detection (QbESTD) tasks.

In the STD task, a fusion of two subsystems was proposed. The first system consists in a strategy based on large vocabulary continuous speech recognition (LVCSR). This LVCSR system was built using the Kaldi toolkit [1] to train a set of acoustic models, to generate the output lattices and to perform lattice indexing and term search [2]. The proxy words strategy described in [3] was used to deal with out-of-vocabulary (OOV) terms. The second system performs phonetic search (PS) using an approach that adapts the probabilistic retrieval model [4] for information retrieval to the search on speech task, similarly as described in [5, 6].

For the QbESTD task, the proposed system consists in a fusion of four different systems. Three of them rely on dynamic time warping (DTW) search with different representations of the speech data, namely phoneme posteriorgrams [7], low-level descriptors [8] and Gaussian posteriorgrams [9]. The fourth system is an adaptation of the PS approach used for STD that copes with spoken queries.

The rest of this paper is organized as follows: Sections 2 and 3 describe the systems for the STD and QbESTD tasks, respectively; Section 4 presents the preliminary results obtained for the different tasks on the development data; and Section 5 presents some conclusions extracted from the experimental validation of the different systems.

2. Spoken term detection system

The proposed STD system consists in the fusion of two subsystems: one is based on an LVCSR system while the other is a PS system that adapts the probabilistic retrieval model for information retrieval to the STD task.

2.1. LVCSR system

An LVCSR system was built using the Kaldi open-source toolkit [1]. Deep neural network (DNN) based acoustic models were used; specifically, a DNN-based context-dependent speech recognizer was trained following Karel Veselý's DNN training approach [10]. The input acoustic features to the neural network are 40-dimensional Mel-frequency cepstral coefficients (MFCCs) augmented with three pitch and voicing related features [11], and appended with their delta and acceleration coefficients. The DNN has 6 hidden layers with 2048 neurons each. Each speech frame is spliced across ±5 frames to produce 1419-dimensional vectors, which are the input to the first layer, and the output layer is a soft-max layer representing the log-posteriors of the context-dependent HMM states.

The Kaldi LVCSR decoder generates word lattices [12] using the above DNN-based acoustic models. These lattices are processed using the lattice indexing technique described in [2], so that the lattices of all the utterances in the search collection are converted from individual weighted finite state transducers (WFST) to a single generalized factor transducer structure in which the start-time, end-time and lattice posterior probability of each word token is stored as a 3-dimensional cost. This factor transducer is actually an inverted index of all word sequences seen in the lattices. Thus, given a list of keywords or phrases, a simple finite state machine is created such that it accepts the keywords/phrases and composes them with the factor transducer to obtain all the occurrences of the keywords/phrases in the search collection.

The proxy words strategy included in Kaldi [3] was used for OOV term detection. This approach uses phone confusion to find the in-vocabulary (INV) term that is the most similar, in terms of its phonetic content, to the corresponding OOV term, and search is performed using the INV term.

The data used to train the acoustic models of this LVCSR system were extracted from the Spanish material used in the 2006 TC-STAR automatic speech recognition evaluation campaign (http://www.tc-star.org) and from the Galician broadcast news database Transcrigal [13]. It must be noted that all the non-speech parts, as well as the speech parts corresponding to transcriptions with pronunciation errors, incomplete sentences and short speech utterances, were discarded, so in the end the acoustic training material consisted of approximately 104 hours and 30 minutes.

The language model (LM) was constructed using a text database of 150 MWords composed of material from several

240 10.21437/IberSPEECH.2018-50
sources (transcriptions of European and Spanish Parliaments from the TC-STAR database, subtitles, books, newspapers, online courses and transcriptions of the Mavir sessions included in the development set, http://cartago.lllf.uam.es/mavir/index.pl?m=descargas [14]). Specifically, two fourgram-based language models were trained following the Kneser-Ney discounting strategy using the SRILM toolkit [15], and the final LM was obtained by mixing both LMs using the SRILM static n-gram interpolation functionality. One of the LMs was trained using the RTVE2018 subtitles data provided for the Albayzin 2018 Text-to-Speech challenge, and the other LM was built using the other text corpora. The LM vocabulary size was limited to the most frequent 300K words and, for each search task, the set of OOV keywords was removed from the language model.

2.2. PS system

A system based on phonetic search following the probabilistic retrieval model for information retrieval was developed for the STD task:

• Indexing. First, the phone transcription of each document is obtained, and then the documents are indexed in terms of phone n-grams of different sizes [16, 5]. According to the probabilistic retrieval model, each document is represented by means of a language model [4]. In this case, given that the phone transcriptions have errors, several hypotheses for the best transcription are used to improve the quality of the language model [6]. The start time and duration of each phone are also stored in the index.

• Search. First, a phonetic transcription of the query is obtained using the grapheme-to-phoneme model for Spanish included in Cotovia [17]. Then, the query is searched within the different indices, and a score for each document is computed following the query likelihood retrieval model [18]. It must be noted that this model sorts the documents according to how likely they are to contain the query, but the start and end times of the match are required in this task. To obtain these times, the phone transcription of the query is aligned to that of the document by computing their minimum edit distance, and this allows the recovery of the start and end times since they are stored in the index. In addition, the minimum edit distance is used to penalize the score returned by the query likelihood retrieval model, as described in [6]. A sketch of this alignment step is given after this section.

The minimum and maximum sizes of the n-grams were set to 1 and 5, respectively, according to [5]. The different hypotheses for the phone transcriptions of the documents were extracted from the phone lattice obtained employing the LVCSR system described above, and the number of hypotheses to be used for indexing was empirically set to 40. Indexing and search were performed using Lucene (http://lucene.apache.org).

2.3. Fusion

Discriminative calibration and fusion were applied in order to combine the outputs of the different STD systems [19]. The global minimum score produced by the system for all queries was used to hypothesize the missing scores. After normalization, calibration and fusion parameters were estimated by logistic regression on a development dataset in order to obtain improved discriminative and well-calibrated scores [20]. Calibration and fusion training was performed using the Bosaris toolkit [21].

3. Query-by-example spoken term detection system

The primary system submitted for the QbESTD evaluation consists in the fusion of four systems. Three of those systems follow the same scheme: first, feature extraction is performed in order to represent the queries and documents by means of feature vectors; then, the queries are searched within the documents using a search approach based on DTW; finally, a score normalization step is performed. The other system is an adaptation of the PS system described above to the QbESTD task.

3.1. DTW-based systems

3.1.1. Speech representation

Three different approaches for speech representation were used; given a query Q with n frames (and equivalently, a document D with m frames), these representations result in a set Q = {q1, ..., qn} of n vectors of dimension U (and equivalently, a set D = {d1, ..., dm} of m vectors of dimension U):

• Phoneme posteriorgram (PhnPost). One subsystem relies on phoneme posteriorgrams [7] for speech representation: given a speech document and a phoneme recogniser with U phonetic units, the a posteriori probability of each phonetic unit is computed for each time frame, leading to a set of vectors of dimension U that represent the probability of each phonetic unit at every time instant. The English (EN) phone decoder developed by the Brno University of Technology was used to obtain phoneme posteriorgrams; in this decoder, each phonetic unit has three different states and a posterior probability is output for each of them, so they were combined in order to obtain one posterior probability per unit [22]. After obtaining the posteriors, a Gaussian softening was applied in order to have Gaussian-distributed probabilities [23].

• Low-level descriptors (LLD). A large set of features, summarised in Table 1, was used to represent the queries and documents; these features, obtained using the OpenSMILE feature extraction toolkit [24], were extracted every 10 ms using a 25 ms window, except for F0, probability of voicing, jitter, shimmer and HNR, for which a 60 ms window was used.

• Gaussian posteriorgram (GP). Gaussian posteriorgrams [9] were used to represent the audio documents and queries. Given a Gaussian mixture model (GMM) with U Gaussians, the a posteriori probability of each Gaussian is computed for each time frame, leading to a set of vectors of dimension U that represent the probability of each Gaussian at every time instant. In this system, 19 MFCCs were extracted from the waveforms, accompanied by their energy, delta and acceleration coefficients. Feature extraction and Gaussian posteriorgram computation were performed using the Kaldi toolkit [1]. The GMM was trained using MAVIR training and development data, as well as RTVE development recordings.
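The alignment step referenced in the Search bullet above (locating the query phone string inside a document transcription so that the stored phone timings can be recovered) can be sketched with a substring edit distance; this is an illustrative implementation, not necessarily the one used in [6]:

def min_edit_alignment(query, doc):
    """Align a query phone sequence inside a document phone sequence using a
    substring edit distance (free start/end on the document side). Returns
    (distance, start_idx, end_idx) over the document; the indices map back to
    the phone start times stored in the index."""
    n, m = len(query), len(doc)
    D = [[0] * (m + 1) for _ in range(n + 1)]   # D[i][j]: cost of query[:i] ending at doc pos j
    B = [[0] * (m + 1) for _ in range(n + 1)]   # backpointer to the start column
    for j in range(m + 1):
        B[0][j] = j                              # the match may start anywhere in the document
    for i in range(1, n + 1):
        D[i][0], B[i][0] = i, 0
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (query[i - 1] != doc[j - 1])
            dele = D[i - 1][j] + 1               # skip a query phone
            ins = D[i][j - 1] + 1                # skip a document phone
            best = min(sub, dele, ins)
            D[i][j] = best
            if best == sub:
                B[i][j] = B[i - 1][j - 1]
            elif best == dele:
                B[i][j] = B[i - 1][j]
            else:
                B[i][j] = B[i][j - 1]
    end = min(range(m + 1), key=lambda j: D[n][j])
    return D[n][end], B[n][end], end

# Example: min_edit_alignment("abc", "xxabcxx") returns (0, 2, 5).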

Table 1: Acoustic features used in the proposed search on speech system.

Description # features
Sum of auditory spectra 1
Zero-crossing rate 1
Sum of RASTA style filtering auditory spectra 1
Frame intensity 1
Frame loudness 1
Root mean square energy and log-energy 2
Energy in frequency bands 250-650 Hz (energy 250-650) and 1000-4000 Hz 2
Spectral Rolloff points at 25%, 50%, 75%, 90% 4
Spectral flux 1
Spectral entropy 1
Spectral variance 1
Spectral skewness 1
Spectral kurtosis 1
Psychoacoustical sharpness 1
Spectral harmonicity 1
Spectral flatness 1
Mel-frequency cepstral coefficients 16
MFCC filterbank 26
Line spectral pairs 8
Cepstral perceptual linear predictive coefficients 9
RASTA PLP coefficients 9
Fundamental frequency (F0) 1
Probability of voicing 1
Jitter 2
Shimmer 1
log harmonics-to-noise ratio (logHNR) 1
LCP formant frequencies and bandwidths 6
Formant frame intensity 1
Deltas 102
Total 204

3.1.2. Search algorithm

The search stage was carried out using the subsequence DTW (S-DTW) [25] variant of the classical DTW approach. To perform S-DTW, first a cost matrix M ∈ ℝ^(n×m) must be defined, in which the rows and columns correspond to the query and document frames, respectively:

  Mi,j = c(qi, dj)                   if i = 0
         c(qi, dj) + Mi-1,0          if i > 0, j = 0          (1)
         c(qi, dj) + M*(i, j)        otherwise

where c(qi, dj) is a function that defines the cost between the query vector qi and the document vector dj, and

  M*(i, j) = min(Mi-1,j, Mi-1,j-1, Mi,j-1)                     (2)

Pearson's correlation coefficient r [26] was the metric used to define the cost function, mapping it into the interval [0,1] by applying the following transformation:

  c(qi, dj) = (1 − r(qi, dj)) / 2                              (3)

Once matrix M is computed, the end of the best warping path between Q and D is obtained as

  b* = arg min_{b ∈ {1,...,m}} M(n, b)                         (4)

The starting point of the path ending at b*, namely a*, is computed by backtracking, hence obtaining the best warping path P(Q, D) = {p1, ..., pk, ..., pK}, where pk = (ik, jk), i.e. the k-th element of the path is formed by qik and djk, and K is the length of the warping path.

It is possible that a query Q appears several times in a document D, especially if D is a long recording. Hence, not only the best warping path must be detected but also others that are less likely. One approach to overcome this issue consists in detecting a given number of candidate matches nc: every time a warping path that ends at frame b* is detected, M(n, b*) is set to ∞ in order to ignore this element in the future.

A score must be assigned to every detection of a query Q in a document D. First, the cumulative cost of the warping path Mn,b* is length-normalized [27] and, after that, z-norm is applied so that the scores of all the queries have the same distribution [28].

3.2. PS system

The system described in Section 2.2 was also used for QbESTD. Since, in this experimental setup, the queries are spoken, the LVCSR system described in Section 2.1 was used to obtain phone transcriptions of the queries. In this system, the number of transcription hypotheses of the documents was empirically set to 50.

3.3. Fusion

The fusion strategy described in Section 2.3 was used to combine the QbESTD systems described in this section.

4. Preliminary Results

The systems described in the previous sections were evaluated in terms of the average term weighted value (ATWV) and maximum term weighted value (MTWV), which are the evaluation metrics defined for the Albayzin 2018 evaluation. The results included in this section were achieved using the development data provided by the organizers. Since two different datasets (MAVIR and RTVE) were used for development, and in order to avoid overfitting when choosing the decision threshold, the groundtruth labels of MAVIR and RTVE were joined into a single set (namely MAVIR+RTVE) to compute the decision threshold, which was subsequently applied to each dataset individually.
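A compact NumPy sketch of Eqs. (1)-(4) with the Pearson-based cost of Eq. (3); the z-normalization across queries and the extraction of nc candidate matches are omitted, and the length normalization follows one reasonable reading of [27] rather than the submitted code:

import numpy as np

def sdtw_search(query, doc):
    """Subsequence DTW: query and doc are (n, U) and (m, U) feature arrays.
    Returns the start frame a*, the end frame b*, and the length-normalized
    cost of the best warping path."""
    n, m = len(query), len(doc)
    # Pearson-correlation cost of Eq. (3) for every frame pair
    q = query - query.mean(axis=1, keepdims=True)
    d = doc - doc.mean(axis=1, keepdims=True)
    q /= np.linalg.norm(q, axis=1, keepdims=True) + 1e-12
    d /= np.linalg.norm(d, axis=1, keepdims=True) + 1e-12
    cost = (1.0 - q @ d.T) / 2.0
    M = np.zeros((n, m))
    L = np.ones((n, m))                     # path lengths for normalization
    M[0, :] = cost[0, :]                    # i = 0: free start within the document
    for i in range(1, n):
        M[i, 0] = cost[i, 0] + M[i - 1, 0]
        L[i, 0] = L[i - 1, 0] + 1
        for j in range(1, m):
            choices = (M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])
            lengths = (L[i - 1, j], L[i - 1, j - 1], L[i, j - 1])
            k = int(np.argmin(choices))
            M[i, j] = cost[i, j] + choices[k]    # Eqs. (1)-(2)
            L[i, j] = lengths[k] + 1
    b_star = int(np.argmin(M[n - 1, :]))         # Eq. (4)
    i, j = n - 1, b_star                         # backtrack to recover a*
    while i > 0 and j > 0:
        k = int(np.argmin((M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])))
        i, j = [(i - 1, j), (i - 1, j - 1), (i, j - 1)][k]
    return j, b_star, M[n - 1, b_star] / L[n - 1, b_star]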

Table 2: STD results on development data

                      MAVIR              RTVE               MAVIR+RTVE
System                MTWV     ATWV      MTWV     ATWV      MTWV     ATWV
LVCSR (con1)          0.5314   0.5179    0.5976   0.5798    0.5992   0.5991
PS (con2)             0.4828   0.4739    0.6286   0.5993    0.6173   0.6167
LVCSR-NP (con3)       0.5068   0.4079    0.5801   0.5794    0.5704   0.5700
Fusion (pri)          0.5470   0.5290    0.6550   0.6183    0.6826   0.6791

Table 3: QbESTD results on development data

                      MAVIR              RTVE               MAVIR+RTVE
System                MTWV     ATWV      MTWV     ATWV      MTWV     ATWV
PhnPost (con2)        0.1971   0.1742    0.7145   0.7081    0.5180   0.5160
LLD                   0.2017   0.1774    0.7136   0.7114    0.5156   0.5136
GP                    0.1877   0.1628    0.6731   0.6718    0.4856   0.4841
PS (con3)             0.2383   0.2029    0.3540   0.3528    0.3519   0.3507
Fusion DTW (con1)     0.2699   0.2649    0.7211   0.7076    0.5471   0.5451
Fusion (pri)          0.2896   0.2470    0.7273   0.6964    0.6195   0.6174

4.1. STD experiments

Table 2 shows the results achieved using the systems described in Section 2. The table also includes an additional system, namely LVCSR-NP, which consists in the aforementioned LVCSR without using the proxy words strategy for OOV terms; this means that the LVCSR-NP system does not detect any OOV terms. Comparing the LVCSR-NP and LVCSR systems, it can be seen that using the proxy words strategy is beneficial, especially when dealing with MAVIR data. The table also shows that the PS system outperforms the LVCSR system on the RTVE dataset, and it also leads to a better overall result. The combination of both systems achieves a significant improvement in all the experimental conditions, which suggests that both strategies are strongly complementary.

4.2. QbESTD experiments

Table 3 shows the results achieved by the QbESTD systems described in Section 3. The best performance on MAVIR data was achieved with the PS system, which also exhibited the lowest performance on RTVE data. The PhnPost and LLD systems achieved almost the same results for RTVE and MAVIR+RTVE data.

The table also displays the results obtained when fusing the three DTW approaches (Fusion DTW) and when fusing the four systems (Fusion). The MTWV is always higher when fusing the four systems but, for the individual datasets, the ATWV is higher when fusing only the DTW systems. Nevertheless, the overall result is better when combining the four systems, so this system was selected as the primary (pri) for this evaluation, while the fusion of the three DTW systems was presented as contrastive (con1).

5. Conclusions and future work

This paper presented the systems developed for the STD and QbESTD tasks of the Albayzin 2018 Search on Speech evaluation. The STD system consists in a fusion of an LVCSR system with a phonetic search approach based on the probabilistic retrieval model for information retrieval. The LVCSR system relied on the proxy words approach for OOV words, which were also managed by the phonetic search system. The QbESTD system is a fusion of three DTW-based systems with the phonetic search system used in the STD task.

The performance obtained in the STD and QbESTD tasks is not straightforwardly comparable because the queries used to compute the evaluation metrics are not the same for both tasks, but the results suggest that spoken queries lead to better results on the RTVE dataset. This might be caused by a greater amount of OOV words, so this will be investigated by further analysis of the results.

In future work, a system that combines word-level and phone-level representations with the probabilistic retrieval model for information retrieval will be assessed. This idea is motivated by the fact that, according to the results exhibited in the STD task, the LVCSR and phonetic search systems are strongly complementary, and designing smart combination strategies might improve the performance of logistic regression fusion.

The DTW-based systems for QbESTD used in this paper are language-independent, i.e. the system can be used regardless of the language spoken in the recordings. Given that an LVCSR system for Spanish was trained for the STD system, the use of the activations of the LVCSR network will be investigated in future work in order to assess QbESTD performance in a language-dependent setting.

6. Acknowledgements

This work has received financial support from i) "Ministerio de Economía y Competitividad" of the Government of Spain and the European Regional Development Fund (ERDF) under the research projects TIN2015-64282-R and TEC2015-65345-P, ii) Xunta de Galicia (projects GPC ED431B 2016/035 and GRC 2014/024), and iii) Xunta de Galicia - "Consellería de Cultura, Educación e Ordenación Universitaria" and the ERDF through the 2016-2019 accreditations ED431G/01 ("Centro singular de investigación de Galicia") and ED431G/04 ("Agrupación estratéxica consolidada").
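For reference, the ATWV/MTWV figures in Tables 2 and 3 are based on the NIST term-weighted value. The sketch below uses the usual STD-2006 parameters (beta ≈ 999.9, assumed here) and is not the official scoring tool, which additionally applies time-alignment rules and handles missing scores:

BETA = 999.9   # (C/V) * (1/Pr_target - 1) with C/V = 0.1, Pr_target = 1e-4

def term_weighted_value(n_hit, n_true, n_fa, n_nontarget_trials, beta=BETA):
    """TWV for one term at a fixed threshold: 1 - P_miss - beta * P_fa.
    n_true is the number of reference occurrences of the term and
    n_nontarget_trials is roughly the speech duration in seconds minus n_true."""
    p_miss = 1.0 - n_hit / n_true if n_true > 0 else 0.0
    p_fa = n_fa / n_nontarget_trials if n_nontarget_trials > 0 else 0.0
    return 1.0 - p_miss - beta * p_fa

# ATWV averages TWV over all terms at the system's decision threshold;
# MTWV is the maximum of that average over all candidate thresholds.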

7. References

[1] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[2] D. Can and M. Saraclar, "Lattice indexing for spoken term detection," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011.
[3] G. Chen, O. Yilmaz, J. Trmal, D. Povey, and S. Khudanpur, "Using proxies for OOV keywords in the keyword search task," in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2013, pp. 416–421.
[4] J. Ponte and W. Croft, "A language modeling approach to information retrieval," in Proceedings of ACM SIGIR, 1998, pp. 275–281.
[5] P. Lopez-Otero, J. Parapar, and A. Barreiro, "Efficient query-by-example spoken document retrieval combining phone multigram representation and dynamic time warping," Information Processing and Management, vol. 56, pp. 43–60, 2019.
[6] P. Lopez-Otero, J. Parapar, and A. Barreiro, "Probabilistic information retrieval models for query-by-example spoken document retrieval," Speech Communication (submitted), 2018.
[7] T. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2009, pp. 421–426.
[8] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, "Finding relevant features for zero-resource query-by-example search on speech," Speech Communication, vol. 84, pp. 24–35, 2016.
[9] Y. Zhang and J. Glass, "Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams," in IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 2009, pp. 398–403.
[10] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), no. 8, 2013, pp. 2345–2349.
[11] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in Proceedings of ICASSP, 2014, pp. 2494–2498.
[12] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiát, S. Kombrink, P. Motlícek, Y. Qian, K. Riedhammer, K. Veselý, and N. T. Vu, "Generating exact lattices in the WFST framework," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012, pp. 4213–4216.
[13] C. Garcia-Mateo, J. Dieguez-Tirado, L. Docio-Fernandez, and A. Cardenal-Lopez, "Transcrigal: A bilingual system for automatic indexing of broadcast news," in Proc. Int. Conf. on Language Resources and Evaluation, 2004.
[14] A. M. Sandoval and L. C. Llanos, "MAVIR: a corpus of spontaneous formal speech in Spanish and English," in Iberspeech 2012: VII Jornadas en Tecnología del Habla and III SLTech Workshop, 2012.
[15] A. Stolcke, J. Zheng, W. Wang, and V. Abrash, "SRILM at Sixteen: Update and outlook," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, December 2011.
[16] J. Parapar, A. Freire, and A. Barreiro, "Revisiting n-gram based models for retrieval in degraded large collections," in Proceedings of the 31st European Conference on Information Retrieval Research: Advances in Information Retrieval, ser. Lecture Notes in Computer Science, vol. 5478. Springer International Publishing, 2009, pp. 680–684.
[17] E. Rodríguez-Banga, C. Garcia-Mateo, F. Méndez-Pazó, M. González-González, and C. Magariños, "Cotovía: an open source TTS for Galician and Spanish," in Proceedings of Iberspeech 2012, 2012, pp. 308–315.
[18] C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[19] A. Abad, L. J. Rodríguez-Fuentes, M. Peñagarikano, A. Varona, and G. Bordel, "On the calibration and fusion of heterogeneous spoken term detection systems," in Proceedings of Interspeech, 2013, pp. 20–24.
[20] N. Brümmer and D. van Leeuwen, "On calibration of language recognition scores," in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, 2006, pp. 1–8.
[21] N. Brümmer and E. de Villiers, "The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing," Tech. Rep., 2011. [Online]. Available: https://sites.google.com/site/nikobrummer
[22] L. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Diez, "GTTS systems for the SWS task at MediaEval 2013," in Proceedings of the MediaEval 2013 Workshop, 2013.
[23] A. Varona, M. Penagarikano, L. Rodriguez-Fuentes, and G. Bordel, "On the use of lattices of time-synchronous cross-decoder phone co-occurrences in a SVM-phonotactic language recognition system," in 12th Annual Conference of the International Speech Communication Association (Interspeech), 2011, pp. 2901–2904.
[24] F. Eyben, M. Wöllmer, and B. Schuller, "OpenSMILE - the Munich versatile and fast open-source audio feature extractor," in Proceedings of ACM Multimedia (MM), 2010, pp. 1459–1462.
[25] M. Müller, Information Retrieval for Music and Motion. Springer-Verlag, 2007.
[26] I. Szöke, M. Skácel, and L. Burget, "BUT QUESST2014 system description," in Proceedings of the MediaEval 2014 Workshop, 2014.
[27] A. Abad, R. Astudillo, and I. Trancoso, "The L2F spoken web search system for Mediaeval 2013," in Proceedings of the MediaEval 2013 Workshop, 2013.
[28] I. Szöke, L. Burget, F. Grézl, J. Černocký, and L. Ondel, "Calibration and fusion of query-by-example systems - BUT SWS 2013," in Proceedings of the 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 7899–7903.

244
AUDIAS-CEU: A Language-independent approach for the Query-by-Example Spoken Term Detection task of the Search on Speech ALBAYZIN 2018 evaluation

Maria Cabello¹, Doroteo T. Toledano¹, Javier Tejedor²
¹AUDIAS, Universidad Autónoma de Madrid
²Universidad San Pablo-CEU, CEU Universities
m.cabello.a@gmail.com, doroteo.torre@uam.es, javier.tejedornoguerales@ceu.es
Abstract

Query-by-Example Spoken Term Detection is the task of detecting query occurrences within speech data (henceforth utterances). Our submission is based on a language-independent template matching approach. First, queries and utterances are represented as phonetic posteriorgrams computed for the English language with the phoneme decoder developed by the Brno University of Technology. Next, the Subsequence Dynamic Time Warping algorithm with a modified Pearson correlation coefficient as cost measure is employed to hypothesize detections. Results on development data showed an ATWV=0.1774 with MAVIR data and an ATWV=0.0365 with RTVE data.

Index Terms: language-independent QbE STD, template matching, SDTW

1. Introduction

The large amount of heterogeneous speech data stored in audio and audiovisual repositories makes it necessary to develop efficient methods for speech information retrieval. There are different speech information retrieval tasks, including spoken document retrieval (SDR), keyword spotting (KWS), spoken term detection (STD), and query-by-example spoken term detection (QbE STD). The advances in these tasks have been evaluated through different international evaluations related to SDR, STD, and QbE STD [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] for different languages (English, Arabic, Mandarin, Spanish, Japanese, and low-resource languages such as Swahili, Tamil, and Vietnamese). Specifically, the Spanish language has been addressed in the STD/QbE STD ALBAYZIN evaluations held in 2012, 2014 and 2016 [7, 8, 9, 10, 11].

Query-by-Example Spoken Term Detection aims to retrieve data from a speech repository (henceforth utterance) given an acoustic query containing the term of interest as input. QbE STD has been mainly addressed from three different approaches: methods based on the word/subword transcription of the query, which typically employ a word/phone-based speech recognition system for query detection [12, 13]; methods based on template matching of features, which typically rely on posteriorgram-based units and a DTW-like search for query detections [14, 15, 16, 17]; and hybrid approaches that take advantage of both [18, 19, 20, 21].

This paper presents the system submitted by the AUDIAS-CEU research team to the QbE STD task of the Search on Speech ALBAYZIN 2018 evaluation [22], which deals with the Spanish language. However, our submission does not employ the target language, and hence is based on a language-independent approach that can be used for any target language. First, phoneme posteriorgrams are computed for query/utterance representation. These phoneme posteriorgrams are computed from neural networks trained for the English language. Next, the Subsequence Dynamic Time Warping (S-DTW) algorithm generates the query occurrences. Our system is largely based on the winner system of the QbE STD task of the Search on Speech ALBAYZIN 2016 evaluation [23].

The rest of the paper is organized as follows: Section 2 presents the system submitted to the evaluation, Section 3 presents the experiments and the results obtained on the development data provided by the organizers, and Section 4 concludes the paper.

2. QbE STD system

The QbE STD system, whose architecture is presented in Figure 1, integrates two different stages: feature extraction and query detection, which are explained next.

Figure 1: QbE STD system architecture.

2.1. Feature extraction

The English phoneme recognizer developed by the Brno University of Technology [24] has been employed to compute 3-state phoneme posteriorgrams that represent both the queries and the utterances. This phoneme recognizer contains 40 units, which correspond to the 39 phonemes in English plus a non-speech unit to represent other phenomena in speech such as laughter, noise, short silences, etc. These phoneme posteriorgrams have been computed every 10 ms of speech and further processed to compute a single posterior probability per phoneme. This posterior probability is obtained by summing the posterior probabilities of the three states corresponding to the given phoneme.

2.2. Query detection

Query detection involves different stages, as presented in Figure 2.

Figure 2: Query detection stages.

First, a cost matrix that stores the similarity between every query/utterance pair is computed. The Pearson correlation coefficient [25] has been employed to build the cost matrix, as presented in Equation 1:
r(x_n, y_m) = \frac{U\,(x_n \cdot y_m) - \|x_n\|\,\|y_m\|}{\sqrt{\big(U\,\|x_n^2\| - \|x_n\|^2\big)\big(U\,\|y_m^2\| - \|y_m\|^2\big)}}    (1)

where x_n represents the query phoneme posteriorgrams, y_m represents the utterance phoneme posteriorgrams, and U represents the number of phoneme units (40 in our case).

Then, the Pearson correlation coefficient is mapped into the interval [0, 1], as given in Equation 2:

c(x_n, y_m) = \frac{1 - r(x_n, y_m)}{2},    (2)

where c(x_n, y_m) represents the cost matrix used during the S-DTW search.

Therefore, the cost c(x_n, y_m) can take the values of 1 (when r = −1), 0.5 (when r = 0), or 0 (when r = 1). Figure 3 represents the cost matrix example with the standard Pearson correlation coefficient computation.

Figure 3: Cost matrix example with the standard Pearson correlation coefficient.

The final cost used during the search has been modified as follows: when r ≤ 0, r has been assigned the value of 0. Next, c(x_n, y_m) = 1 − r(x_n, y_m). Therefore, for all the Pearson correlation coefficient values lower than or equal to 0, the cost will be maximum, hence promoting the differences between aligned and non-aligned sequences in the next stage.

Figure 4 shows the cost matrix example with this modification of the Pearson correlation coefficient computation. This modification leads to more differences in the costs between query and utterance frames.

Figure 4: Cost matrix example with the modified Pearson correlation coefficient.

2.3. Subsequence Dynamic Time Warping-based search

The S-DTW algorithm [26] has been used to hypothesize query detections within the utterances. From the cost matrix c(x_n, y_m), the accumulated cost matrix employed within the search is computed as given in Equation 3:

D_{n,m} = \begin{cases} c(x_n, y_m) & \text{if } n = 0 \\ c(x_n, y_m) + D_{n-1,0} & \text{if } n > 0,\ m = 0 \\ c(x_n, y_m) + D^{*}(n, m) & \text{otherwise,} \end{cases}    (3)

where

D^{*}(n, m) = \min\big(D_{n-1,m},\ D_{n-1,m-1},\ D_{n,m-1}\big),    (4)

which implies that only horizontal, vertical, and diagonal path movements are allowed in the search.
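As an illustration of the cost computation and accumulation described in Equations 1-4, the NumPy sketch below builds the modified-Pearson cost matrix and the S-DTW accumulated cost matrix. This is a minimal sketch and not the authors' code: the function and variable names are ours, and it reads the norm notation of Equation 1 as sums over the 40 phoneme posteriors (i.e., the standard sample Pearson correlation).

```python
import numpy as np

def cost_matrix(Q, Y):
    """Modified Pearson-correlation cost between query posteriorgrams Q (N x 40)
    and utterance posteriorgrams Y (M x 40): r <= 0 is clipped to 0, cost = 1 - r."""
    U = Q.shape[1]
    sq, sy = Q.sum(1, keepdims=True), Y.sum(1, keepdims=True)
    num = U * (Q @ Y.T) - sq * sy.T
    den = np.sqrt((U * (Q ** 2).sum(1, keepdims=True) - sq ** 2)
                  * (U * (Y ** 2).sum(1, keepdims=True) - sy ** 2).T)
    r = num / np.maximum(den, 1e-12)
    r = np.maximum(r, 0.0)            # modified cost: negative correlations saturate
    return 1.0 - r                    # cost in [0, 1], 0 only for a perfect match

def accumulated_cost(C):
    """S-DTW accumulated cost of Eq. (3): the query (rows) must be fully matched,
    while the match may start at any utterance frame (columns)."""
    N, M = C.shape
    D = np.zeros_like(C)
    D[0, :] = C[0, :]                 # free start anywhere along the utterance
    D[:, 0] = np.cumsum(C[:, 0])      # n > 0, m = 0
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m], D[n - 1, m - 1], D[n, m - 1])
    return D
```

The quadratic loop is kept deliberately explicit to mirror Equation 3; a production implementation would typically vectorize or compile this recursion.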
Figure 5 shows the accumulated cost matrix from the cost matrix presented in Figure 3 (i.e., with the standard Pearson correlation coefficient computation), and Figure 6 shows that of the cost matrix presented in Figure 4 (i.e., with the modified Pearson correlation coefficient). The accumulated cost matrix from the modified Pearson correlation coefficient shows more cost in non-occurrence regions, which favors the final query detection.

Figure 5: Accumulated cost matrix from Figure 3 with the standard Pearson correlation coefficient.

Figure 6: Accumulated cost matrix from Figure 4 with the modified Pearson correlation coefficient.

To hypothesize detections, a distance function ∆ needs to be defined. This value corresponds to the last row/column of the accumulated cost matrix D, and is used as the initial value from which the minimum values that suggest possible paths are computed. Then, the S-DTW computes a global minimum b_min (also referred to as b* in Figure 2) in the accumulated cost matrix, from which all the possible query detections are considered. This b_min has to be lower than a predefined threshold τ tuned on MAVIR development data. Next, the S-DTW computes all the local minima b* that appear in the accumulated cost matrix. These local minima b* need to be lower than a second threshold τ2 = τ · b_min to be considered as optimal paths where query detections reside. This second threshold τ2 has also been tuned on MAVIR development data.

For each value of b* that meets the aforementioned conditions, the optimal paths (a*, b*) that represent the query detections are found as follows: let p = (p_1, ..., p_l) be a possible optimal path. Starting at p_l = b*, a reverse path that ends at n = 1 (i.e., the first frame of the query) is computed as presented in Equation 5:

p_{l-1} = \operatorname{argmin}\big(D(n-1, m-1),\ D(n-1, m),\ D(n, m-1)\big).    (5)

Finally, a neighbourhood search is carried out so that all the paths (i.e., query detections) which overlap within 500 ms of a previously obtained optimal path are rejected in the final system output. An example of an optimal path found is presented in Figure 7.

Figure 7: Query detection example from the accumulated cost matrix in Figure 6.

3. Experiments and results

Experiments are carried out on the development data provided by the organizers for the QbE STD task. Two different databases were experimented with: the MAVIR database, which comprises a set of talks extracted from the Spanish MAVIR workshops¹ held in 2006, 2007, and 2008 (Corpus MAVIR 2006, 2007, and 2008) corresponding to the Spanish language, and the RTVE database, which comprises different Radio Televisión Española (RTVE) programs recorded from 2015 to 2018. For the MAVIR database, about 1 hour of speech material in total, extracted from 2 audio files, was provided by the organizers, in which 102 queries extracted from the same development data were searched. For the RTVE database, about 15 hours of speech, extracted from 12 audio files, were provided by the organizers. In these RTVE data, 103 queries were searched. The organizers also provided training data. However, since our submission is based on a previously trained phoneme recognizer for English, the system does not employ any training data.

Results are shown in Table 1 for MAVIR and RTVE development data. These results show moderate performance for MAVIR data and worse performance on RTVE data. This worse performance on RTVE data may be due to the fact that the optimal parameters found on MAVIR data were employed for the RTVE data, and no additional tuning on these data was carried out. Better performance should be expected if the system were also fine-tuned on RTVE data.

Table 1: System results for MAVIR and RTVE development data.

Database   MTWV     ATWV     p(FA)     p(Miss)
MAVIR      0.1823   0.1774   0.00002   0.801
RTVE       0.0365   0.0365   0.00000   0.963

4. Conclusions

This paper presents the AUDIAS-CEU submission for the QbE STD task of the Search on Speech ALBAYZIN 2018 evaluation. The system relies on a language-independent approach for QbE STD, since no prior information about the Spanish language is employed for system building. Phoneme posteriorgrams and Subsequence Dynamic Time Warping with a modified Pearson correlation coefficient as cost measure were employed for system construction. The system design was largely based on the winner system of the QbE STD task of the Search on Speech ALBAYZIN 2016 evaluation [23].

Future work will include fusion techniques to take advantage of the query detections from different phoneme decoders, and feature selection techniques to retain the most meaningful phoneme units for each language.

¹ http://www.mavir.net
approach for spoken term detection by example query,” in Proc.
This work was partially supported by the project “DSSL: of Interspeech, 2016, pp. 928–932.
Redes Profundas y Modelos de Subespacios para Detec- [16] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised
cin y Seguimiento de Locutor, Idioma y Enfermedades De- bottleneck features for low-resource Query-by-Example spoken
generativas a partir de la Voz” (TEC2015-68172-C2-1-P, term detection,” in Proc. of Interspeech, 2016, pp. 923–927.
MINECO/FEDER). [17] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Pair-
wise learning using multi-lingual bottleneck features for low-
6. References resource Query-by-Example spoken term detection,” in Proc. of
ICASSP, 2017, pp. 5645–5649.
[1] J. G. Fiscus, J. Ajot, J. S. Garofolo, and G. Doddingtion, “Results
of the 2006 spoken term detection evaluation,” in Proc. of SSCS, [18] S. Oishi, T. Matsuba, M. Makino, and A. Kai, “Combining
2007, pp. 45–50. state-level spotting and posterior-based acoustic match for im-
proved query-by-example spoken term detection,” in Proc. of In-
[2] NIST, The Ninth Text REtrieval Conference (TREC 9), Accesed: terspeech, 2016, pp. 740–744.
February, 2018. [Online]. Available: http://trec.nist.gov
[19] M. Obara, K. Kojima, K. Tanaka, S. wook Lee, , and Y. Itoh,
[3] H. Joho and K. Kishida, “Overview of the NTCIR-11 Spoken- “Rescoring by combination of posteriorgram score and subword-
Query&Doc Task,” in Proc. of NTCIR-11, 2014, pp. 1–7. matching score for use in Query-by-Example,” in Proc. of Inter-
speech, 2016, pp. 1918–1922.
[4] X. Anguera, F. Metze, A. Buzo, I. Szöke, and L. J. Rodriguez-
Fuentes, “The spoken web search task,” in Proc. of MediaEval, [20] C.-C. Leung, L. Wang, H. Xu, J. Hou, V. T. Pham, H. Lv,
2013, pp. 921–922. L. Xie, X. Xiao, C. Ni, B. Ma, E. S. Chng, and H. Li, “Toward
high-performance language-independent Query-by-Example spo-
[5] X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, and ken term detection for MediaEval 2015: Post-Evaluation analy-
F. Metze, “Query by example search on speech at Mediaeval sis,” in Proc. of Interspeech, 2016, pp. 3703–3707.
2014,” in Proc. of MediaEval, 2014, pp. 351–352.
[21] H. Xu, J. Hou, X. Xiao, V. T. Pham, C.-C. Leung, L. Wang, V. H.
[6] NIST, Draft KWS14 Keyword Search Evaluation Plan, Do, H. Lv, L. Xie, B. Ma, E. S. Chng, and H. Li, “Approximate
National Institute of Standards and Technology (NIST), search of audio queries by using DTW with phone time bound-
Gaithersburg, MD, USA, December 2013. [Online]. Available: ary and data augmentation,” in Proc. of ICASSP, 2016, pp. 6030–
https://www.nist.gov/sites/default/files/documents/itl/iad/mig/KWS14- 6034.
evalplan-v11.pdf
[22] J. Tejedor and D. T. Toledano, The ALBAYZIN 2018 Search on
[7] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio- Speech Evaluation Plan, Universidad San Pablo CEU, Universi-
Fernandez, C. Garcia-Mateo, A. Cardenal, J. D. Echeverry- dad Autónoma de Madrid, Madrid, Spain, June 2018. [Online].
Correa, A. Coucheiro-Limeres, J. Olcoz, and A. Miguel, “Spoken Available: http://iberspeech2018.talp.cat/index.php/albayzin-
term detection ALBAYZIN 2014 evaluation: overview, systems, evaluation-challenges/search-on-speech-evaluation/
results, and discussion,” EURASIP, Journal on Audio, Speech and
[23] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo,
Music Processing, vol. 2015, no. 21, pp. 1–27, 2015.
“GTM-UVigo systems for albayzin 2016 search on speech evalu-
[8] J. Tejedor, D. T. Toledano, X. Anguera, A. Varona, L. F. Hurtado, ation,” in Proc. of IberSPEECH, 2016, pp. 306–314.
A. Miguel, and J. Colás, “Query-by-example spoken term detec- [24] P. Schwarz, “Phoneme recognition based on long temporal con-
tion ALBAYZIN 2012 evaluation: overview, systems, results, and text,” Ph.D. dissertation, FIT, BUT, Brno, Czech Republic, 2008.
discussion,” EURASIP, Journal on Audio, Speech, and Music Pro-
cessing, vol. 2013, no. 23, pp. 1–17, 2013. [25] I. Szöke, M. Skacel, and L. Burget, “BUT QUESST 2014 system
description,” in Proc. of MediaEval, 2014, pp. 621–622.
[9] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez,
and C. Garcia-Mateo, “Comparison of ALBAYZIN query-by- [26] M. Muller, Information Retrieval for Music and Motion. New
example spoken term detection 2012 and 2014 evaluations,” York: Springer-Verlag, 2007.
EURASIP, Journal on Audio, Speech and Music Processing, vol.
2016, no. 1, pp. 1–19, 2016.
[10] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez,
L. Serrano, I. Hernaez, A. Coucheiro-Limeres, J. Ferreiros, J. Ol-
coz, and J. Llombart, “Albayzin 2016 spoken term detection eval-
uation: an international open competitive evaluation in spanish,”
EURASIP, Journal on Audio, Speech and Music Processing, vol.
2017, no. 22, pp. 1–23, 2017.
[11] J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez,
J. Proença, F. P. ao, F. Garcı́a-Granada, E. Sanchis, A. Pompili,
and A. Abad, “Albayzin query-by-example spoken term detection
2016 evaluation,” EURASIP, Journal on Audio, Speech, and Music
Processing, vol. 2018, no. 2, pp. 1–25, 2018.
[12] N. Sakamoto, K. Yamamoto, and S. Nakagawa, “Combination of
syllable based N-gram search and word search for spoken term
detection through spoken queries and IV/OOV classification,” in
Proc. of ASRU, 2015, pp. 200–206.
[13] R. Konno, K. Ouchi, M. Obara, Y. Shimizu, T. Chiba, T. Hirota,
and Y. Itoh, “An STD system using multiple STD results and mul-
tiple rescoring method for NTCIR-12 SpokenQuery&Doc task,”
in Proc. of NTCIR-12, 2016, pp. 200–204.
[14] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo, “Pho-
netic unit selection for cross-lingual Query-by-Example spoken
term detection,” in Proc. of ASRU, 2015, pp. 223–229.
GTTS-EHU Systems for the Albayzin 2018 Search on Speech Evaluation

Luis Javier Rodríguez-Fuentes, Mikel Peñagarikano, Amparo Varona, Germán Bordel

Departamento de Electricidad y Electrónica, UPV/EHU
Barrio Sarriena s/n, 48940 Leioa, Bizkaia, Spain
luisjavier.rodriguez@ehu.eus

Abstract

This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has also been tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.

Index Terms: Spoken Term Detection, Query-by-Example Spoken Term Detection, Bottleneck features, Dynamic Time Warping

1. Introduction

The main goal of the participation of GTTS in the Albayzin 2018 Search on Speech Evaluation was to upgrade the QbE-STD systems that we developed for MediaEval 2013 and 2014 [1][2] by: (1) using an external VAD module; (2) replacing phonetic posteriors by bottleneck features as frame-level features (which might imply changing some aspects of the methodology); and (3) handling the case of no development data by applying cross-condition calibration and heuristic thresholding. Also, as a proof-of-concept attempt to overcome the issue of OOV words, we have developed STD systems by applying public TTS APIs to synthesize a number of spoken instances of each term, then computing an average query (following the approach developed for MediaEval 2013 [3]) and using it to perform QbE-STD.

QbE-STD results on development data reveal that calibration models are not well estimated, probably due to a lack of detections for target trials. In particular, the threshold given by the application model parameters (P_target = 0.0001, C_miss = 1 and C_fa = 0.1) is too high compared to the MTWV threshold. In the case of STD, the lack of detections is even more remarkable, with P_miss ≈ 0.8 for extremely low thresholds. This could be due to an acoustic mismatch between the synthesized queries and the test audio signals, which might be blocking the DTW-based search, since the score yielded by the best match would fall below the established threshold. The low amount of detections (especially for target trials) not only yields high miss error rates but also makes the calibration model poorly estimated. This may explain why the MTWV threshold for STD scores on development data is much lower than that provided by the application model.

In both cases (QbE-STD and STD), the parameters of the DTW-based search should be further tuned in order to get a larger amount of target and non-target detections.

2. QbE-STD systems

In the design of our previous QbE-STD systems, the front-end exploited existing software (e.g. the BUT phone decoders for Czech, Hungarian and Russian [4]), whereas the backend (search, calibration and fusion) was almost entirely developed by our group (and collaborators) [3][5]. For this evaluation, we have applied an external VAD module and extracted a new set of frame-level features.

2.1. Voice Activity Detection (VAD)

In our previous QbE-STD systems, VAD was performed by using the posteriors provided by the BUT phone decoders: first, the posteriors of non-phonetic units were added; then, if this aggregated non-phonetic posterior was higher than any other (phonetic) posterior, the frame was labeled as non-speech; otherwise it was labeled as speech. In this evaluation, we are not using posteriors anymore, so we cannot apply the same procedure. Instead, we apply the Python interface to a VAD module developed by Google for the WebRTC project [6], based on Gaussian distributions of speech and non-speech features. Given an audio file, our VAD module produces two output files: the first one (required by the feature extraction module) is an HTK .lab file specifying speech and non-speech segments, whereas the second one (used by the search procedure) is a text file (.txt) containing a sequence of 1's and 0's (one per line) indicating speech/non-speech frames.
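To illustrate how this VAD step can be driven, the snippet below uses the py-webrtcvad Python interface frame by frame. It is a minimal sketch under our own assumptions about frame length and aggressiveness, and it is not the authors' exact module, which additionally writes the HTK .lab and .txt outputs described above.

```python
import webrtcvad

def speech_flags(pcm16: bytes, sample_rate: int = 8000,
                 frame_ms: int = 10, aggressiveness: int = 2):
    """Return one 1/0 speech flag per frame of 16-bit mono PCM audio."""
    vad = webrtcvad.Vad(aggressiveness)                        # 0 (least) .. 3 (most aggressive)
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample
    flags = []
    for start in range(0, len(pcm16) - bytes_per_frame + 1, bytes_per_frame):
        frame = pcm16[start:start + bytes_per_frame]
        flags.append(1 if vad.is_speech(frame, sample_rate) else 0)
    return flags
```

Writing one flag per line then yields exactly the kind of .txt mask consumed by the search procedure, while contiguous runs of 1's can be converted into the speech segments of the .lab file.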
2.2. Bottleneck features

We have updated the feature extraction module of our previous QbE-STD systems with the stacked bottleneck features (sBNF) recently presented by BUT/Phonexia [7]. Actually, three different neural networks are applied, each one trained to classify a different set of acoustic units and later optimized to language recognition tasks. The first network was trained on telephone speech (8 kHz) from the English Fisher corpus [8] with 120 monophone state targets (FisherMono); the second one was also trained on the Fisher corpus but with 2423 triphone tied-state targets (FisherTri); the third network was trained on telephone speech (8 kHz) in 17 languages taken from the IARPA Babel program [9], with 3096 stacked monophone state targets for the 17 languages involved (BabelMulti).

The architecture of these networks consists of two stages. The first one is a standard bottleneck network fed with low-level acoustic features spanning 10 frames (100 ms), the bottleneck size being 80. The second stage takes as input five equally spaced BNFs of the first stage, spanning 31 frames (310 ms), and is trained on the same targets as the first stage, with the same bottleneck size (80). The bottleneck features extracted from the second stage are known as stacked bottleneck features (sBNF). Alternatively, instead of sBNF, the extractor can output target posteriors.

The operation of the BUT/Phonexia sBNF extractors requires an external VAD module providing speech/non-speech information through an HTK .lab file. If no external VAD is provided, a simple energy-based VAD is computed internally. In this evaluation, we have applied the WebRTC VAD described above.

Our first aim was to replace the old BUT posteriors by the new BUT/Phonexia posteriors, but the huge size of the FisherTri (2423) and BabelMulti (3096) target sets required some kind of selection, clustering or dimensionality reduction approach. So, given that (at least theoretically) the same information is conveyed by sBNFs, with a suitably low dimensionality (80), we decided to switch from posteriors to sBNFs. We were aware that this change might make us pay a high price. Posteriors have a clear meaning, they can be linearly combined and their values suitably fall within the range [0, 1], which makes the −log cos(α) distance also range in [0, 1] (α being the angle between two vectors of posteriors), with very good results reported in our previous works. On the other hand, bottleneck layer activations have no clear meaning, we do not really know whether they can be linearly combined (e.g. for computing an average query from multiple query instances), and their values are unbounded, so the −log cos(α) distance no longer applies. Is there any other distance working fine with sBNF? This evaluation poses a great opportunity to address these issues.

2.3. DTW-based search

To perform the search of spoken queries in audio documents, we basically follow the DTW-based approach presented in [3]. In the following, we summarize the approach and the modifications introduced for this evaluation.

Given two sequences of sBNFs corresponding to a spoken query and an audio document, we first apply VAD to discard non-speech frames, but keeping the timestamp of each frame. To avoid memory issues, audio documents are split into chunks of 5 minutes, overlapped 5 seconds, and processed independently. This chunking process is key to the speed and feasibility of the search procedure.

Let us consider the VAD-filtered sequences corresponding to a query q = (q[1], q[2], ..., q[m]) and an audio document x = (x[1], x[2], ..., x[n]), of length m and n, respectively. Since sBNFs (theoretically) range from −∞ to +∞, we define the distance between any pair of vectors, q[i] and x[j], as follows:

d(q[i], x[j]) = -\log\left(1 + \frac{q[i] \cdot x[j]}{|q[i]|\,|x[j]|}\right) + \log 2    (1)

Note that d(v, w) ≥ 0, with d(v, w) = 0 if and only if v and w are aligned and pointing in the same direction, and d(v, w) = +∞ if and only if v and w are aligned and pointing in opposite directions.

The distance matrix computed according to Eq. 1 is normalized with regard to the audio document x, as follows:

d_{\mathrm{norm}}(q[i], x[j]) = \frac{d(q[i], x[j]) - d_{\min}(i)}{d_{\max}(i) - d_{\min}(i)}    (2)

where:

d_{\min}(i) = \min_{j=1,\dots,n} d(q[i], x[j])    (3)

d_{\max}(i) = \max_{j=1,\dots,n} d(q[i], x[j])    (4)

In this way, matrix values are in the range [0, 1] and a perfect match would produce a quasi-diagonal sequence of zeroes. This can be seen as test normalization since, given a query q, distance matrices take values in the same range (and with the same relative meaning), no matter the acoustic conditions, the speaker, etc. of the audio document x.

Note that the chunking process described above makes the normalization procedure differ from that applied in [3], since d_min(i) and d_max(i) are not computed for the whole audio document but for each chunk independently. On the other hand, considering chunks of 5 minutes might be beneficial, since normalization is performed in a more local fashion, that is, more suited to the speaker(s) and acoustic conditions of each particular chunk.

The best match of a query q of length m in an audio document x of length n is defined as that minimizing the average distance in a crossing path of the matrix d_norm. A crossing path starts at any given frame of x, k1 ∈ [1, n], then traverses a region of x which is optimally aligned to q (involving L vector alignments), and ends at frame k2 ∈ [k1, n]. The average distance in this crossing path is:

d_{\mathrm{avg}}(q, x) = \frac{1}{L} \sum_{l=1}^{L} d_{\mathrm{norm}}(q[i_l], x[j_l])    (5)

where i_l and j_l are the indices of the vectors of q and x in the alignment l, for l = 1, 2, ..., L. Note that i_1 = 1, i_L = m, j_1 = k_1 and j_L = k_2. The optimization procedure is Θ(n·m·d) in time (d: size of feature vectors) and Θ(n·m) in space. For details, we refer to [3].

The detection score is computed as 1 − d_avg(q, x), thus ranging from 0 to 1, being 1 only for a perfect match. The starting time and the duration of each detection are obtained by retrieving the time offsets corresponding to frames k1 and k2 in the VAD-filtered audio document.

This procedure is iteratively applied to find not only the best match but also less likely matches in the same audio document. To that end, a queue of search intervals is defined and initialized with [1, n]. Let us consider an interval [a, b], and assume that the best match is found at [a′, b′]; then the intervals [a, a′ − 1] and [b′ + 1, b] are added to the queue (for further processing) only if the following conditions are satisfied: (1) the score of the current match is greater than a given threshold T (in this evaluation, T = 0.85); (2) the interval is long enough (in this evaluation, half the query length: m/2); and (3) the number of matches (those already found plus those waiting in the queue) is less than a given threshold M (in this evaluation, M = 7). An example is shown in Figure 1. Finally, the list of matches for each query is ranked according to the scores and truncated to the N highest scores (in this evaluation, N = 1000, though it effectively applied only in a few cases).
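The sketch below illustrates the distance of Eq. (1), the test normalization of Eqs. (2)-(4), and the interval queue that drives the iterative search. It is our own simplification rather than the authors' implementation: best_match is a placeholder standing in for the crossing-path DTW of [3], and only the thresholds T and M quoted above are reproduced.

```python
import numpy as np
from collections import deque

def sbnf_distance_matrix(Q, X):
    """d(q[i], x[j]) = -log(1 + cos(q[i], x[j])) + log 2, as in Eq. (1)."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(Qn @ Xn.T, -1.0, 1.0)
    return -np.log1p(cos) + np.log(2.0)

def test_normalize(d):
    """Per-query-frame min-max normalization over the chunk, Eqs. (2)-(4)."""
    dmin = d.min(axis=1, keepdims=True)
    dmax = d.max(axis=1, keepdims=True)
    return (d - dmin) / np.maximum(dmax - dmin, 1e-12)

def iterative_search(dnorm, best_match, T=0.85, M=7):
    """Iteratively extract matches from a normalized distance matrix (m x n).
    best_match(dnorm, a, b) -> (score, k1, k2) is an assumed helper returning the
    best crossing path of the query within document columns [a, b]."""
    m, n = dnorm.shape
    hits, queue = [], deque([(0, n - 1)])
    while queue:
        a, b = queue.popleft()
        score, k1, k2 = best_match(dnorm, a, b)
        hits.append((score, k1, k2))
        if score > T:   # only a good match justifies searching its surroundings
            for seg in ((a, k1 - 1), (k2 + 1, b)):
                long_enough = seg[1] - seg[0] + 1 >= m // 2
                if long_enough and len(hits) + len(queue) < M:
                    queue.append(seg)
    return sorted(hits, reverse=True)
```

Note that every extracted hit is kept in the output list even when its score falls below T, mirroring the example of Figure 1, where the lower-scoring second match is still reported but its surroundings are not searched further.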

Figure 1: Example of the iterative DTW procedure: (1) the best match of q in x[1, n] is located in x[k1 , k2 ]; (2) since the score is greater
than the established threshold T , the search continues in the surrounding segments x[1, k1 − 1] and x[k2 + 1, n]; (3) x[k2 + 1, n] is
not searched, because it is too short; (4) the best match of q in x[1, k1 − 1] is located in x[k3 , k4 ]; (5) but its score is lower than T , so
the surrounding segments x[1, k3 − 1] and x[k4 + 1, k1 − 1] are not searched. The search procedure outputs the segments x[k1 , k2 ]
and x[k3 , k4 ].

2.4. Calibration and fusion of system scores

The scores produced by our systems are transformed according to a discriminative calibration/fusion approach commonly applied in speaker and language recognition, which we adapted to STD tasks for MediaEval 2013, in collaboration with Alberto Abad, from L2F, the Spoken Language Systems Laboratory, INESC-ID Lisboa. In the following paragraphs, we just summarize the procedure. For further details, see [5].

First, the so-called q-norm (query normalization) is applied, so that zero-mean and unit-variance scores are obtained per query. Then, if n different systems are fused, detections are aligned so that only those supported by k or more systems (1 ≤ k ≤ n) are retained for further processing (in this evaluation, we use k = 2). To build the full set of trials (potential detections) we assume a rate of 1 trial per second (which is consistent with the evaluation script). Now, let us consider one of those detections of a query q supported by at least k systems, and a system A that did not provide a score for it. There could be different ways to fill up this hole. We use the minimum score that A has output for query q in other trials. In fact, the minimum score for the query q is hypothesized for all target and non-target trials of query q for which system A has not output a detection score. When a single system is considered (n = 1), the majority voting scheme is skipped but qnorm and the filling up of missing scores are still applied. In this way, a complete set of scores is prepared which, together with the ground truth (target/non-target labels) for a development set of queries, can be used to discriminatively estimate a linear transformation that will hopefully produce well-calibrated scores.

The calibration/fusion model is estimated on the development set and then applied to both the development and test sets, using the BOSARIS toolkit [10][11]. Under this approach, the Bayes optimal threshold, given the effective prior (in this evaluation, P̂_target = C_miss P_target / (C_miss P_target + C_fa (1 − P_target)) = 0.001), would be applied and (at least theoretically) no further tuning would be necessary. In practice, however, if a system yielded a small amount of detections, we would be using hypothesized scores for most of the trials. As a result, the calibration/fusion model would be poorly estimated and the Bayes optimal threshold (in this evaluation, 6.9) would not produce good results.

3. STD systems

In this evaluation, we have exploited some publicly available Text-to-Speech (TTS) APIs to perform text-in-audio search as audio-in-audio search. This is just a proof-of-concept aimed at overcoming the Out-Of-Vocabulary (OOV) word issue.

We have applied the Google TTS (gTTS) Python library and command-line interface (CLI) tool [12], which provides two different female (es-ES and es-US) voices, and the Cocoa interface to speech synthesis in MacOS [13], which provides 5 different voices (three male, two female) including both European and American Spanish.

In this way, for each textual term, we synthesize 7 spoken queries: q_1, q_2, ..., q_7. These spoken queries are downsampled to 8 kHz, and VAD and sBNF extraction are applied as described in Sections 2.1 and 2.2. The longest query is then taken as reference and optimally aligned to the other queries by means of a standard DTW procedure. Let us consider the sequence of VAD-filtered sBNF vectors for the reference query, q_l of length m_l, and the sequence corresponding to another synthesized query, q_i of length m_i. The alignment starts at [1, 1] and ends at [m_l, m_i] and involves L alignments, such that each feature vector of q_l is aligned to a sequence of vectors of q_i. This is repeated for all the synthesized queries, such that we end up with a set of feature vectors S_j aligned to each feature vector q_l[j], for j = 1, 2, ..., m_l. Then, each q_l[j] is averaged with the feature vectors in S_j to get a single average query, as follows:

q_{\mathrm{avg}}[j] = \frac{1}{1 + |S_j|}\Big( q_l[j] + \sum_{v \in S_j} v \Big), \qquad j = 1, 2, \dots, m_l    (6)

Finally, the average query obtained in this way is used to search for occurrences in the audio documents, just in the same way (using the same configuration) as we do in the QbE-STD task.
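A rough sketch of this term-to-query pipeline is shown below: one spoken example is synthesized with the gTTS library, and the sBNF frames of several examples are averaged as in Eq. (6) once DTW alignments to the longest example are available. The alignment and feature extraction steps are only assumed here (the alignments argument is a precomputed structure, not a real API), and voice/accent selection for the 7 voices used in the paper is omitted.

```python
import numpy as np
from gtts import gTTS

def synthesize_example(term: str, out_path: str) -> str:
    """One Google-TTS Spanish rendering of a textual term. The paper combines 7
    voices (gTTS es-ES/es-US plus five MacOS voices), which we do not reproduce."""
    gTTS(text=term, lang="es").save(out_path)
    return out_path

def average_query(ref, others, alignments):
    """Eq. (6): average the reference query ref (m_l x d) with the frames of the
    other synthesized queries aligned to each of its frames.
    alignments[i][j] lists the frame indices of others[i] aligned to ref[j]."""
    avg = np.empty_like(ref)
    for j in range(ref.shape[0]):
        stack = [ref[j]] + [others[i][k]
                            for i in range(len(others))
                            for k in alignments[i][j]]
        avg[j] = np.mean(stack, axis=0)   # (q_l[j] + sum of aligned vectors) / (1 + |S_j|)
    return avg
```

Whether such a frame-wise mean is meaningful for unbounded bottleneck activations is precisely the open question raised in the conclusions below.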
4. Experimental setup and results

Since the BUT/Phonexia sBNF extractors operate on 8 kHz signals, all the query and test audio signals have been downsampled to 8 kHz and stored as 16-bit little-endian signed-integer single-channel WAV files. Audio conversion was performed under MacOS, using the following command:

    afconvert audio.<ext> -o audio_8k.wav -d LEI16@8000 -c 1 -f WAVE

where <ext> represents any audio format extension (such as mp3, aac, wav, etc.).

Two different datasets have been provided for training and development of QbE-STD and STD systems in the Albayzin 2018 Search on Speech Evaluation [14]: MAVIR [15] and RTVE [16]. We have not used the training dataset at all, nor the dev1 set of RTVE. Only the dev set of MAVIR and the dev2 set of RTVE have been used to estimate the calibration/fusion models, and later to search for query occurrences. Search has also been performed on the test sets: COREMAH [17], MAVIR and RTVE, applying the calibration/fusion models and the optimal (MTWV) thresholds obtained on development. In the case of COREMAH (for which no development data was provided), MAVIR calibration/fusion models have been used and heuristic thresholding has been applied, by making Yes decisions for the 15% of detections with the highest scores.

One primary (pri) and four contrastive (con1, con2, con3 and con4) systems have been submitted for each combination of task (QbE-STD, STD), condition (development, test) and dataset (COREMAH, MAVIR, RTVE). BabelMulti, FisherMono and FisherTri sBNFs were used for contrastive systems 2, 3 and 4, respectively. The concatenation of the three sBNFs (with 80 × 3 = 240 dimensions) was used as acoustic representation for contrastive system 1. Finally, the primary system was obtained as the discriminative fusion of the four contrastive systems.

Tables 1 and 2 show the ATWV/MTWV performance of the 5 GTTS-EHU systems on the development sets of MAVIR and RTVE for the QbE-STD and STD tasks, respectively. In all cases, ATWV is obtained for the Bayes optimal threshold (6.9). Along with the MTWV score, the MTWV threshold is shown too (in parentheses). As noted above, the systems eventually submitted for MAVIR and RTVE (in all tasks and conditions) applied the MTWV threshold.

Table 1: ATWV/MTWV performance on the development sets of MAVIR and RTVE for the QbE-STD systems submitted by GTTS-EHU. The ATWV threshold is set to 6.9. Along with the MTWV score, the MTWV threshold is shown in parentheses.

        MAVIR                       RTVE
        MTWV (Thr)     ATWV         MTWV (Thr)     ATWV
con2    0.1291 (4.06)  0.0000       0.4893 (4.75)  0.0097
con3    0.1327 (4.12)  0.0000       0.5722 (5.14)  0.0802
con4    0.1590 (4.09)  0.0000       0.5227 (5.09)  0.0437
con1    0.1278 (4.14)  0.0000       0.5159 (5.13)  0.0421
pri     0.1577 (4.59)  0.0000       0.5352 (5.69)  0.3043

Table 2: ATWV/MTWV performance on the development sets of MAVIR and RTVE for the STD systems submitted by GTTS-EHU. The ATWV threshold is set to 6.9. Along with the MTWV score, the MTWV threshold is shown in parentheses.

        MAVIR                       RTVE
        MTWV (Thr)     ATWV         MTWV (Thr)     ATWV
con2    0.0396 (5.12)  0.0000       0.0933 (5.30)  0.0026
con3    0.0463 (4.82)  0.0000       0.0951 (4.80)  0.0000
con4    0.0512 (4.31)  0.0000       0.0916 (4.98)  0.0000
con1    0.0398 (5.28)  0.0000       0.0843 (5.21)  0.0000
pri     0.0464 (5.43)  0.0000       0.0809 (5.91)  0.0265

5. Conclusions and future work

The QbE-STD and STD results obtained by our systems on the development sets of MAVIR and RTVE may indicate that the search procedure is detecting few occurrences of the queries, yielding high miss error rates and making it difficult to estimate good (robust) calibration/fusion models, since most of the trials are missing from the system output and we have to hypothesize scores for them. To get a larger amount of target and non-target detections, three of the parameters of the DTW-based search must be relaxed: the maximum amount M of hits per audio chunk (currently, M = 7), the minimum score T required to keep searching (currently, T = 0.85) and the maximum number N of detections per query for the whole set of audios (currently, N = 1000).

QbE-STD detection scores, though badly calibrated, seem to work fine for RTVE but not so well for MAVIR. The difference in performance might be related to a higher acoustic variability or more adverse conditions (reverberation, noise, etc.) for MAVIR.

The trick of synthesizing spoken queries from textual terms to perform STD as QbE-STD seems to be failing, with two possible causes: (1) an acoustic mismatch between the synthesized queries and the test audios might lead to low scores and block the iterative DTW detection procedure; and (2) the use of bottleneck layer activations as frame-level acoustic representation might be incompatible with the query averaging procedure (which worked fine with phone posteriors).

Future developments may involve some sort of data augmentation in QbE-STD, such as the use of pseudo-relevance feedback, that is, the use of top matching query occurrences as additional examples. Also, though query averaging is computationally cheap, using it with sBNF representations might be unfeasible, and other more expensive strategies to take advantage of the synthesized queries should be explored, such as carrying out multiple searches and fusing the results [18]. Alternatively, we could return to posteriors, by combining the high-dimensional sets of posteriors provided by the BUT/Phonexia networks with some sort of feature clustering or feature selection approach.

6. Acknowledgements

We thank Javier Tejedor and Doroteo T. Toledano for organizing a new edition of the Search on Speech Evaluation and for their help during the development process. We also thank Eduardo Lleida and the ViVoLab team for the huge effort of collecting and annotating RTVE broadcasts for the ALBAYZIN 2018 evaluations. This work has been partially funded by the UPV/EHU under grant GIU16/68.
7. References

[1] X. Anguera, L. J. Rodriguez-Fuentes, F. Metze, I. Szöke, A. Buzo, and M. Penagarikano, “Query-by-example spoken term detection on multilingual unconstrained speech,” in Interspeech 2014, Singapore, September 14-18 2014, pp. 2459–2463.
[17] COREMAH corpus, Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid, http://www.lllf.uam.es/coremah/.
[18] T. Hazen, W. Shen, and C. White, “Query-By-Example Spoken Term Detection Using Phonetic Posteriorgram Templates,” in ASRU, Merano, Italy, December 13-17, 2009, pp. 421–426.
[2] X. Anguera, L. J. Rodriguez-Fuentes, A. Buzo, F. Metze,
I. Szöke, and M. Penagarikano, “Quesst2014: Evaluating query-
by-example speech search in a zero-resource setting with real-life
queries,” in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 2015), Brisbane, Australia, April
19-24 2015, pp. 5833–5837.
[3] L. J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bor-
del, and M. Diez, “High-performance query-by-example spoken
term detection on the sws 2013 evaluation,” in IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP
2014), Florence, Italy, May 4-9 2014, pp. 7819–7823.
[4] P. Schwarz, “Phoneme recognition based on long temporal con-
text,” Ph.D. dissertation, Faculty of Information Technology, Brno
University of Technology, http://www.fit.vutbr.cz/, Brno, Czech
Republic, 2008.
[5] A. Abad, L. J. Rodriguez-Fuentes, M. Penagarikano, A. Varona,
M. Diez, and G. Bordel, “On the calibration and fusion of het-
erogeneous spoken term detection systems,” in Interspeech 2013,
Lyon, France, August 25-29 2013, pp. 20–24.
[6] Python interface to the WebRTC (https://webrtc.org/) Voice Activ-
ity Detector (VAD), https://github.com/wiseman/py-webrtcvad.
[7] A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotny,
F. Grezl, P. Schwarz, L. Burget, and J. H. Cernocky,
“BUT/Phonexia Bottleneck Feature Extractor,” in Odyssey 2018:
The Speaker and Language Recognition Workshop, Les Sables
D’Olonne, France, June 26-29 2018, pp. 283–287.
[8] C. Cieri, D. Miller, and K. Walker, “The Fisher Corpus: a Re-
source for the Next Generations of Speech-to-Text,” in LREC,
2004, pp. 69–71.
[9] Babel Program, Intelligence Advanced Research Projects
Activity (IARPA), https://www.iarpa.gov/index.php/research-
programs/babel.
[10] N. Brümmer and E. de Villiers, The BOSARIS Toolkit User Guide:
Theory, Algorithms and Code for Binary Classifier Score Process-
ing, 2011, https://sites.google.com/site/bosaristoolkit/.
[11] ——, “The BOSARIS Toolkit: Theory, Algorithms and Code for
Surviving the New DCF,” arXiv.org, 2013, presented at the NIST
SRE’11 Analysis Workshop, Atlanta (USA), December 2011,
https://arxiv.org/abs/1304.2865.
[12] gTTS (Google Text-to-Speech): Python library and CLI tool to
interface with Google Translate’s text-to-speech API, Download
and installation: https://pypi.org/project/gTTS/. Documentation:
https://gtts.readthedocs.io/en/latest/.
[13] NSSpeechSynthesizer: The Cocoa interface to speech
synthesis in macOS (AppKit module of PyObjC
bridge), Apple Developer Objective C Documentation:
https://developer.apple.com/documentation/appkit/nsspeech-
synthesizer. Stackoverflow example of use (with Python 2):
https://stackoverflow.com/questions/12758591/python-text-to-
speech-in-macintosh.
[14] J. Tejedor and D. T. Toledano, The ALBAYZIN 2018 Search on
Speech Evaluation Plan, IberSpeech 2018: X Jornadas en Tec-
nologías del Habla and V Iberian SLTech Workshop, Barcelona,
Spain, November 21-23 2018, http://iberspeech2018.talp.cat/wp-
content/uploads/2018/06/EvaluationPlanSearchonSpeech.pdf.
[15] MAVIR corpus, Laboratorio de Lingüística In-
formática, Universidad Autónoma de Madrid,
http://www.lllf.uam.es/ESP/CorpusMavir.html.
[16] E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano,
and A. de Prada, RTVE2018 Database Description, Vivolab,
Aragon Institute for Engineering Research (I3A), University of
Zaragoza and Corporación Radiotelevisión Española, June 2018,
http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf.
Cenatav Voice Group System for Albayzin 2018 Search on Speech Evaluation

Ana R. Montalvo, Jose M. Ramirez, Alejandro Roble and Jose R. Calvo
Voice Group, Advanced Technologies Application Center, CENATAV, Havana, Cuba
{amontalvo, jsanchez, aroble, jcalvo}@cenatav.co.cu
Abstract

This paper presents the system employed in the Albayzin 2018 “Search on Speech” Evaluation by the Voice Group of CENATAV. The system used in the Spoken Term Detection (STD) task consists of an Automatic Speech Recognizer (ASR) and a module to detect the terms. The open-source Kaldi toolkit is used to build both modules. The ASR acoustic models are based on DNN-HMM, S-GMM or GMM-HMM, trained with audio data provided by the organizers and additional data obtained from ELDA. The lexicon and trigram language model are obtained from the text associated with the audio. The ASR generates the lattices and the word alignments required to detect the terms. Results with development data show that the DNN-HMM model achieves a performance better than or similar to that obtained in previous challenges.

Index Terms: Spoken Term Detection, Automatic Speech Recognition, Kaldi

1. Introduction

This is the first participation of the Voice Group of Cenatav in the Albayzin Challenges. We participated in the STD task of the Albayzin 2018 Search on Speech Evaluation, developing a system described in the next section. Spoken Term Detection (STD) is defined by NIST [1] as “searching vast, heterogeneous audio archives for occurrences of spoken terms”. Iberian institutions have recently conducted research on this task, as shown in previous Albayzin Challenges [2, 3, 4, 5].

In the STD task, a list of written terms must be detected in different audio files. Although many of the terms may belong to the training vocabulary (INV), the term list is known only after processing the audio, which requires a specific treatment of out-of-vocabulary (OOV) words. The word lexicon and language models of the ASR systems employed to detect OOV words were obtained from texts where a specific OOV term list, provided by the evaluators, had been removed, to force the systems to deal with real OOV words.

The evaluation of the system is performed with three different databases in native Spanish. The first one is the test partition of the MAVIR [6] corpus, the second one is the SoS test partition of the RTVE database [7], while the third is known as the COREMAH [8] database.

The system presented for the task consists of two modules: a Large Vocabulary Continuous Speech Recognition (LVCSR) module and a Spoken Term Detection (STD) module. The acoustic modeling and word decoding were implemented using the Kaldi toolkit [9], while the trigram language models were computed with SRILM [10]. Firstly, the test audio files are processed by an LVCSR based on a Deep Neural Network. This produces word lattices which contain the most probable transcriptions of the audio files.

Section 2 describes the details of the implemented system. Section 3 shows the experimental results obtained using the development data set. Finally, Section 4 concludes the paper and describes the work under development.

2. Cenatav Voice Group STD Kaldi System

The following section describes the proposed system in more detail.

2.1. Large Vocabulary Continuous Speech Recognition (LVCSR) module

The first module of the system is a Large Vocabulary Continuous Speech Recognizer implemented following an adaptation of the s5 WSJ recipe of Kaldi. The acoustic features used are 13 Mel-Frequency Cepstral Coefficients (MFCC) with cepstral normalization (CMVN) to reduce the effects of the channel.

The training of the acoustic models begins with a flat initialization of context-independent phone Hidden Markov Models (HMM). Then several re-training and alignment passes of the acoustic models are done to obtain context-dependent phone HMMs, following these transformations of the acoustic features:

• The 13-dimensional MFCC features are spliced across +/- 4 frames to obtain 117-dimensional vectors.
• Then linear discriminant analysis (LDA) is applied to reduce the dimensionality to 40, using context-dependent HMM states as classes for the acoustic model estimation.
• Maximum likelihood linear transform (MLLT) is applied to the resulting features, making them more accurately modelled by diagonal-covariance Gaussians.
• Then, feature-space maximum likelihood linear regression (fMLLR) is applied to normalize inter-speaker variability of the features.

The obtained phone models consist of three HMM states each, in a tied-pdf cross-word tri-phone context. This GMM-HMM model was trained with 15000 Gaussians and 2129 senones.

Then the GMM-HMM model is speaker-adapted in a subspace of Gaussian Mixtures (S-GMM), following [11], using fMLLR features and sharing the same Gaussian model and the same number of Gaussians per state of the model. This S-GMM contains 9000 Gaussians and 7000 branches.

The GMM-HMM model is also used as an alignment for training a DNN-based acoustic model (DNN-HMM) following Dan's DNN implementation, with 2 hidden layers of 300 nodes each. The number of spliced frames was nine, to produce 360-dimensional vectors as input to the first layer. The output layer is a soft-max layer representing the log-posteriors of the 2129 context-dependent states. The total number of parameters is 1703600.

The LVCSR decoder generates word lattices [12] using any of the mentioned models; these lattices contain the most probable transcriptions of the test utterances where the term
search will be performed. These lattices and the transcriptions and eight programs “La Noche en 24H”), of this partition,
obtained are the primary input to the STD module. were manually segmented and labeled by speaker, by us,
The lattices are processed using the lattice indexing in expressions less than 40 seconds to obtain a training
technique described in [13], where the lattices of all the test dataset of 2013 expressions with 139,983 words with a
utterances are converted from individual weighted finite state duration of 13:45 hours. This set was used for training the
transducers (WFST) to a single generalized factor transducer acoustic models and the language model. The
structure in which the start- time, end-time and lattice posterior development data is in the partition “dev2” and is about
probability of each word is stored as a 3-dimensional cost. This 15 hours in twelve RTVE programs.
factor transducer is an inverted index of all word sequences seen in the lattices.
Thus, given a list of terms to detect, a simple finite-state machine that accepts the terms is created and composed with the factor transducer to obtain all occurrences of those terms in the test utterances, along with the utterance ID, start time, end time and lattice posterior probability of each occurrence. All occurrences are then sorted according to their posterior probabilities and a YES/NO decision is assigned to each instance.

2.2. Train and development data

This section describes the databases used to train and develop our system for the STD task.

• TC-STAR: We obtained from ELDA-ELRA, free of charge, a set of audio files and transcriptions of the Spanish partition of TC-STAR 2005-2007 [14], corresponding to the Evaluation Package database. It contains 26:40 hours of audio and consists of 17163 utterances with 241412 words. This set was used for training the acoustic models and the language model.

• MAVIR: This corpus was provided by the challenge organizers and corresponds to talks held by the MAVIR consortium in 2006, 2007 and 2008 [6]. The Spanish training data, contained in “SoS2018_training”, comprises 4:20 hours and consists of 5 talks segmented into 2400 utterances with 44423 words. This set was used for training the acoustic models and the language model. The MAVIR development data is contained in “SoS2018_development (1)” and amounts to about one hour in two talks.

• RTVE: The Challenge organizers provided this corpus and its structure is explained in [15]. The corpus is divided into 4 partitions: one “train” partition, two development partitions “dev1” and “dev2”, and one “test” partition. The audio files of the “train” partition do not have human-revised transcriptions. Partition “dev1” contains about 53 hours of audio with its corresponding human-revised transcriptions and can be used for either development or training. So, transcriptions (files “trn”) and audio (files “aac”) of twelve RTVE programs (four programs “20H”

During the manual segmentation of the transcriptions and audio of the RTVE programs, we observed that the transcript files omit many speech expressions that occur in the audio, and that some speaker voices overlap, which is typical of the spontaneity and level of improvisation of the conversations among the journalists participating in the programs. We eliminated the overlapping voices when segmenting the audio and its corresponding transcription; however, only some of the untranscribed speech expressions were transcribed by us when we detected them, listening carefully and replaying the audio file many times during the segmentation process.

All the textual training material of the three databases was revised and corrected, converted to uppercase, with numbers and acronyms replaced by their transcriptions, and finally grouped into a database of 23029 different words and 44:45 hours of audio.

2.3. Vocabulary and Lexicon

The dictionary used by the LVCSR is composed only of words from the transcriptions of the training data. The Multilingual G2P transcriber [16] was used to obtain the phonetic transcription of each word. We obtained a general lexicon of 23029 different words.

2.4. Language models

To train the language model used by the LVCSR, we used only the transcriptions of the training data corpus. It consists of 21575 utterances and 23029 different words. This text was supplied to the SRILM tool to create an ARPA-format trigram language model with 23002 unigrams, 156778 bigrams and 38628 trigrams.

2.5. INV and OOV Terms

This Challenge evaluation defines two sets of terms for the STD task: an in-vocabulary (INV) set and an out-of-vocabulary (OOV) set. The OOV set is composed of out-of-vocabulary words for the LVCSR system, so these OOV terms must be removed from the system dictionary and, consequently, from the lexicon and the language model.
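As an illustration of this OOV handling, the following minimal Python sketch (our own illustration, not the participants' scripts; file names such as oov_terms.txt and lexicon.txt are placeholders) removes the OOV evaluation terms from a Kaldi-style lexicon so that the LVCSR cannot hypothesize them:

    # Minimal sketch: drop OOV evaluation terms from a word-pronunciation lexicon.
    # File names are placeholders, not the actual challenge files.

    def load_terms(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().upper() for line in f if line.strip()}

    def filter_lexicon(lexicon_in, lexicon_out, oov_terms):
        kept, removed = 0, 0
        with open(lexicon_in, encoding="utf-8") as fin, \
             open(lexicon_out, "w", encoding="utf-8") as fout:
            for line in fin:
                if not line.strip():
                    continue
                word = line.split(maxsplit=1)[0].upper()  # first field is the word
                if word in oov_terms:
                    removed += 1          # drop entries for OOV terms
                else:
                    fout.write(line)      # keep in-vocabulary entries
                    kept += 1
        return kept, removed

    if __name__ == "__main__":
        oov = load_terms("oov_terms.txt")
        kept, removed = filter_lexicon("lexicon.txt", "lexicon_inv.txt", oov)
        print(f"kept {kept} entries, removed {removed} OOV entries")

The reduced lexicon is then used to rebuild the decoding resources, so the language model no longer contains the OOV terms either.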

Table 1: STD scores for MAVIR and RTVE development corpus.

Development set Kaldi Models MTWV ATWV Pfa Pmiss


MAVIR tri3b (GMM-HMM) 0.5705 0.5556 0.00006 0.373
MAVIR sgmm2_4 (S-GMM) 0.6045 0.5897 0.00005 0.348
MAVIR tri4_nnet (DNN_HMM) 0.5974 0.5946 0.00008 0.322
RTVE tri3b (GMM-HMM) 0.2943 0.2939 0.00002 0.690
RTVE sgmm2_4 (S-GMM) 0.2889 0.2868 0.00001 0.699
RTVE tri4_nnet (DNN_HMM) 0.3303 0.3295 0.00002 0.648
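For reference, the ATWV and MTWV columns of Table 1 are both based on the term-weighted value used in the NIST STD evaluations; a standard formulation (a summary of the metric, not a formula taken from this paper; the weighting constant β is 999.9 in the 2006 NIST plan [1]) is:

    \mathrm{TWV}(\theta) = 1 - \frac{1}{|T|} \sum_{t \in T} \left[ P_{\mathrm{miss}}(t,\theta) + \beta \, P_{\mathrm{FA}}(t,\theta) \right]

ATWV is TWV(θ) evaluated at the system's own decision threshold, while MTWV is the maximum of TWV(θ) over all possible thresholds θ.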

3. Experimental results

Table 1 contains the STD scores obtained with the three proposed models (GMM-HMM, S-GMM and DNN-HMM) on the DEV sets of the MAVIR and RTVE corpora, evaluated with the NIST STDEval-0.7 tool provided by the Challenge organizers.

This tool provides the probabilities of False Acceptances (Pfa) and Misses (Pmiss) of the STD system, and two metrics that integrate both probabilities [17]:
• Actual Term Weighted Value (ATWV), which integrates Pfa and Pmiss for each term and averages over all terms, representing the term-weighted value at a threshold set by the system and tuned on development data.
• Maximum Term Weighted Value (MTWV), which is the maximum TWV achieved by the system over all possible thresholds; it does not depend on the tuned threshold and represents an upper bound of the system performance.

Compared with the results obtained on the same MAVIR dataset in the Albayzin 2016 spoken term detection evaluation [5], our results for all the proposed models are similar to those obtained by the best system on the DEV set (first line of Table 9 of [5]) and on the TEST set (first line of Table 10 of [5]). Additionally, the proposed DNN-HMM model surpasses the behavior of the best system in that evaluation.

However, the results show that the behavior of the evaluated methods is worse on the RTVE set, probably due to:
• the differences between the speech of the training set and its transcriptions, explained above, and
• the differences in spontaneity and level of improvisation between the MAVIR and RTVE sets.

4. Conclusions and future work

Due to the participants' inexperience with the Kaldi tool, the lack of time, and difficulties in exploiting the computational resources available to us, it was impossible to carry out, before the deadline, the evaluation of the proposed systems applying the Kaldi proxy method for the DEV and TEST sets.

Taking into account the best results of the DNN-HMM model on the RTVE set, its lowest Pmiss and its smallest performance gap between the MTWV and ATWV metrics, an indication of well-calibrated term detection scores, we decided to submit as primary system the one obtained with the DNN-HMM model, and as contrastive systems 1 and 2 those obtained with the S-GMM and GMM-HMM models, respectively.

We consider the participation in this Challenge very useful. The experience acquired by all the participants will serve us in future competitions and, more importantly, in the development of our research in the fields of ASR and STD.

We intend to continue and conclude the experiments, evaluating the Kaldi proxy method and refining the transcripts of the RTVE database samples in order to make them available to the Spanish Thematic Network on Speech Technology (RTTH).

5. Acknowledgments

Finally, we would like to thank the University of Zaragoza and the University of Vigo, in the persons of professors Eduardo Lleida Solano and Carmen Garcia Mateo, for their support in the elaboration of the textual training data of our system.

References

[1] NIST: The spoken term detection (STD) 2006 evaluation plan. National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 10th edn., 2006. http://www.nist.gov/speech/tests/std
[2] Tejedor, J., Toledano, D.T., Anguera, X., Varona, A., Hurtado, L.F., Miguel, A., Colas, J.: Query-by-example spoken term detection Albayzin 2012 evaluation: overview, systems, results, and discussion. EURASIP Journal on Audio, Speech, and Music Processing, 2013(1):23, 2013.
[3] Tejedor, J., Toledano, D.T., Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C., Cardenal, A., Echeverry-Correa, J.D., Coucheiro-Limeres, A., Olcoz, J., Miguel, A.: Spoken term detection Albayzin 2014 evaluation: overview, systems, results, and discussion. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1):21, 2015.
[4] Tejedor, J., Toledano, D.T., Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C.: Comparison of Albayzin query-by-example spoken term detection 2012 and 2014 evaluations. EURASIP Journal on Audio, Speech, and Music Processing, 2016(1):1, 2016.
[5] Tejedor, J., Toledano, D.T., Lopez-Otero, P., Docio-Fernandez, L., Serrano, L., Hernaez, I., Coucheiro-Limeres, A., Ferreiros, J., Olcoz, J., Llombart, J.: Albayzin 2016 spoken term detection evaluation: an international open competitive evaluation in Spanish. EURASIP Journal on Audio, Speech, and Music Processing, 2017(1):22, 2017.
[6] Sandoval, A.M., Llanos, L.C.: MAVIR: a corpus of spontaneous formal speech in Spanish and English. In: IberSpeech 2012: VII Jornadas en Tecnología del Habla, 2012. MAVIR corpus: http://www.lllf.uam.es/ESP/CorpusMavir.html
[7] RTVE corpus: http://catedrartve.unizar.es/reto2018.html
[8] COREMAH corpus: http://www.lllf.uam.es/coremah/
[9] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[10] Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing (ICSLP 2002), pp. 901-904, 2002.
[11] Povey, D., Burget, L., Agarwal, M., Akyazi, P., Kai, F., Ghoshal, A., Glembek, O., Goel, N., Karafiat, M., Rastrow, A., Rose, R., Schwarz, P., Thomas, S.: The subspace Gaussian mixture model: A structured model for speech recognition. Computer Speech & Language, 25(2):404-439, 2011.
[12] Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., Karafiát, M., Kombrink, S., Motlícek, P., Qian, Y., Riedhammer, K., Veselý, K., Vu, N.T.: Generating exact lattices in the WFST framework. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4213-4216, 2012.
[13] Can, D., Saraclar, M.: Lattice indexing for spoken term detection. IEEE Transactions on Audio, Speech and Language Processing, 19(8):2338-2347, 2011.
[14] TC-STAR: Technology and Corpora for Speech to Speech Translation. http://www.tcstar.org/pages/main.htm
[15] Lleida, E., Ortega, A., Miguel, A., Bazán, V., Pérez, C., Zotano, M., De Prada, A.: RTVE2018 Database Description.
[16] Multilingual Grapheme to Phoneme. https://github.com/jcsilva/multilingual-g2p
[17] Fiscus, J.G., Ajot, J., Garofolo, J.S., Doddington, G.: Results of the 2006 spoken term detection evaluation. In: Proc. of the Workshop on Searching Spontaneous Conversational Speech, pp. 45-50, 2007.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge
Javier Jorge1 , Adrià Martı́nez-Villaronga1 , Pavel Golik2 , Adrià Giménez1 ,
Joan Albert Silvestre-Cerdà1 , Patrick Doetsch2 , Vicent Andreu Cı́scar3 ,
Hermann Ney2 , Alfons Juan1 and Albert Sanchis1
1 Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València (Spain)
2 Human Language Technology and Pattern Recognition, RWTH Aachen University (Germany)
3 Escola Tècnica Superior d'Enginyeria Informàtica, Universitat Politècnica de València (Spain)
{jjorge,amartinez1,agimenez,jsilvestre,ajuan,josanna}@dsic.upv.es
{golik,doetsch,ney}@cs.rwth-aachen.de
vicismar@inf.upv.es

DOI: 10.21437/IberSPEECH.2018-54

Abstract

This paper describes the Automatic Speech Recognition systems built by the MLLP research group of Universitat Politècnica de València and the HLTPR research group of RWTH Aachen for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge. We participated in both the closed and the open training conditions.

The best system built for the closed condition was a hybrid BLSTM-HMM ASR system using one-pass decoding with a combination of an RNN LM and show-adapted n-gram LMs. It was trained on a set of reliable speech data extracted from the train and dev1 sets using MLLP's transLectures-UPV toolkit (TLK) and TensorFlow. This system achieved 20.0% WER on the dev2 set.

For the open condition we used approx. 3800 hours of out-of-domain training data from multiple sources and trained a one-pass hybrid BLSTM-HMM ASR system using the open-source tools RASR and RETURNN developed at RWTH Aachen. This system scored 15.6% WER on the dev2 set. The highlights of these systems include robust speech data filtering for acoustic model training and show-specific language modeling.

Index Terms: Automatic Speech Recognition, Acoustic Modeling, Language Modeling, Speech Data Filtering.

1. Introduction

This paper describes the joint collaboration between the Machine Learning and Language Processing (MLLP) research group from the Universitat Politècnica de València (UPV) and the Human Language Technology and Pattern Recognition (HLTPR) research group from the RWTH Aachen University for the participation in the IberSpeech-RTVE 2018 Speech-to-Text transcription challenge, held during the IberSpeech 2018 conference in Barcelona, Spain. Our participation consisted of the submission of three systems: one primary and one contrastive for the closed system training condition, and one primary for the open condition.

The rest of the paper is structured as follows. First, Section 2 describes the RTVE database that was provided by the organizers of the challenge. Second, in Section 3 we describe the ASR systems we developed under the closed training conditions. Next, Section 4 details the ASR system that participated in the open training conditions. Finally, Section 5 provides a summary of the work and gives some concluding remarks.

2. RTVE database

The RTVE (Radio Televisión Española) database consists of a collection of 15 different TV shows broadcast by the Spanish public TV station between 2015 and 2018. It comprises 569 hours of audio data, of which 460 hours are provided with subtitles and 109 hours have been human-revised. In addition, the database includes 3 million sentences extracted from the subtitles of 2017 broadcast news at the RTVE 24H Channel.

The database is provided in five partitions: subs-C24H, the 3M-sentence text dataset from the RTVE 24H channel; train, which comprises 463 hours of audio data with non-verbatim subtitles from 16 TV shows; and dev1, dev2 and test, consisting of 57, 15 and 40 hours from 5, 2 and 8 different TV shows, respectively. The dev1 and dev2 sets were provided with manually corrected transcripts, while the test set was used to gauge the performance of the participant systems. It is important to note that there is a certain overlap between dev1, dev2, test and the train set in terms of TV shows.

In order to have as much training data as possible available during the development stage of the closed-condition system, we decided to split dev1 into two subsets: dev1-train, comprising 43 hours of raw audio to be used for acoustic and language model training; and dev1-dev, 15 hours to be used internally as development data to optimize different components of the system. This split was done at the file level, trying to ensure that dev1-dev has a similar size to dev2 and that both dev1 subsets include files from the same TV shows. We used dev2 as a test set to measure the performance of the system on unseen data.

3. Closed-condition system

3.1. Speech data filtering

Under the closed training conditions, it is extremely important to make the most of the provided training data, especially when it is scarce and/or noisy. This is the case of the RTVE database: on the one hand, the train set comprises only 463 hours of audio, which is not much compared with the amount of data used to train current state-of-the-art systems [1, 2]. On the other hand, training data is not provided with verbatim transcripts but with approximate subtitles. This becomes a major concern when using recurrent neural networks for acoustic modeling, as the accuracy drops significantly when using noisy training data. Therefore, a robust speech data filtering procedure becomes a key point to achieve high ASR performance.

After examining some random samples from the train, dev1 and dev2 sets, we first-hand checked that the provided subtitles are far from being verbatim transcripts, but also noted the
presence of 1) subtitle/transcription gaps, 2) subtitle files with no timestamps, 3) audio files considerably larger than their corresponding subtitle files, 4) subtitle files covering timestamps that exceed the length of their corresponding audio file, or 5) human transcription errors in dev1, among others.

For all these reasons, we applied the following speech data filtering pipeline. As subtitle timestamps 1) are not reliable in the train set, 2) are not given in dev1, and 3) are given only at speaker-turn level in dev2, we first force-aligned each audio file to its corresponding subtitle/transcript text. We did this using a preexisting hybrid CD-DNN-HMM ASR system [3] in which the search space was constrained to recognize the exact text (no language model was involved in this procedure), with the only freedom of exploring the different word pronunciations given by the lexicon model and of using an optional silence phoneme at the beginning of each word. In this way we computed the best alignment between the input frame sequence and the sequence of HMM states inferred from the subtitle/transcript text. Then, we applied a heuristic post-filtering based on state-level frame occupation and word-level alignment scores: if an HMM state is aligned to more frames than the observed average state frame occupation plus two times the observed standard deviation, or if a word's average alignment score is lower than a given threshold, then the corresponding word alignment is considered noisy and the word is removed. Next, we completely discarded those files in which more than two thirds of the words were filtered out in the previous step. Finally, we built a clean training corpus by joining words into segments whose boundaries were delimited by large-enough silences and deleted words.

Table 1: Number of raw, aligned raw, aligned speech, and filtered speech hours as a result of applying the speech data filtering pipeline to the whole RTVE database.

              Raw   Aligned Raw   Aligned Speech   Filtered Speech
  train       463   438           252              187
  dev1-train  43    31            24               18
  dev1-dev    14    12            9                7
  dev2        15    12            9                6
  Overall     535   493           294              218

Table 1 shows the result of applying this speech data filtering pipeline to the train, dev1 and dev2 sets. The second column shows the raw audio length in hours of each set. The third refers to the amount of raw hours that could be aligned to the corresponding subtitle/transcript text by our alignment system. It must be noted that there were some audio files that the system was not capable of aligning. This happens when none of the active hypotheses can reach the final HMM state at the last time step, due to excessive histogram pruning or to a non-matching transcript. For this reason, 42 hours of audio data could not be aligned. The fourth column gives the total amount of aligned speech data after removing non-speech events that were aligned to the silence phoneme. Surprisingly, the original 438 raw hours from the train set were reduced to 252 hours of speech data, i.e. we detected 186 hours of non-speech events. After some manual analysis of the alignments, we found that a significant portion of these 186 hours is explained by non-subtitled speech, whose corresponding audio frames were in practice aligned to the silence phoneme. Finally, the fifth column shows the number of hours of clean speech data after applying the described heuristic post-filtering procedure and after discarding files that showed a high word rejection rate. Starting with the original 535 hours of raw audio, we aligned 294 hours of speech, from which we rejected 76 hours of noisy data, ending up with 218 hours of speech suitable for acoustic training.

Table 2: Corpus statistics of the text data used for LM training.

              Sentences   Running words   Vocabulary
  train       340K        4.3M            80K
  subs-C24H   3.1M        57M             160K
  RNN-train   1.8M        35M             176K
  dev1-dev    9.9K        160K            13K
  dev2        7.7K        150K            12K

3.2. Acoustic modeling

The acoustic models (AM) used during the development of the MLLP-RWTH c1-dev closed system were trained using filtered speech data from the train and dev1-train sets, that is, 205 hours of training speech data. We extracted 16-dim. MFCC features augmented by the full first and second time derivatives, resulting in 48-dim. features.

Our acoustic models were based on the hybrid approach [4, 5]. We first trained a conventional context-dependent Gaussian mixture model hidden Markov model (CD-GMM-HMM) with three left-to-right states. The state-tying schema was estimated following a phonetic decision tree approach [6], resulting in 8.9K tied states. The GMM acoustic model was used to force-align the training data. We then trained a context-dependent feed-forward DNN-HMM using a context window of 11 frames, six hidden layers with ReLU activation functions and 2048 units per layer. We used the transLectures-UPV toolkit (TLK) [7] to train both GMM and DNN acoustic models.

Apart from the feed-forward model, we also trained a BLSTM-HMM model [5]. The DNN was used to refine the alignment between input acoustic features and HMM states. We then trained the BLSTM-HMM model using the open-source toolkit TensorFlow [8] and TLK. The BLSTM network consisted of four bidirectional hidden layers with 512 LSTM cells per layer and per direction.

In order to increase the amount of training data, the final submitted system (MLLP-RWTH p-final closed) was retrained on a total of about 218 hours from the train, dev1 and dev2 sets.

3.3. Language modeling

Our language model (LM) for the closed condition consists of a combination of several n-gram models and a recurrent neural network (RNN) model. Also, since the TV show of each audio file is known in advance, we performed LM adaptation at the n-gram model level.

First, we extracted sentences from all .srt and .trn files. Then we applied a common text processing pipeline to normalize capitalization, remove punctuation marks, expand contractions (e.g. sr. → señor) and transliterate numbers. As already mentioned, we split dev1 into two subsets, dev1-train and dev1-dev, in order to include dev1-train in training. Thus, in this section, we will refer to the combination of train and dev1-train simply as train. For LSTM LM training, we concatenated the train and subs-C24H sets into a single training file and removed redundancy by discarding repeated sentences. Also, sentences were shuffled after each epoch to allow better generalization. To carry out TV-show LM adaptation experiments, we randomly extracted 500 sentences of each TV show from the train set to be used as validation data in the adaptation process. Table 2 provides corpus statistics after normalization.

Second, to define our closed-condition system's vocabulary, we first computed the vocabulary of both the train and subs-C24H sets, and then removed singletons, so that language models can properly model unknown-word probabilities. After applying these two steps, the resulting vocabulary had 132K words. The out-of-vocabulary ratios of the dev1-dev and dev2 sets were 0.36% and 0.53%, respectively.
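A minimal Python sketch of the vocabulary selection step just described (count word frequencies over the normalized training text and drop singletons); this is our own illustration, and the file paths are placeholders rather than the actual corpus files:

    # Build a vocabulary from normalized text corpora, keeping only words
    # that occur at least twice; singletons are left to the <unk> class.
    from collections import Counter

    def build_vocabulary(corpus_paths, min_count=2):
        counts = Counter()
        for path in corpus_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    counts.update(line.split())
        return {w for w, c in counts.items() if c >= min_count}

    vocab = build_vocabulary(["train.norm.txt", "subs_c24h.norm.txt"])
    print(len(vocab), "words selected for the closed-condition vocabulary")

With a cut-off of this kind, every training word that survives is guaranteed to have been seen more than once, which is what allows the LM to reserve probability mass for genuinely unseen words.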

Table 3: Perplexities of the different LM components.

                                    dev1-dev   dev2
  (a) N-gram train                  139.6      183.0
  (b) N-gram subs-C24H              161.2      193.4
  (c) N-gram show-specific          184.0      294.3
  (d) N-gram general (a+b)          107.0      147.5
  (e) N-gram adapt (a+b+c)          99.5       139.1
  (f) RNN                           92.3       110.7
  (g) RNN + N-gram general (d+f)    78.2       101.8
  (h) RNN + N-gram adapt (e+f)      68.9       99.2

Third, we trained two standard Kneser-Ney smoothed 4-gram LMs on the train and subs-C24H sets using the SRILM toolkit [9]. Rows (a) and (b) of Table 3 show the perplexities obtained with these models on the dev1-dev and dev2 sets. In addition to these two general n-gram LMs, we trained one n-gram LM for each TV show. Row (c) of Table 3 shows the averaged perplexity of the corresponding TV-show-specific LM for each file.

Next, we trained an RNN LM using the Variance Regularization (VR) criterion [10]. This criterion reduces the computational cost during the test phase. Our models were trained on GPU devices using the CUED-RNNLM toolkit [11]. The network setup was optimized to minimize perplexity on the dev1-dev set. It consisted of a 1024-unit embedding layer and a hidden LSTM layer of 1024 units. The output layer is a 132K-unit softmax, whose size corresponds to the vocabulary size. The perplexities obtained with this network are depicted in Row (f) of Table 3.

Then, the combination of the LMs was done in two steps. Firstly, we performed a linear interpolation of n-gram models. For the general, non-adapted models, we interpolated the LMs estimated on the train and the subs-C24H sets by minimizing the perplexity on dev1-dev [12]. Row (d) of Table 3 shows the perplexities for this particular LM combination. For each show-specific LM, we performed a three-way interpolation: the individual show-specific LM, the train LM and the subs-C24H LM. In this case, interpolation weights were optimized individually for each TV show so that the perplexity was minimized on the corresponding 500-sentence show-specific validation set, similarly to the approach followed in [13, 14]. Secondly, we combined the interpolated n-gram LMs with the RNN LM. Unlike the static interpolation of n-gram LMs, the result of this step is not a new monolithic model, but a set of interpolation weights to be used on the fly by the ASR decoder during search. Perplexities for the combination of the RNN LM with the general and the adapted n-gram LMs can be found in Rows (g) and (h) of Table 3.

Finally, to make the most of the provided data, the final submitted system (MLLP-RWTH p-final closed) was trained using the same hyper-parameter values estimated during the development stage, but also using the dev1-dev and dev2 sets as part of the training data.

3.4. Experiments and results

In this section we describe the experiments carried out to determine the best closed training condition system. Our experiments were devoted to assessing three components of the system: acoustic models, language models and voice activity detection (VAD) modules. In all cases we used the TLK toolkit decoder [7] to recognize test data using a one-pass decoding setup.

First, we compared the performance of the CD-FFDNN-HMM and BLSTM-HMM acoustic models described in Section 3.2. In both cases we used the general n-gram language model described in Section 3.3. Grammar scale factor and search pruning parameters were optimized on the dev1-dev set. Table 4 shows the results on both the dev1-dev and dev2 sets. As expected, the BLSTM acoustic model outperformed the feed-forward model by 12.2% relative.

Table 4: Comparison of the CD-FFDNN-HMM and BLSTM-HMM acoustic models using the general n-gram language model. Results in WER % and relative WER % improvement.

           dev1-dev WER   dev2 WER   dev2 ΔWER
  FFDNN    29.7           27.1       -
  BLSTM    26.5           23.8       12.2

Next, we analyzed the contribution of different LM combinations during search, keeping the acoustic model fixed to the best BLSTM neural network. Specifically, we carried out recognition experiments using (1) the general n-gram LM, (2) the RNN LM, (3) the interpolation of the RNN LM with the general n-gram LM, and (4) the interpolation of the RNN LM with the adapted, show-specific n-gram LMs. Table 5 shows perplexities and WERs for the dev1-dev and dev2 sets over these four different LM setups.

Table 5: Comparison of different language model combinations using the BLSTM-HMM acoustic model in terms of perplexity, WER % and relative WER % improvement.

                          dev1-dev PPL   dev1-dev WER   dev2 PPL   dev2 WER   dev2 ΔWER
  n-gram general          107            26.5           148        23.8       -
  RNN                     92             26.2           111        23.0       3.4
  RNN + n-gram general    78             25.3           102        22.4       5.9
  RNN + n-gram adapt      69             24.8           99         22.4       5.9

The best results were obtained with the combination of RNN and n-gram models, showing a consistent 6% relative improvement in both sets over the baseline general n-gram LM. It is worth noting that, in terms of WER, the improvement from using adapted models does not translate to dev2. As dev1-dev and dev2 contain different shows with strongly varying amounts of show-specific text data available for training, not all shows benefit from adaptation equally. Anyway, since the adaptation does not degrade the system performance, and given the good improvement seen on dev1-dev, we decided to use the combination of RNN LMs plus adapted n-gram LMs for the final system.

Looking at the system outputs after carrying out error analysis, we realized that our VAD module [15] was discarding a significant amount of speech regions in the audio files. This significantly affected the WER by increasing the number of deletions. For this reason, we decided to explore other audio segmentation approaches and compare their performance in terms of WER. Concretely, we compared the following approaches: (1) our baseline MLLP-UPV VAD system, based on a speech/non-speech GMM-HMM classifier that ranked second in the Albayzin-2012 audio segmentation challenge [15]; (2) the LIUM Speaker Diarization Tools, a VAD system based on the Generalized Likelihood Ratio between speech/non-speech Gaussian models [16]; (3) the well-known CMUseg audio segmentation system using the standard configuration [17]; (4) applying a fast pre-recognition step to segment the audio file by the recognized silences, using the best CD-FFDNN-HMM acoustic model and a pruned version of the general n-gram LM; and (5) using the segments generated in (4) and applying the VAD system (1) to classify those segments into speech/non-speech. It is important to note that (3) and (4) are not VAD systems but just audio segmenters, so all detected segments are considered speech,

i.e. all audio is passed through to the ASR. Table 6 shows the WER for each of the five audio segmentation/VAD techniques, including the ratio of audio that is discarded by the VAD prior to decoding.

Table 6: Comparison of different audio segmentation/VAD techniques using the BLSTM-HMM acoustic model and the combination of the RNN LM + adapted n-gram LM. Results in WER %, relative WER % improvement and ratio of dropped audio.

                        dev1-dev % drop.   dev1-dev WER   dev2 % drop.   dev2 WER   dev2 ΔWER
  MLLP-UPV (1)          10.9               24.8           5.9            22.4       -
  LIUM (2)              7.1                23.7           3.9            20.8       7.1
  CMUseg (3)            0                  23.2           0              20.9       6.7
  Pre-Recognition (4)   0                  22.9           0              20.6       8.0
  + MLLP-UPV (5)        3.2                22.3           3.3            20.0       10.7

As we expected, the baseline VAD system (1) was discarding too many segments, as it was too aggressive compared to the other techniques. With either (2) or (3) we obtained a consistent improvement, which was further increased up to 8% by using (4). We then decided to combine this segmentation with our baseline VAD system (1), which led us to achieve an 11% relative WER improvement. In absolute terms, we gained 2.4 WER points on dev2, with a final WER of 20.0%. This setup constituted our contrastive closed-condition system (MLLP-RWTH c1-dev closed), whilst our primary system (MLLP-RWTH p-final closed) was the result of re-training the same acoustic and language models with all available data, as stated in Sections 3.2 and 3.3.

Finally, we analyzed the speed of the submitted system in terms of Real Time Factor (RTF). We studied how tightening the pruning parameters affects the RTF and the WER. Also, to assess the speed of the fast pre-recognition step used to segment the audio signal, we repeated this comparison using the LIUM VAD system instead. Results of this analysis are shown in Table 7.

Table 7: Speed analysis in terms of RTF and its effect on the WER % over the dev2 set, either with the submitted system or removing the pre-recognition step and using LIUM VAD instead.

                          RTF   WER
  Submitted system (1)    1.5   20.0
  + inc. prune            0.8   20.3
  (1) with LIUM VAD       1.0   20.9
  + inc. prune            0.4   21.3

First, a more aggressive pruning speeds up the submitted system by 88% while degrading the WER by 0.3% absolute. Next, if we replace the pre-recognition step of the submitted system by the LIUM VAD module, we get a speed-up of 50% at the cost of 0.9 WER points. Finally, we could afford a very significant speed-up of 375% if we tighten the pruning parameters when using LIUM VAD, with a WER loss of 1.3 absolute points, although it would still be a competitive system, scoring 21.3% WER on dev2.

4. Open-condition system

The main motivation for participating in the open-condition track was the desire to evaluate a system developed in recent months for a different purpose, not related to the IberSpeech challenge. In order to achieve this goal, we decided to keep the amount of parameter optimization as low as possible. This system is based on the software developed at RWTH Aachen University: RASR [18, 19] and RETURNN [20, 21].

The ASR system is based on a hybrid LSTM-HMM acoustic model. It was trained on a total of approx. 3800 hours of transcribed speech from several sources, covering a variety of domains and acoustic conditions. The collection consists of subtitled videos crawled from Spanish and Latin American websites.

We used a pronunciation lexicon with a vocabulary size of 325k with one or more pronunciation variants. The acoustic model takes 80-dim. MFCC features as input and estimates state posterior probabilities for 5000 tied triphone states. The state tying was obtained by estimating a classification and regression tree (CART) on all available training data. Acoustic modeling was done using a bi-directional LSTM network with four layers and 512 LSTM units in each layer. About 30% of activations are dropped in each layer for regularization purposes [22]. During training we minimized the cross-entropy of the network-generated distribution in the softmax output layer at aligned label positions, using a Viterbi alignment defined over the 5000 tied triphone states of the CART. We used the Adam learning rate schedule [23] with integrated Nesterov momentum and further reduced the learning rate following a variant of the Newbob scheme. We split input utterances into overlapping chunks of roughly 10 seconds and perform an L2 normalization of the gradients for each chunk. With the normalized gradients, the network is updated in a stochastic gradient descent manner where batches containing up to 50 chunks are distributed over eight GPU devices and recombined into a common network after roughly 500 chunks have been processed by all devices.

The language model for the single-pass HMM decoding is a 5-gram count model trained with Kneser-Ney smoothing on a large body of text data collected from multiple publicly available sources. Its perplexity on dev1-dev and dev2 is 173.5 and 173.2, respectively. This open-track system reached a WER of 18.3% and 15.6% on dev1-dev and dev2 without any speaker or domain adaptation or model tuning.

5. Conclusions

In this paper we have presented the description of the three systems that participated in the IberSpeech-RTVE 2018 Speech-to-Text transcription challenge. Two of them, one primary (MLLP-RWTH p-final closed) and one contrastive (MLLP-RWTH c1-dev closed), were submitted to the closed training conditions, while the other one (MLLP-RWTH p-prod open) participated in the open training track. On the one hand, our best development closed-condition ASR system (MLLP-RWTH c1-dev closed), consisting of a BLSTM-HMM acoustic model trained on a reliable set of 205 hours of training speech data, and a combination of both RNN and TV-show-adapted n-gram language models, achieved a competitive mark of 20.0% WER on the dev2 set. Our final, primary closed-condition ASR system (MLLP-RWTH p-final closed) should offer a similar or even better performance, as it followed the same system design setup but was trained with all available data, including both development sets. On the other hand, our general-purpose open-condition ASR system (MLLP-RWTH p-prod open), without carrying out any speaker, domain or model adaptation of any kind, scored 15.6% WER on the dev2 set.

6. Acknowledgements

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 761758 (X5gon) and the Spanish government's TIN2015-68326-R (MINECO/FEDER) research project MORE. This work was also financed by grant FPU14/03981 from the Spanish Ministry of Education, Culture and Sport. Finally, we would also like to thank our colleagues at RWTH Aachen for many fruitful discussions: Eugen Beck, Tobias Menne and Albert Zeyer.

7. References

[1] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018, pp. 4774-4778.
[2] W. Xiong, L. Wu, J. Droppo, X. Huang, and A. Stolcke, "The Microsoft 2017 conversational speech recognition system," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018, pp. 5934-5938.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, Nov. 2012.
[4] H. Bourlard and C. J. Wellekens, "Links between Markov models and multilayer perceptrons," in Advances in Neural Information Processing Systems I, D. Touretzky, Ed. San Mateo, CA, USA: Morgan Kaufmann, 1989, pp. 502-510.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[6] S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proc. Workshop on Human Language Technology, Plainsboro, NJ, USA, Mar. 1994, pp. 307-312.
[7] M. del Agua, A. Giménez, N. Serrano, J. Andrés-Ferrer, J. Civera, A. Sanchis, and A. Juan, "The transLectures-UPV toolkit," in Advances in Speech and Language Technologies for Iberian Languages, Nov. 2014, pp. 269-278.
[8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[9] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proc. of the Int. Conf. on Spoken Language Processing, Denver, CO, USA, Sep. 2002, pp. 901-904.
[10] X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland, "Improving the training and evaluation efficiency of recurrent neural network language models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Brisbane, Australia, Apr. 2015, pp. 5401-5405.
[11] X. Chen, X. Liu, Y. Qian, M. J. F. Gales, and P. C. Woodland, "CUED-RNNLM - An open-source toolkit for efficient training and evaluation of recurrent neural network language models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Mar. 2016, pp. 6000-6004.
[12] F. Jelinek and R. L. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Proc. Workshop on Pattern Recognition in Practice, Amsterdam, Netherlands, Apr. 1980, pp. 381-397.
[13] A. Martínez-Villaronga, M. A. del Agua, J. Andrés-Ferrer, and A. Juan, "Language model adaptation for video lectures transcription," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp. 8450-8454.
[14] A. Martínez-Villaronga, M. A. del Agua, J. A. Silvestre-Cerdà, J. Andrés-Ferrer, and A. Juan, "Language model adaptation for lecture transcription by document retrieval," in Proc. IberSpeech, Nov. 2014.
[15] J. A. Silvestre-Cerdà, A. Giménez, J. Andrés-Ferrer, J. Civera, and A. Juan, "Albayzin Evaluation: The PRHLT-UPV Audio Segmentation System," in Proc. IberSpeech, Madrid, Spain, Nov. 2012, pp. 596-600.
[16] S. Meignier and T. Merlin, "LIUM SpkDiarization: an open source toolkit for diarization," in Proc. CMU SPUD Workshop, Dallas, TX, USA, Mar. 2010.
[17] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic segmentation, classification and clustering of broadcast news audio," in Proc. DARPA Speech Recognition Workshop, 1997, pp. 97-99.
[18] D. Rybach, S. Hahn, P. Lehnen, D. Nolden, M. Sundermeyer, Z. Tüske, S. Wiesler, R. Schlüter, and H. Ney, "RASR - the RWTH Aachen University open source speech recognition toolkit," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Honolulu, HI, USA, Dec. 2011.
[19] S. Wiesler, A. Richard, P. Golik, R. Schlüter, and H. Ney, "RASR/NN: The RWTH neural network toolkit for speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, May 2014, pp. 3313-3317.
[20] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney, "RETURNN: The RWTH extensible training framework for universal recurrent neural networks," in IEEE International Conference on Acoustics, Speech, and Signal Processing, New Orleans, LA, USA, Mar. 2017, pp. 5345-5349.
[21] A. Zeyer, T. Alkhouli, and H. Ney, "RETURNN as a generic flexible neural toolkit with application to translation and speech recognition," in Annual Meeting of the Assoc. for Computational Linguistics, Melbourne, Australia, Jul. 2018.
[22] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. of the Int. Conf. on Machine Learning, San Diego, CA, USA, May 2015.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription
Juan M. Perero-Codosero1,2 , Javier Antón-Martı́n1,2 , Daniel Tapias Merino1 , Eduardo
López-Gonzalo2 , Luis A. Hernández-Gómez2

1 Sigma Technologies S.L.
2 GAPS Signal Processing Applications Group, Universidad Politécnica de Madrid
{jmperero, janton, daniel}@sigma-ai.com, {eduardo.lopez, luisalfonso.hernandez}@upm.es

DOI: 10.21437/IberSPEECH.2018-55

Abstract

Deep Neural Networks (DNN) are a fundamental part of current ASR. State-of-the-art systems are hybrid models in which the acoustic models (AM) are designed using neural networks. However, there is an increasing interest in developing end-to-end Deep Learning solutions, where a neural network is trained to predict character/grapheme or sub-word sequences which can be converted directly to words. Though several promising results have been reported for end-to-end ASR systems, it is still not clear if they are capable of unseating hybrid systems.

In this contribution, we evaluate open-source state-of-the-art hybrid and end-to-end Deep Learning ASR under the IberSpeech-RTVE Speech-to-Text Transcription Challenge. The hybrid ASR is based on Kaldi, while Wav2Letter is the end-to-end framework. Experiments were carried out using 6 hours of the dev1 and dev2 partitions. The lowest WER on the reference TV show (LM-20171107) was 22.23% for the hybrid system (lowercase format without punctuation). The major limitation for Wav2Letter has been its high training computational demand (between 6 hours and 1 day per epoch, depending on the training set), which forced us to stop the training process to meet the Challenge deadline. But we believe that with more training time it will provide results competitive with the hybrid system.

Index Terms: TV shows Speech-to-Text transcription, ASR systems, Hybrid DNN-HMM, End-to-end Deep Learning.

1. Introduction

Deep Neural Networks (DNN) have become a fundamental part of current ASR systems. State-of-the-art approaches are generally hybrid models in which acoustic models (AM) are designed using neural networks to create HMM class posterior probabilities. These HMM-based neural network acoustic models (DNN-HMM) are combined with conventional pronunciation (PM) and language (LM) models [1].

The main limitation of hybrid ASR systems is the high complexity associated with the boot-strapping process for training the DNN-HMM models, which requires phoneme alignments for frame-wise cross entropy, and a sophisticated beam search decoder [2].

Though several approaches are being proposed to overcome these limitations, such as training without requiring a phoneme alignment or avoiding the lexicon [3], there is an increasing interest in working towards end-to-end solutions.

In end-to-end deep learning ASR systems [4] a neural network is trained to predict character/grapheme or sub-word sequences which can be converted directly to words, or even word sequences directly. They present the important advantage of integrating the conventionally separate acoustic, pronunciation and language models (AM, PM, LM) into a unified neural network modeling framework. Using a simplified training process, acoustic, pronunciation and language modeling components are integrated to generate the hypothesized graphemes, sub-words or word sequences. This also greatly simplifies the decoding.

Most end-to-end ASR approaches [5] are typically based on a Connectionist Temporal Classification (CTC) framework [6, 7], a sequence-to-sequence attention-based encoder-decoder [8, 9] or a combination of both [10].

Several sequence-to-sequence neural network models have been proposed, such as the Recurrent Neural Network Transducer (RNN-T) [11], Listen, Attend and Spell (LAS) [9], and Monotonic Alignments [12]. In attention-based encoder-decoder schemes [4] the listener encoder module plays a role similar to a conventional acoustic model, the attender learns alignments between the source and the target sequence, and the decoder works as a language model. Better performance has been reported [4] by modelling longer units such as word piece models (WPM) and using Multi-head attention (MHA).

Several research works have also been proposed aiming to re-use HMM-based models to improve end-to-end systems. For example, well-trained tied-triphone acoustic models (AM) can be used as an initial model for a character-based end-to-end system [5], or tied-triphone CTC models can be trained from scratch, but in this case a lexicon is required. However, these methods have important limitations as they demand a complex system development, high computation and a large amount of data for training, thus losing the attractiveness of end-to-end systems.

Therefore, even with the promising results already reported for many end-to-end ASR systems, it is still not clear if they are capable of unseating the current state-of-the-art hybrid DNN-HMM ASR systems.

In this paper, our aim is to contribute to the research towards the development of end-to-end ASR systems as an alternative to state-of-the-art hybrid ASR systems. For this purpose, we develop and compare two open-source hybrid and end-to-end ASR systems for a specific speech-to-text task. As hybrid system the DNN-HMM Kaldi Toolkit [13] is used, while Wav2Letter [14] is the end-to-end framework. The speech-to-text task we work on is the RTVE IberSpeech 2018 Challenge (http://iberspeech2018.talp.cat/index.php/albayzin-evaluation-challenges/). This task represents a highly demanding domain corresponding to the automatic transcription of TV shows and broadcast news, under specific conditions such as different noisy environments and the lack of accurate transcriptions for training.

The rest of the paper is structured as follows. In Section
2, we describe the end-to-end and the hybrid ASR systems, which we have evaluated in the RTVE IberSpeech 2018 Challenge. Section 3 details the datasets we have used, how we have preprocessed them and the experimental protocols we have followed. Results are shown and discussed in Section 4. Finally, we summarize our results and conclusions in Section 5.

2. Deep Learning ASR Systems

2.1. End-to-end Speech Recognition System

As representative of open-source end-to-end ASR systems, we have chosen Wav2Letter. The Wav2Letter ASR system is based on a neural network architecture composed of convolutional units [15], with a Gated Linear Units (GLUs) implementation [16, 17]. The acoustic modeling is based on Mel-Frequency Spectral Coefficients (MFSC) which feed the gated CNNs that generate letter scores at their outputs. These scores are processed by an alternative to CTC, the Auto Segmentation Criterion (ASG), leading to letter-based sequences (see Figure 1). In order to train the acoustic model, the feature extraction module computes 40-dimensional MFSCs, due to their robustness to small time-warping distortions, as noted in [17].

Figure 1: Architecture of the end-to-end ASR system based on the Wav2Letter system. Adapted from [17].

As commented before, the neural network architecture is trained to infer the segmentation of each letter in the training transcriptions using the Auto Segmentation Criterion (ASG), an alternative criterion to Connectionist Temporal Classification (CTC). CTC takes into account all possible letter sequences, allowing a special blank state, which represents possible garbage frames between letters or the separation between repeated letters. In ASG, blank states are replaced by the number of repetitions of the previous letter; consequently, a simpler graph is obtained [14]. Besides this graph that scores letter sequences depicting the right transcription, another graph is used to score all letter sequences. Finally, a beam-search decoder (as described in [14]) is used at the last stage. It depends on a beam thresholding, histogram pruning and an optional language model.

2.2. Hybrid Speech Recognition System

In order to compare with end-to-end ASR, we have built a hybrid ASR system using the open-source Kaldi Toolkit [13]. The ASR architecture consists of the classical sequence of three main modules: an acoustic model, a dictionary or pronunciation lexicon, and an N-gram language model. These modules are combined for training and decoding using Weighted Finite-State Transducers (WFST) [18]. The acoustic modeling is based on Deep Neural Networks and Hidden Markov Models (DNN-HMM).

For the implementation of Kaldi DNN-HMM acoustic modelling we followed the so-called chain model [19], based on a subsampled time-delay neural network (TDNN) [20]. This implementation uses a 3-fold reduced frame rate at the output of the network, which represents a significant reduction in decoding computation and the corresponding test time. The reduced frame rate requires HMMs traversable in one transition; we use fixed transition probabilities in the HMM and do not train them. Additionally, training the DNN-HMM with a sequence-level objective function allowed its implementation as a maximum mutual information (MMI) criterion without lattices on GPU, doing a full forward-backward on a decoding graph derived from a phone n-gram language model.

Starting from the available transcript of the training speech data, training acoustic models is an iterative process of audio re-alignment, starting from GMM/HMM monophone models and progressing to more accurate triphone models through re-training. For all of our experiments we used a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT [21]. For the initial GMM/HMM alignments, speaker-independent acoustic models were obtained using fMLLR. The input features to the neural network in DNN-HMM models were represented by a fixed transform that decorrelates a (40*7)-dimensional vector obtained by packing seven frames of 40-dimensional MFCC(spliced)+LDA+MLLT+fMLLR features, corresponding to 3 frames on each side of the central frame.

To improve robustness, mainly to speaker variability, speaker adaptive training (SAT) based on i-vectors was also implemented [22]. Speaker-adaptive models were obtained by fine-tuning DNNs on a speaker-normalized feature space. On each frame, a 100-dimensional i-vector is appended to the 40-dimensional acoustic space. In this extended acoustic space, i-vectors may supply information about different sources of variability such as speaker identity, so the network itself can do any feature normalization that is needed. To overcome some issues reported when test signals have substantially different energy levels than the training data, in our experiments the test-signal energies were normalized to the average of the training data.

3. Experimental Setup

3.1. Datasets

3.1.1. RTVE2018 Database

In this evaluation, we investigate the performance of end-to-end and hybrid ASR systems on RTVE voice contents (http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf), a collection of TV shows and broadcast news from 2015 to 2018.

The training partition consists of audio files with subtitles, with the following limitations:
• Subtitles have been generated through a re-speaking procedure that sometimes summarizes what has been said, producing imprecise transcriptions.
• Transcriptions have not been supervised by humans.
• Timestamps are not properly aligned with the speech signal.

Trying to avoid the use of these low-quality transcriptions, which could cause confusion in the acoustic space, the audio data was initially aligned by a baseline alignment system. This system was the same hybrid system described in Section 2.2, but trained using our own labeled databases, explained in Subsection 3.1.2. To improve the quality of these automatic transcriptions, they underwent a manual supervision process.
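Before moving on to the datasets, a minimal NumPy sketch of the context splicing and i-vector appending used to build the DNN input described in Section 2.2 (a generic illustration under our own assumptions, not the Kaldi implementation; the random arrays stand in for real features):

    # Splice 40-dim transformed frames with 3 frames of context on each side
    # (7 x 40 = 280 dims) and append a 100-dim i-vector to every frame.
    import numpy as np

    def splice(feats, context=3):
        """feats: (T, D) array; returns (T, (2*context+1)*D) with edge padding."""
        T, D = feats.shape
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        return np.concatenate(
            [padded[t:t + T] for t in range(2 * context + 1)], axis=1)

    T = 500                            # frames in an utterance
    frames = np.random.randn(T, 40)    # stand-in for MFCC+LDA+MLLT+fMLLR features
    ivector = np.random.randn(100)     # stand-in for the speaker i-vector

    spliced = splice(frames, context=3)                          # (T, 280)
    dnn_input = np.hstack([spliced, np.tile(ivector, (T, 1))])   # (T, 380)
    print(dnn_input.shape)

Each network input frame thus carries both short-term acoustic context and a per-speaker summary vector, which is what allows the network to perform its own speaker normalization.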

The RTVE training partition consists of 460 hours; however, due to our limitations in the manual supervision process, only two training datasets have been prepared for our experiments: RTVE_train350 (350 hours of the train set) and RTVE_train100 (a 100-hour subset of RTVE_train350). Validation datasets were extracted from 10% of the training set for each of the RTVE_train350 and RTVE_train100 partitions. These validation datasets have been designed trying to cover the different scenarios in the whole RTVE training data: political and economic news, in-depth interviews, debate, live magazines, weather information, game and quiz shows. Consequently, for testing purposes, two development datasets have been defined as follows:
• RTVE_dev1: 5 hours have been selected in a balanced way in order to have an hour of each show type (e.g. 20H_dev1 is one hour of the 20H program).
• RTVE_dev2: 1 hour has been selected following the same procedure as RTVE_dev1.

3.1.2. Other Databases

Acoustic models have also been evaluated under the open training condition, that is, by using additional datasets. To this end, two additional datasets have been used to train the system.
• VESLIM: It consists of 103 hours of Spanish clean voice, where speakers read some sentences. More details in [23].
• OWNMEDIA: It contains 162 hours of TV programs, interviews, lectures and similar multimedia contents. This dataset contains manual transcriptions.

3.2. Training

For the hybrid ASR system, our AM was trained following a process based on the Switchboard Kaldi recipe (TDNN chain models).

To create the PM for the hybrid ASR system, a set of 29 real phonemes (without silence phones) is used. For the end-to-end system, instead, the vocabulary contains 38 graphemes representing the standard Spanish alphabet plus stressed vowels, the apostrophe, and symbols for repetitions and separation.

3.3. Resources

Experiments have been carried out using several computation resources. A server with 2 Xeon E5-2630v4 (2.2 GHz, 10C/20TH) and 3 Nvidia GTX 1080 Ti GPUs was used for the hybrid ASR system: GPU for the DNN training and CPU for the HMM training and final decoding.

In order to train the end-to-end models, more RAM was required, so an Nvidia Quadro P5000 GPU (16 GB) was used for training and letter decoding.

4. Results

4.1. Hybrid ASR System

First, we compared the performance of our hybrid system when increasing the training data volume from 100 to 350 hours, and when adding additional training data from our external datasets. The evaluation plan mentioned that a reference TV show (LM-20171107) had been used to obtain results with some commercial systems. This show is a live magazine covering Spanish current events and it has been used to obtain our first results. As expected, more than 14% relative improvement in WER is obtained when adding all the available data.

Table I: WER on the reference TV show (LM-20171107) for acoustic models under closed and open training conditions (different training data sizes and language models).

  Hybrid systems                               WER(%)
  RTVE_train100 + LM_subtitles                 26.01
  RTVE_train350 + LM_subtitles                 24.21
  RTVE_train350 + LM_supervised                25.95
  RTVE_train350 + LM_subtsuperv                23.47
  RTVE_train350 + Others + LM_subtsuperv       23.20
  RTVE_train350 + Others + LM_open             22.23

For testing under the closed training condition, we used the full volume of RTVE training data that has been manually revised (RTVE_train350).

In addition to the AM, the LM has been trained with different corpora. Four LMs were generated: LM_subtitles (based on the subtitles given in the RTVE database for the Challenge), LM_supervised (based on transcriptions of the RTVE training data supervised by humans), LM_subtsuperv (based on the two aforementioned corpora) and LM_open (based on several corpora: news between 2015 and 2018, interviews, film captions and the two mentioned before).

As shown in Table I, by adding supervised transcriptions from the training set to a subtitles-based LM, we achieved a WER of 23.47%, a 3% relative improvement over the same system using a language model trained only with subtitles. This improvement is mainly due to the fact that supervised transcriptions contain some conversational language features (i.e. false starts, truncated words, filler words, syntactic structure changes while talking, etc.) that are generally omitted in subtitles because of the re-speaking procedure. In contrast, an LM trained only with supervised transcriptions did not provide better results. In this case, there were some tags in these transcriptions where words could not be confidently revised/transcribed (e.g. foreign names, mispronunciations, background noise, etc.). Inserting tags meant including the "unk" symbol in the LM, and results were not as good as expected.

We next evaluated the systems under the open condition, where we increased the amount of data for both AM and LM training. We combined the RTVE training dataset and our own databases (see Section 3.1.2), resulting in a total of over 600 hours of speech.

As can be seen in Table I, the hybrid system provided a slight improvement. But, more importantly, the WER went down to 22.23% when transcriptions from additional corpora were incorporated to train the LM. As a result, increasing both the amount of audio and transcription data will enable us to obtain the maximum performance, and to cover as much of the information appearing in the test files as possible.

The division of the development datasets according to the different show types makes a deeper error analysis possible. Table II shows that models applied to TV programs such as 20H and Millennium obtain the best results, a low WER of 14-17% for the best models. This could be explained because the contents are daily news with good acoustic conditions (clean voice, only one speaker at a time) and are better represented in the LM. However, the CA (Comando Actualidad) dataset contains some challenging scenarios (interviews in the street, background noise, overlapping, music). As a result, the models achieve a high WER (49.51%).

Furthermore, it has to be emphasized that the reference master of transcriptions was given without any review on our part. To evaluate the possible impact of transcription errors in the
264
Table II: WER (%) of the models on the different development datasets under closed and open training conditions (different training data volumes and language models). Each development dataset lasts approximately one hour.

                                           20H dev1   AP dev1   CA dev1   LM dev1   Mill dev1   LN24H dev1
Hybrid systems
RTVE_train100 + LM_subtitles               17.67      24.36     51.95     25.41     19.59       27.47
RTVE_train350 + LM_subtitles               16.04      21.68     51.58     23.61     17.43       26.35
RTVE_train350 + LM_supervised              16.05      22.18     59.34     25.06     19.22       26.75
RTVE_train350 + LM_subtsuperv              15.38      21.43     49.67     22.86     17.81       25.15
RTVE_train350 + Others + LM_subtsuperv     15.21      21.95     49.51     22.30     18.51       25.09
RTVE_train350 + Others + LM_open           14.88      20.94     49.55     21.44     17.01       24.13
End-to-end system
RTVE_train100                              71.61      75.38     87.06     72.35     69.49       76.70
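As a side note on how the relative improvements quoted in this section are computed, the following minimal Python sketch evaluates (WER_base - WER_new) / WER_base. The example values are taken from the 20H dev1 column of Table II; the helper name is ours, not part of the authors' tooling, and the snippet is only an illustration of the arithmetic.

def relative_wer_improvement(wer_base, wer_new):
    """Relative WER improvement, e.g. 0.158 means a 15.8% relative reduction."""
    return (wer_base - wer_new) / wer_base

# 20H dev1 column of Table II: closed-condition 100 h system vs. best open-condition system.
print(relative_wer_improvement(17.67, 14.88))   # ~0.158, i.e. about 15.8% relative improvement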

Furthermore, it has to be emphasized that the reference master of transcriptions was provided without any review on our part. To evaluate the possible impact of transcription errors in the reference master, we carried out an additional test using an external dataset in which the reference master of transcriptions was produced manually and revised. This dataset consists of 3.5 hours of television news broadcasts (similar to 20H). Applying the same best models trained for the RTVE IberSpeech 2018 Challenge, we obtained a WER of 8.51%, the best result achieved so far.

4.2. End-to-end ASR System

Overall, our results using Wav2Letter are still far from those of the Kaldi-based hybrid ASR system. Nevertheless, we must first acknowledge our limitations in training Wav2Letter due to its high computational demand. This forced us to compare both systems using only our 100-hour training dataset, RTVE_train100. The WER obtained on the same development dataset (TV shows of the dev1 and dev2 partitions) was 73.41% (on the reference TV show). Applying the LM, the WER improved to 65.3%. Time consumption in decoding was 0.0088*RT (without applying the LM) and it increased to 8.42*RT (applying the LM). In addition, Table II also collects the rest of the results after evaluating the end-to-end system under the closed training condition.
We tried both to validate our Wav2Letter development and to analyze these results in the context of recent developments of Wav2Letter on read speech (LibriSpeech, in English) [24]. To this end, we compared the evolution of the loss function per training epoch obtained when training with RTVE_train100, depicted in Figure 2, with some other developments using a similar amount of training data. In particular, we developed Wav2Letter on 100 hours from LibriSpeech in English and from the VESLIM Spanish dataset (see Section 3.1.2), which also contains about 100 hours of audio. As can be seen by analyzing the evolution of the loss functions in Figure 2, it seems that characters are better modeled on clean speech in which people read sentences than on contents recorded in uncontrolled conditions, as in RTVE_train100. This limitation could be addressed by applying some type of data augmentation to introduce perturbed samples and model the graphemes better.
Last, we must also indicate that we tried to train the end-to-end system under the open training condition. However, when using all our training databases we found a high computational demand: with our resources (see Section 3.3) it took about 1 day/epoch (versus 6.5 h/epoch when training under the closed condition), and the best results have been achieved after 40 epochs. For this reason, these results have not been delivered to the RTVE IberSpeech 2018 Challenge. In any case, our expectations for the end-to-end models are high and, therefore, we have considered continuing research in this area in future work: changing training strategies, applying optimization techniques, etc.

[Figure 2: Comparison in loss function terms during the end-to-end models training, on 100 hours of various databases: LibriSpeech, VESLIM and RTVE_train100. Axes: loss vs. epoch.]

5. Conclusions

In this contribution we have tested the use of open-source hybrid and end-to-end ASR systems under the RTVE IberSpeech 2018 Challenge. According to our results, the development of a hybrid DNN-HMM system with the Kaldi toolkit [13] seems to be capable of addressing the difficulties of this hard task, but it requires putting more effort into obtaining better quality transcriptions to improve both acoustic and language models. It is important to remark that a WER of 8.51% is obtained in the best conditions. However, further research on the use of robust feature spaces and DNN training should be addressed, looking for robustness in the more challenging scenarios (street interviews with background noise, speaker overlap, music, etc.). Regarding the end-to-end Wav2Letter system, we must acknowledge that our research has been limited by a high training computational time. Nevertheless, even under this limitation, we found that, compared to read speech (such as LibriSpeech), both the lack of correct transcriptions and the difficulties of dealing with conversational and noisy speech will require more research before this kind of end-to-end architecture can be used for these challenging tasks.

6. References
[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.
[3] A. Zeyer, E. Beck, R. Schlüter, and H. Ney, “CTC in the context of generalized full-sum HMM training,” in Proc. Interspeech, 2017, pp. 944–948.
[4] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.
[5] S. Kim, M. L. Seltzer, J. Li, and R. Zhao, “Improved training for online end-to-end speech recognition systems,” arXiv preprint arXiv:1711.02212, 2017.
[6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[7] A. Graves and N. Jaitly, “Towards end-to-end speech recognition
with recurrent neural networks,” in International Conference on
Machine Learning, 2014, pp. 1764–1772.
[8] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben-
gio, “Attention-based models for speech recognition,” in Ad-
vances in neural information processing systems, 2015, pp. 577–
585.
[9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend
and spell: A neural network for large vocabulary conversational
speech recognition,” in Acoustics, Speech and Signal Processing
(ICASSP), 2016 IEEE International Conference on. IEEE, 2016,
pp. 4960–4964.
[10] T. Hori, S. Watanabe, and J. Hershey, “Joint ctc/attention decod-
ing for end-to-end speech recognition,” in Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), vol. 1, 2017, pp. 518–529.
[11] A. Graves, “Sequence transduction with recurrent neural net-
works,” arXiv preprint arXiv:1211.3711, 2012.
[12] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online
and linear-time attention by enforcing monotonic alignments,”
arXiv preprint arXiv:1704.00784, 2017.
[13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
“The kaldi speech recognition toolkit,” in IEEE 2011 workshop
on automatic speech recognition and understanding, no. EPFL-
CONF-192584. IEEE Signal Processing Society, 2011.
[14] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-
to-end convnet-based speech recognition system,” arXiv preprint
arXiv:1609.03193, 2016.
[15] Y. LeCun, Y. Bengio et al., “Convolutional networks for images,
speech, and time series,” The handbook of brain theory and neural
networks, vol. 3361, no. 10, p. 1995, 1995.
[16] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language
modeling with gated convolutional networks,” in International
Conference on Machine Learning, 2017, pp. 933–941.
[17] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter-based
speech recognition with gated convnets,” CoRR, 2017.
[18] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state trans-
ducers in speech recognition,” Computer Speech & Language,
vol. 16, no. 1, pp. 69–88, 2002.
[19] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016, pp. 2751–2755.
[20] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[21] S. P. Rath, D. Povey, K. Veselý, and J. Cernocký, “Improved feature processing for deep neural networks,” in Interspeech, 2013, pp. 109–113.
[22] Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training of deep neural network acoustic models using i-vectors,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 11, pp. 1938–1949, 2015.
[23] D. T. Toledano, L. A. H. Gómez, and L. V. Grande, “Automatic phonetic segmentation,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 617–625, 2003.
[24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

The Vicomtech-PRHLT Speech Transcription Systems for the


IberSPEECH-RTVE 2018 Speech to Text Transcription Challenge
Haritz Arzelus1, Aitor Álvarez1, Conrad Bernath1, Eneritz García1,
Emilio Granell2, Carlos-D. Martínez-Hinarejos2
1 Vicomtech, Human Speech and Language Technology Group, Spain
2 Pattern Recognition and Human Language Technologies Research Center,
Universitat Politècnica de València, Spain
[harzelus,aalvarez,cbernath,egarciam]@vicomtech.org, [egranell,cmartine]@dsic.upv.es

Abstract

This paper describes our joint submission to the IberSPEECH-RTVE Speech to Text Transcription Challenge 2018, which calls for automatic speech transcription systems to be evaluated on realistic TV shows. With the aim of building and evaluating systems, RTVE licensed around 569 hours of different TV programs, which were processed, re-aligned and revised in order to discard segments with imperfect transcriptions. This task reduced the corpus to 136 hours that we considered as nearly perfectly aligned audio and that we employed as in-domain data to train acoustic models.
A total of 6 systems were built and presented to the evaluation challenge, three systems per condition. These recognition engines are different versions, evolutions and configurations of two main architectures. The first architecture includes a hybrid LSTM-HMM acoustic model, where bidirectional LSTMs were trained to provide posterior probabilities for the HMM states. The language model corresponds to modified Kneser-Ney smoothed 3-gram and 9-gram models used for decoding and re-scoring of the lattices respectively. The second architecture includes an End-To-End based recognition system, which combines 2D convolutional neural networks as a spectral feature extractor from spectrograms with bidirectional Gated Recurrent Units as RNN acoustic models. A modified Kneser-Ney smoothed 5-gram model was also integrated to re-score the E2E hypotheses. All the systems' outputs were then punctuated using bidirectional RNN models with an attention mechanism and capitalized through recasing techniques.
Index Terms: speech recognition, deep learning, end-to-end speech recognition, recurrent neural networks

1. Introduction

The IberSPEECH-RTVE 2018 Speech to Text Transcription Challenge calls for Automatic Speech Recognition (ASR) systems that are robust against realistic TV shows. It is a notable trend that aims to bring ASR technology to different applications such as automatic subtitling or metadata generation in the broadcast domain. Although most of this work is still performed manually or through semiautomatic methods (e.g. re-speaking), the current state of the art in speech recognition suggests that this technology could start to be exploitable autonomously without any post-editing task, mainly on contents with optimal audio quality and clean speech conditions.
The use of Deep Learning algorithms in speech processing has made it possible to introduce this technology in such a complex scenario through the use of systems based on Deep Neural Networks (DNNs) or more recent architectures based on the End-To-End (E2E) principle.
Historically, ASR systems have made use of Hidden Markov Models (HMMs) to capture the time variability of the signal and Gaussian Mixture Models (GMMs) to model the HMM state probability distributions. However, numerous works have shown that DNNs in combination with HMMs can outperform traditional GMM-HMM systems at acoustic modeling on a variety of datasets [1]. More recently, new attempts have focused on building E2E ASR architectures [2], which directly map the input speech signal to grapheme/character sequences and jointly train the acoustic, pronunciation and language models as a whole unit [3, 4, 5, 6].
Nowadays, two main approaches predominate to train E2E ASR models. On the one hand, Connectionist Temporal Classification (CTC) is probably the most widely used criterion for systems based on characters [2, 7, 8], sub-words [9] or words [10]. It employs Markov assumptions and dynamic programming to efficiently solve sequential problems [2, 3, 7]. On the other hand, attention-based methods employ an attention mechanism to perform the alignment between acoustic frames and characters [4, 5, 6]. Unlike CTC, they do not require several conditional independence assumptions to obtain the label sequence probabilities, allowing extremely non-sequential alignments. Additionally, a number of enhancement techniques have been employed to improve the performance of these systems, such as Data Augmentation [11], Transfer Learning [12], Dropout [13] or Curriculum Learning [14], among others.
Our systems were constructed following both the DNN-HMM and E2E architectures, given that, depending on the available training data, one approach performed more robustly than the other on the development set. A total of 6 systems were presented to the evaluation challenge, three systems per condition (closed and open). Regarding the closed condition, in which only the data released by RTVE could be used to train and evaluate systems, two systems based on the DNN-HMM and one E2E system were presented. In contrast, the results obtained from two E2E based systems and one DNN-HMM system were submitted for the open condition. In all systems, the raw recognized output was punctuated, capitalized and normalized automatically using Recurrent Neural Models (RNNs), recasing techniques and rule-based heuristics respectively.
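As an illustrative complement to the CTC criterion discussed in the introduction, the following minimal PyTorch sketch shows how a CTC loss is applied to frame-level log-probabilities. All shapes, names and values here are placeholders chosen only to make the example runnable; they are not details of the authors' systems.

import torch
import torch.nn as nn

T, N, C = 50, 2, 30                                       # frames, batch size, output symbols (placeholders)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)      # stand-in for acoustic model outputs
targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # character label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

# CTC marginalizes over all monotonic alignments between the T frames and the labels.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())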
2. Corpus processing

Depending on the training condition, different datasets were used to train and evaluate the ASR systems.

2.1. RTVE2018 dataset

The RTVE2018 dataset was released by RTVE and comprises a collection of TV shows drawn from diverse genres and broadcast by the public Spanish National Television (RTVE) from 2015 to 2018. The number of hours provided in the original dataset as training and development sets is presented in Table 1.

Table 1: Duration of each partition of the original RTVE2018 dataset.

subset   duration
train    462 h. 9 min.
dev1     62 h. 23 min.
dev2     15 h. 13 min.
total    539 h. 45 min.

The main problem of this dataset was that a great amount of the audio had imperfect transcriptions and, therefore, could not be used as such for training and evaluation purposes. With the aim of recovering only the correctly aligned segments, a highly costly process was carried out, in which alignment and re-alignment techniques were applied first, followed by a manual revision and an automatic recognition task. The alignment and re-alignment processes consist of two steps. In the first step, we tried to align the original audios with their corresponding transcriptions using 4 different beam values (10; 100; 1,000 and 10,000). For the train partition, only 101 hours and 47 minutes were aligned after this initial step. Thus, 360 hours and 22 minutes were definitely discarded from any training process. A second re-alignment step was then performed over this new subset of 101 hours and 47 minutes, using beam and retry-beam values of 1 and 2 respectively, obtaining a total of 86 hours and 29 minutes of audio segments that were considered as nearly correctly aligned. The alignment processes were performed using a feed-forward DNN-HMM acoustic model trained with the Kaldi toolkit [15] and estimated over contents from the broadcast domain. Finally, a small partition of the nearly correctly aligned hours was revised manually, whilst the remainder was recognized using a different recognition architecture from the one employed for the alignments. In this case, the recognition was performed using an E2E based recognition system trained with the same contents from the broadcast domain. Only the recognition outputs that fit exactly to the reference were tagged as perfect segments. The same cleaning methodology was also applied on the dev1 and dev2 partitions.
The total number of hours discarded after the first step, and the hours tagged as nearly perfect and completely perfect, are summarized in Table 2. These hours correspond to audio segments that lasted more than one second, since the shorter ones were also discarded.

Table 2: RTVE2018 dataset after the alignment and re-alignment processes.

subset   alignment (discarded)   re-alignment (nearly perfect)   revision+E2E (perfect)
train    360 h. 22 min.          86 h. 29 min.                   56 h. 27 min.
dev1     7 h. 43 min.            44 h. 34 min.                   29 h. 39 min.
dev2     9 h. 27 min.            5 h. 8 min.                     4 h. 3 min.
total    377 h. 32 min.          136 h. 11 min.                  90 h. 9 min.

As can be seen in Table 2, a high number of hours were discarded from the original RTVE2018 dataset. In the end, a total of 136 hours and 11 minutes were considered as nearly perfect audio, whilst only 90 hours and 9 minutes can be considered perfectly aligned, including the train, dev1 and dev2 re-aligned partitions. These two subsets were finally used to build and tune the acoustic models (AM). The development set for the tuning of the systems was extracted from the completely perfect subset, as shown in Table 3.

Table 3: New perfectly and nearly perfectly aligned subsets for train and development. The 4 hours from dev correspond to the same contents in both subsets.

subset                     train            dev
Perfectly aligned          86 h. 9 min.     4 h.
Nearly perfectly aligned   132 h. 11 min.   4 h.

In terms of text data, a total of 3.5 million sentences and 61 million words were compiled. These data were used to estimate the language models (LM) and the punctuation and capitalization modules.

2.2. Open dataset

The open dataset was used to build the ASR systems for the open condition. In addition to the perfectly re-aligned subset from the RTVE2018 dataset, 4 different corpora were prepared for training. The SAVAS corpus [16] is composed of broadcast news contents from the Basque Country's public broadcast corporation EiTB (Euskal Irrati Telebista), and includes annotated and transcribed audios in both clean (studio) and noisy (outside) conditions. The Youtube RTVE Series corpus includes Spanish broadcast contents of RTVE shows and series gathered from the Youtube platform. The audio contents were downloaded along with the automatic transcriptions provided by the platform. These audios and their corresponding automatic transcriptions were then split and re-aligned following the same methodology as explained in Section 2.1. Finally, the Albayzin [17] and Multext [18] corpora were also included. The development set corresponded to the in-domain new dev partition shown in Table 3. The total amount of hours available for the open condition is summarized in Table 4.

Table 4: The open dataset description.

corpus         #hours
RTVE2018       132 h. 11 min.
SAVAS          160 h. 58 min.
Youtube RTVE   197 h. 13 min.
Albayzin       6 h. 5 min.
Multext        53 min.
Total          497 h. 20 min.

Regarding text data, data selection techniques were applied on general news data gathered from digital newspapers, using the LM created with the in-domain RTVE2018 text data as a reference. A total of 3.5 million sentences and 71 million words were selected with a maximum perplexity threshold value of 120. Hence, summing the in-domain and new text data, a total of 132 million words were employed to estimate the LM, and the punctuation and capitalization modules for the open condition.
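As an illustration of the perplexity-based selection described above, the following minimal sketch keeps only the sentences whose perplexity under the in-domain LM stays below the threshold used here (120). It assumes the KenLM Python bindings (Model.perplexity scores a single sentence) and uses placeholder file names, so it is a sketch of the idea rather than the authors' exact pipeline.

import kenlm   # assumes the KenLM Python bindings are installed

MAX_PPL = 120.0
lm = kenlm.Model("rtve2018_indomain.arpa")   # LM estimated on the in-domain text (placeholder path)

with open("newspaper_sentences.txt", encoding="utf-8") as fin, \
     open("selected_sentences.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        sentence = line.strip()
        if sentence and lm.perplexity(sentence) <= MAX_PPL:
            fout.write(sentence + "\n")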
3. Main architectures

Two main architectures were employed to build the systems for both closed and open conditions.

3.1. LSTM-HMM based systems

These systems include a bidirectional LSTM-HMM acoustic model and n-gram language models for decoding and rescoring purposes. The AMs and final graphs were estimated using the Kaldi toolkit. The AM corresponded to a hybrid LSTM-HMM implementation, where bidirectional LSTMs were trained to provide posterior probability estimates for the HMM states. This model was constructed with a sequence of 3 LSTM layers, using 640 memory units in the cell and 1024 fully connected hidden layer outputs. The number of steps used in the estimation of the LSTM state before prediction of the first label was fixed to 40 in both contexts. Furthermore, modified Kneser-Ney smoothed 3-gram and 9-gram models were used for decoding and re-scoring of the lattices respectively. Both LMs were estimated using the KenLM toolkit [19].

3.2. E2E based systems

The E2E systems were developed following the Deep Speech 2 architecture [2]. The core of the system is basically an RNN model, in which speech spectrograms are ingested and text transcriptions are provided as output.
Initially, a sequence of 2 layers of 2D convolutional neural networks (CNN) is employed as a spectral feature extractor from spectrograms. A 2D batch normalization function is then applied to the output of both layers, in addition to a hard tanh activation function. The E2E systems were set up with 5 bidirectional Gated Recurrent Unit (GRU) [20] layers as the RNN network. Each hidden layer is composed of 800 hidden units. After the bidirectional recurrent layers, a fully connected layer is applied as the last layer of the whole model. The output corresponds to a softmax function which computes a probability distribution over the characters. During the training process, the CTC loss function is computed to measure the error of the predictions, whilst the gradient is estimated using the backpropagation through time algorithm in order to update the network parameters. The optimizer is Stochastic Gradient Descent (SGD).
In addition, an external LM was integrated for decoding with the aim of rescoring the initial lattices. To this end, modified Kneser-Ney smoothed 5-gram models were estimated using the KenLM toolkit.

4. Systems descriptions

A total of 6 systems based on the above described architectures were submitted to the challenge, three systems per condition.

4.1. Closed condition

4.1.1. Primary system

The primary system submitted to the closed condition was called 'Vicomtech-PRHLT p-K1 closed' and it is a bidirectional LSTM-HMM based system combined with a 3-gram LM for decoding and a 9-gram LM for re-scoring lattices. The AM was trained for 10 epochs, with an initial and final learning rate of 0.0006 and 0.00006 respectively, using a mini-batch size of 100 and 20,000 samples per iteration. The AM was trained with the nearly perfectly aligned partition (see Table 3), which was 3-fold augmented through the speed-based augmentation technique. Each audio was transformed randomly depending on a modification parameter ranging between 0.9 and 1.1. A total of 396 hours and 33 minutes were therefore used for training. The LMs were estimated with the in-domain texts compiled from the RTVE2018 dataset.

4.1.2. Contrastive systems

The first contrastive system was called 'Vicomtech-PRHLT c1-K2 closed' and it was set up using the same configuration as the primary system, but the AM was estimated using the 3-fold augmented acoustic data of the perfectly aligned partition (see Table 3). A total of 258 hours and 27 minutes were employed for training.
The same data was used to build the second contrastive system, tagged as 'Vicomtech-PRHLT c2-E1 closed'. It was an E2E recognition system which follows the architecture described above, and it was evolved for 30 epochs. The LM was a 5-gram with an alpha value of 1.5 and a beam-width of 1000 during decoding.

4.2. Open condition

4.2.1. Primary system

The primary system of the open condition was called 'Vicomtech-PRHLT p-E1 open' and it was based on the E2E architecture described in Section 3.2. This system was an evolution of an already existing E2E model, which was built using the 3-fold augmented SAVAS, Albayzin, and Multext corpora for 28 epochs. This model reached a WER of 7.2% on a 4-hour test set of the SAVAS corpus.
For this challenge, it was evolved for 2 new epochs using the same corpora in addition to the 3-fold augmented nearly perfectly aligned corpus obtained from the RTVE2018 dataset (see Table 3). A total of 897 hours were used for training. The LM was a 5-gram trained with the text data from the open dataset, with an alpha value of 0.8 and a beam-width of 1000 during decoding.

4.2.2. Contrastive systems

The first contrastive system was called 'Vicomtech-PRHLT c1-E2 open' and, as the primary system, it was based on the previously explained E2E architecture. This system was also an evolution of the already existing E2E model, but in this case it was evolved for one epoch using the 3-fold augmented SAVAS, Albayzin, Multext, nearly perfectly aligned partition and Youtube RTVE corpora. The duration of the total amount of training audio was 1488 hours. The LM was a 5-gram trained with the text data from the open dataset, with an alpha value of 0.8 and a beam-width of 1000 during decoding.
The second contrastive system was composed of a bidirectional LSTM-HMM acoustic model combined with a 3-gram LM for decoding and a 9-gram LM for re-scoring lattices. The AM was evolved for 10 epochs, with an initial and final learning rate of 0.0006 and 0.00006 respectively, using a mini-batch size of 100 and 20,000 samples per iteration, and it was trained with the same data as the primary system of the open condition. The LMs were estimated with the text data from the open dataset.
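For reference, the following PyTorch sketch outlines an E2E topology of the kind described in Section 3.2: two 2D convolutions with batch normalization and hardtanh activations, five bidirectional GRU layers with 800 units, and a final fully connected layer producing character scores for CTC training. Kernel sizes, strides, the input feature dimension and the character inventory size are our own assumptions made only to keep the example runnable; they are not the authors' settings.

import torch
import torch.nn as nn

class E2ESketch(nn.Module):
    def __init__(self, n_feats=161, n_chars=40, hidden=800):
        super().__init__()
        # Two 2D convolutional layers over the spectrogram, each followed by
        # batch normalization and a hardtanh activation (assumed kernel/stride values).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        rnn_input = 32 * (n_feats // 4 + 1)   # frequency bins left after the two strided convolutions
        # Five bidirectional GRU layers with 800 hidden units per direction.
        self.rnn = nn.GRU(rnn_input, hidden, num_layers=5,
                          bidirectional=True, batch_first=True)
        # Final fully connected layer over the character inventory.
        self.fc = nn.Linear(2 * hidden, n_chars)

    def forward(self, spec):                       # spec: (batch, 1, freq, time)
        x = self.conv(spec)                        # (batch, channels, freq', time')
        b, c, f, t = x.size()
        x = x.view(b, c * f, t).transpose(1, 2).contiguous()   # (batch, time', features)
        x, _ = self.rnn(x)
        # Training would apply nn.CTCLoss to these log-probabilities (time-major), as in Section 3.2.
        return self.fc(x).log_softmax(dim=-1)      # (batch, time', chars)

model = E2ESketch()
out = model(torch.randn(2, 1, 161, 300))
print(out.shape)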

5. Results

The results obtained over the development set shown in Table 3 are presented in Table 5. The development set is composed of audio segments from all the TV shows included in the original RTVE2018 dataset and lasts a total of 4 hours.

Table 5: WER results for each submitted system over the generated development set.

type   system                         cond.    WER
P      Vicomtech-PRHLT p-K1 closed    Closed   22.6
C1     Vicomtech-PRHLT c1-K2 closed   Closed   22.8
C2     Vicomtech-PRHLT c2-E1 closed   Closed   26.6
P      Vicomtech-PRHLT p-E1 open      Open     20.7
C1     Vicomtech-PRHLT c1-E2 open     Open     20.5
C2     Vicomtech-PRHLT c2-K1 open     Open     22.0

5.1. Processing time and resources

The decodings of the 6 recognition systems were performed on an Intel Xeon CPU E5-2683v4 2.10 GHz 4xGPU server with 256 GB DDR4 2400 MHz RAM. Each GPU corresponds to an NVIDIA GeForce GTX 1080 Ti 11 GB graphics acceleration card.
Table 6 presents the processing time and computational resources needed by each submitted system for the decoding of the released test set of almost 40 hours of audio. It should be noted that the LSTM-HMM based systems were decoded using CPU cores, whilst the E2E systems took advantage of the GPU cards.

Table 6: Processing time and computational resources needed by each submitted system.

system                     RAM    CPU cores   GPU   Time
Vicom-PRHLT p-K1 close     12GB   20          -     24h
Vicom-PRHLT c1-K2 close    12GB   20          -     24h
Vicom-PRHLT c2-E1 close    4GB    8           7GB   12h
Vicom-PRHLT p-E1 open      5GB    8           7GB   8h
Vicom-PRHLT c1-E2 open     5GB    8           7GB   9h
Vicom-PRHLT c2-K1 open     19GB   12          -     40h

6. Conclusions

In this paper, the ASR systems submitted to the IberSPEECH-RTVE Speech to Text Transcription Challenge 2018 have been presented. In the beginning, one of the most costly tasks was the processing of the released RTVE2018 dataset, since a high number of transcriptions were imperfect or did not fit exactly to the related spoken audio. Furthermore, the type of contents posed a notable difficulty to the task, given that the TV shows included most of the main challenges for any speech recognition engine, including spontaneous speech, accents, noisy backgrounds and/or overlapped speakers, among others. Thus, the cleaning process of the dataset became a crucial task to exploit the data correctly.
Looking at the results obtained on the internally generated development test and presented in Table 5, it can be clearly deduced that the LSTM-HMM based systems performed better when fewer training data were available. In fact, the primary system in the closed condition achieved an error 4 percentage points lower than the E2E based second contrastive system. In this condition, it is also remarkable how the primary system, trained with nearly correctly aligned audios, achieved better results than the first contrastive LSTM-HMM based system, which was built with perfectly aligned contents, even if the primary system included more training data. This suggests that, in this case, exploiting more data, although not exactly aligned, helped the systems to perform better.
In the open condition, the E2E based systems achieved better results than the LSTM-HMM based one. This could be expected, since more training data were available to train the models. Even if the first contrastive system obtained a slightly better performance than the primary one, a qualitative evaluation of the results gave us the intuition that the primary system was more robust against spontaneous speech. In this sense, the alpha value (0.8), which defines the weight of the LM against the AM, of the E2E systems was lower than the alpha value (1.5) employed in the E2E system of the closed condition, given that the AM performed better and the global system obtained higher precision, especially with spontaneous speech.
Finally, it should be remarked that all the error rates achieved in this work are lower than, or at least in the range of, the reference WER values given in the evaluation plan. These WER values were obtained by commercial ASR systems over one TV show in the dataset, and ranged between 22% and 27% word error rate.

7. References

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
[3] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning - Volume 32, ser. ICML'14. JMLR.org, 2014, pp. II-1764–II-1772.
[4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[6] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5060–5064, 2016.
[7] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167–174.
[8] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[9] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-ctc: Automatic unit
selection and target decomposition for sequence labelling,” arXiv
preprint arXiv:1703.00096, 2017.
[10] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Na-
hamoo, “Direct acoustics-to-word models for english conver-
sational speech recognition,” arXiv preprint arXiv:1703.07754,
2017.
[11] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmen-
tation for speech recognition,” in Sixteenth Annual Conference of
the International Speech Communication Association, 2015.
[12] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of
Research on Machine Learning Applications and Trends: Algo-
rithms, Methods, and Techniques. IGI Global, 2010, pp. 242–
264.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural net-
works from overfitting,” The Journal of Machine Learning Re-
search, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] S. Braun, D. Neil, and S.-C. Liu, “A curriculum learning method
for improved noise robustness in automatic speech recognition,”
in Signal Processing Conference (EUSIPCO), 2017 25th Euro-
pean. IEEE, 2017, pp. 548–552.
[15] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.,
“The kaldi speech recognition toolkit,” in IEEE 2011 workshop
on automatic speech recognition and understanding, no. EPFL-
CONF-192584. IEEE Signal Processing Society, 2011.
[16] A. del Pozo, C. Aliprandi, A. Álvarez, C. Mendes, J. P. Neto,
S. Paulo, N. Piccinini, and M. Raffaelli, “Savas: Collecting, an-
notating and sharing audiovisual language resources for automatic
subtitling.” in LREC, 2014, pp. 432–436.
[17] F. Casacuberta, R. Garcia, J. Llisterri, C. Nadeu, J. Pardo, and
A. Rubio, “Development of spanish corpora for speech research
(albayzin),” in Workshop on International Cooperation and Stan-
dardization of Speech Databases and Speech I/O Assesment Meth-
ods, Chiavari, Italy, 1991, pp. 26–28.
[18] E. Campione and J. Véronis, “A multilingual prosodic database,”
in Fifth International Conference on Spoken Language Process-
ing, 1998.
[19] K. Heafield, “Kenlm: Faster and smaller language model queries,”
in Proceedings of the Sixth Workshop on Statistical Machine
Translation. Association for Computational Linguistics, 2011,
pp. 187–197.
[20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau,
F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase repre-
sentations using rnn encoder-decoder for statistical machine trans-
lation,” arXiv preprint arXiv:1406.1078, 2014.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

The Intelligent Voice ASR system for the Iberspeech 2018


Speech to Text Transcription Challenge

Nazim Dugan1, Cornelius Glackin1, Gérard Chollet1, Nigel Cannings1


1 Intelligent Voice Ltd, London, UK

nazim.dugan@intelligentvoice.com

Abstract

This paper describes the system developed by the Empathic team for the open set condition of the Iberspeech 2018 Speech to Text Transcription Challenge. A DNN-HMM hybrid acoustic model is developed, with MFCCs and iVectors as input features, using the Kaldi framework. The provided ground truth transcriptions for training and development are cleaned up using customized clean-up scripts and then realigned using a two-step alignment procedure which uses word lattice results coming from a previous ASR system. 261 hours of data are selected from the train and dev1 subsections of the provided data, by applying a selection criterion on the utterance-level scoring results. The selected data is merged with the 91 hours of training data used to train the previous ASR system, with a factor-3 data augmentation by reverberation using a noise corpus on the total training data, resulting in a total of 1057 hours of final training data. Selected text from the train and dev1 subsections is also used for new pronunciation additions and language model (LM) adaptation of the LM from the previous ASR system. The resulting model is tested using data from the dev2 subsection selected with the same procedure as the training data.
Index Terms: speech recognition, forced alignment, neural network

1. Introduction

Deep Neural Networks (DNNs) and Hidden Markov Model (HMM) based hybrid Automatic Speech Recognition (ASR) systems [1] are still widely used despite the recent rise of end-to-end speech recognition systems [2] based on recurrent neural networks (RNNs). An important part of the ASR acoustic model training procedure is the time alignment of the training audio with the transcriptions. Usually, the utterance start and end times of the training transcriptions are given to the ASR training system, and the utterance-, character- or phone-level alignments of the training text are computed by an iterative process, or with Connectionist Temporal Classification (CTC) [3]. If the given start and/or end time is wrong for an utterance, it is either discarded by the ASR training system or, if not, it reduces the accuracy of the ASR training. Therefore, input data preparation with accurate utterance-level time alignments is an important prerequisite for training an ASR system in either case.
Kaldi is a commonly used speech recognition framework which supports DNN-HMM hybrid systems with a couple of different DNN implementations. It also supports Gaussian Mixture Model (GMM) HMM hybrid systems [4], which are mostly used iteratively for the alignment of input transcripts with the input audio frames. State-of-the-art Kaldi recipes use a phonetic description of the transcripts and the GMM-HMM iterative phone alignment methodology, even though there is experimental support for character-based training and CTC alignment in the Kaldi framework. These new experimental recipes mostly produce ASR systems with slightly lower accuracy compared to the state-of-the-art recipes.
In this work, a DNN-HMM hybrid Kaldi recipe with GMM-HMM iterative phone alignment was used for training a European Spanish ASR system with 8 kHz down-sampled input audio and transcriptions from the Iberspeech 2018 Speech to Text Transcription Challenge training and development data, and four well known European Spanish corpora. A previous ASR system for European Spanish was used for the utterance-level time alignments of the input transcriptions provided for the Iberspeech 2018 Speech to Text Transcription Challenge training and development. These transcriptions are also used for the language model (LM) adaptation of the previous ASR model using the SRILM toolkit [6].

2. Data preparation

It was observed that the provided utterance-level time alignments of the training and development transcripts were not accurate, and some of the provided files do not have any time alignments. Word lattice results coming from a previous ASR system, trained using four well known European Spanish corpora (Albayzin [7], Dihana [8], CORLEC-EHU [9], TC-STAR [10]) and the text corpus El País, were used for the re-alignment of the training and development transcripts. The ASR system used to generate the word lattices was developed with the same input data as the ASR model described in [5]. However, an improved recipe of the Kaldi framework [11] is used, which utilizes a factor-3 data augmentation with reverberation and noise addition, high-resolution Mel Frequency Cepstral Coefficients (MFCCs), iVectors [12] and the nnet3 chain implementation. This previous ASR model will be referred to as the base model throughout the remaining text.
The provided training and development audio files are sub-sampled to 8 kHz before being used in the training and testing processes.
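As a small illustration of the 8 kHz down-sampling step mentioned above, the following sketch resamples a single file. The library choice (soundfile plus scipy) and the file names are our own assumptions for the example, not the authors' pipeline.

import soundfile as sf
from scipy.signal import resample_poly

TARGET_SR = 8000
audio, sr = sf.read("show_segment.wav")            # placeholder input file
if sr != TARGET_SR:
    audio = resample_poly(audio, TARGET_SR, sr)    # polyphase resampling to 8 kHz
sf.write("show_segment_8k.wav", audio, TARGET_SR)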

2.1. Two-step time alignment procedure

A two-step utterance-level time alignment procedure is used, which includes forced alignment of plain text transcripts and word-level time alignment using the NIST sclite ASR scoring utility in the Speech Recognition Scoring Toolkit (SCTK) [13].

2.1.1. Forced alignment of plain text transcripts

Time alignments in the srt-type subtitle and stm-type reference files of the train, dev1 and dev2 subsections of the provided training and development data are cleaned up to produce plain text transcriptions where utterances are separated by a newline character. An iterative forced alignment script [14], which accepts plain text transcripts and ASR word lattice results as input, is used to compute the utterance-level time alignments. Punctuation cleanup is also applied on the plain text transcripts in order to improve the accuracy of the alignment procedure, since ASR word lattice results do not involve any punctuation. In each iteration, the alignment script uses confidence regions of the results of the previous iteration to narrow down the search space. The number of iterations is configured as three from previous experience.

2.1.2. Word level time alignment

The results from the forced alignment procedure described in the previous section have been used to generate stm-type reference files to be used as the reference input to the NIST sclite ASR scoring utility. The ASR word lattice results are converted to ctm format, which is accepted by the sclite utility as a hypothesis file with word-alternative-level time information. The NIST sclite utility was called with the "-o sgml" option in order to generate word-level logs of correct words, substitutions, deletions and insertions, as seen in Table 1. The information in these logs is used to find the word-level timings of all the words in the reference files, using linear time interpolation for the deletions. The computed word-level timings are used to verify and update the utterance-level start and end times computed with the forced alignment script described in the previous section.

Table 1: Formatted output example of the NIST sclite ASR scoring utility with sgml-type output. Possible Type values are correct (C), substitution (S), deletion (D) and insertion (I). The Ref column contains the words from the stm reference file and the Hyp column the words from the ctm hypothesis file. Hyptime is the word start time in the hypothesis file and Reftime is the time computed for the reference word by the sclite utility.

Type   Ref      Hyp      Reftime   Hyptime
C      españa   españa   10.640    10.700
I      -        un       11.240    11.310
S      entre    entra    11.560    11.640
D      y        -        12.120    12.140

Word-level timings can be obtained with the sclite utility without using the utterance-level start and end times coming from the forced alignment procedure described in the previous section. However, the computation time in this case is on the order of days for a typical file from the training or development set, and the results are not as accurate as the results obtained by the two-step process discussed in this work.

2.2. Data selection for training and testing

Utterance-level time alignment information computed with the two-step procedure described in the previous section is used to convert the plain text transcripts into the stm reference format accepted by the NIST sclite utility, where each utterance was labeled as a different speaker in order to enable utterance-level ratios of correct, substituted, deleted and inserted words. The ASR best path results are used as the hypothesis input in ctm format. Based on the observation that the ratio of insertions is high for wrongly aligned utterances, the two criteria below are applied in order to select the utterances with correct alignments:

1) %insertions < (%correct + %substitutions + %deletions)
2) %correct > 0

The provided ground truth transcriptions are assumed to be correct, and no analysis was carried out to test this correctness. 278 hours of training data are selected from the train, dev1 and dev2 subsections of the provided data with the described data preparation and selection mechanism. 17 hours of data from the dev2 subsection are preserved for testing, and 261 hours of data from the train and dev1 subsections are used for training the ASR model.
The selected training data is combined with 91 hours of other European Spanish ASR training data from the Albayzin, Dihana, CORLEC-EHU and TC-STAR databases [5] to obtain 352 hours of training data for the ASR task.

3. ASR model training

The Kaldi framework [11] was used to train the acoustic model of the ASR system submitted to the Iberspeech 2018 Speech to Text Transcription Challenge by the Empathic team. A component diagram for ASR system training and testing is given in Figure 1.

3.1. Kaldi Framework

Kaldi is an open source toolkit for automatic speech recognition, intended for use by speech recognition researchers and professionals. It is composed of C++ binaries, utilities written in Bash, Python and Perl scripting languages, and ready-to-run training and testing recipes with data preparation steps for various languages and scenarios.

3.2. Lexicon update

The vocabulary and the rule-based phonetic descriptions of the base model are expanded using the out-of-vocabulary words detected in the new data selected from the provided training and development data. The final vocabulary size used in the ASR model training is 110124 words. However, for the testing with a language model, a subset of this vocabulary is used.
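To make the lexicon update of Section 3.2 concrete, the following minimal sketch collects the vocabulary of the newly selected transcriptions and lists the words missing from the base-model lexicon. File names are placeholders and the rule-based grapheme-to-phoneme step itself is not shown.

def load_words(path):
    """Collect a lowercase vocabulary from a whitespace-tokenized text file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.update(line.lower().split())
    return words

base_lexicon = load_words("base_lexicon_words.txt")     # one word per entry (placeholder file)
new_vocab = load_words("selected_transcriptions.txt")   # selected training text (placeholder file)
oov_words = sorted(new_vocab - base_lexicon)
print(f"{len(oov_words)} out-of-vocabulary words to add to the lexicon")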
3.3. Acoustic model building

The Kaldi Aspire recipe [15], which is the recipe used by the Johns Hopkins University team for the submission to the IARPA ASpIRE challenge, was used for building the DNN-HMM acoustic model. High-resolution MFCCs with 40 coefficients and iVectors are used as input features. The Gaussian posteriors used for the iVector estimation are based on the input features with Cepstral Mean and Variance Normalization (CMVN). GMM-HMM iterative phone alignment was used before starting the neural network training of the DNN-HMM training stage, in which the nnet3 chain implementation of the Kaldi framework was used with a frame-subsampling-factor of 3, reducing the number of output frames to 1/3 of the input frames. A 3-fold data augmentation is applied on the input acoustic features for the DNN-HMM training stage using the reverberation algorithm implemented in the Kaldi framework, with the RWCP, AIR and Reverb2014 noise databases, in order to create multi-condition data totalling 1057 hours.
A sub-sampled Time-Delay Neural Network (TDNN) [16] with 6 layers and with ReLU and pnorm activation functions is used. Details of the neural network architecture and the stochastic gradient descent (SGD) based greedy layer-wise supervised training can be found in the system submission article for the Kaldi Aspire recipe [17]. TDNN training time for 2 epochs and 508 iterations is 13 hours 30 minutes with 3 NVidia GPUs (Quadro K6000, GeForce Titan X, GeForce Titan XP). Last-iteration training and validation accuracies are 0.171631 and 0.199849 respectively. In order to avoid over-training, a selective system combination is carried out over all the iterations, skipping the first 100 iterations and considering the recorded accuracies of the individual iteration results.

3.4. Language model (LM) adaptation

A 3-gram LM of the base model described in Section 2 is adapted using the selected training transcriptions of the provided data for the Iberspeech 2018 Speech to Text Transcription Challenge. The selected training transcriptions are parsed with the ngram-count command of the SRILM utility up to order 3, and the calculated probabilities are mixed with the previous probabilities of the base model using the ngram command with a mixing coefficient of 0.1. The vocabulary size of the produced LM is about 67000.

3.5. Model testing

The resulting ASR model is tested using selected audio and transcriptions from the dev2 subsection of the provided Iberspeech 2018 training and development data. The data preparation and selection procedure described in Section 2 is also used for the preparation of the test data, since the development data also suffers from the mis-alignment problem.
Audio files are segmented using voice activity detection based on the adaptive thresholding method proposed by Otsu [18]. Feature extraction is applied on the produced segments in order to obtain high-resolution MFCCs and iVectors with a scaling factor of 0.75. The extracted features are processed by a Viterbi beam decoder considering LM 3-gram probabilities coupled with TDNN tri-phone output probabilities. A GPU implementation of the Kaldi nnet3-latgen-faster decoder is used with parameters --beam=15.0, --lattice-beam=2.5, --max-active=7000, --acoustic-scale=1.0, --frame-subsampling-factor=2. The produced ASR lattice output is post-processed using minimum Bayes risk (MBR) decoding [19] in order to find the best path result. The ASR best path transcription results with word-level timing information in ctm format are scored using the NIST sclite utility, where the ground truth transcriptions for the selected utterances of the dev2 subsection are used in stm reference format. Words from the ASR results which are not within the start-end time interval of any reference utterance are omitted from the scoring process. An average WER of 23.9 is obtained in the final model testing. The base model WER obtained using the same testing method and decoding parameters is 35.3.
The real-time factor of the decoding process, including VAD, feature extraction and lattice post-processing, is 0.022 using a 4-core Intel i7-4820K CPU @ 3.70GHz and a single NVidia GeForce GTX 1080Ti GPU.
The provided test data for the Iberspeech 2018 competition is processed with the produced ASR model using the same procedure described above, and the resulting transcriptions are submitted in plain text format.

[Figure 1: ASR system training / testing components.]

4. Discussion

The main challenge in the ASR model training and testing process with the provided data was the wrong or missing utterance-level time alignment information of the provided training and development data. Therefore, time alignment and selection of the provided training and development data was necessary prior to acoustic model training. Since only selected utterances are also used in the model testing stage, and ASR results which do not match the selected time intervals are omitted, the WER result obtained by the model testing process on the dev2 subsection of the provided data does not represent a WER result for all the data. If all the ground truth with true time alignments could be used, the obtained WER result would be expected to be higher than presented here, since a selective process is used in the current WER calculation. However, the WER result calculated in the model testing
process was helpful for benchmarking the produced ASR model against the base model, and for some other experiments carried out prior to the production of the final ASR model.
Much higher WER results are obtained (average WER 58.3 for the final produced model, average WER 105.7 for the base model) when they are calculated using all the provided ground truth text of the dev2 subsection, ignoring the provided (wrong) time information and using txt-formatted ASR hypothesis files. A detailed analysis of the word-level sclite logs shows that these values are not reliable because of the mis-alignment problems that arise when the sclite utility is used without time information on such long reference and hypothesis files. This observation is the basis for the necessity of the two-step time alignment process used in this work prior to model training and testing.
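For reference, WER is computed as (substitutions + deletions + insertions) divided by the number of reference words, which is why values above 100% are possible when insertions dominate, as for the base model above. A minimal sketch of the underlying edit-distance computation follows (illustrative only; sclite itself was used in this work):

def wer(ref_words, hyp_words):
    """Word error rate via edit distance: (S + D + I) / len(ref)."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[-1][-1] / max(1, len(ref_words))

print(wer("españa entre y tú".split(), "españa un entra tú".split()))   # 0.75 on this toy pair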
A different value of the frame-subsampling-factor is chosen for model testing compared to the training process (frame-subsampling-factor=3 in training and frame-subsampling-factor=2 in testing), since it yields more accurate results when testing audio with Viterbi decoding using a LM.
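Finally, as a concrete reference for the utterance-selection rule of Section 2.2 that determines which segments enter both the training and the testing subsets discussed above, the following minimal sketch applies the two criteria to per-utterance counts parsed from the sclite output. The dictionary field names are our assumptions about such a parsed structure; since all sclite percentages share the reference word count as denominator, comparing raw counts is equivalent to comparing percentages.

def keep_utterance(stats):
    """stats: dict with per-utterance word counts parsed from the sclite output."""
    correct = stats["correct"]
    subs = stats["substitutions"]
    dels = stats["deletions"]
    ins = stats["insertions"]
    # Criterion 1: insertions must not outweigh the other word categories.
    # Criterion 2: at least one word must be recognized correctly.
    return ins < (correct + subs + dels) and correct > 0

utterances = [
    {"id": "utt1", "correct": 12, "substitutions": 2, "deletions": 1, "insertions": 3},
    {"id": "utt2", "correct": 0, "substitutions": 4, "deletions": 2, "insertions": 9},
]
selected = [u["id"] for u in utterances if keep_utterance(u)]
print(selected)   # ['utt1']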

5. Conclusion
The acoustic model building process, with data preparation and selection based on a two-step time alignment procedure and utterance-level thresholding on WER values, yielded a well-performing acoustic model when the new training data was merged with the training data of the base model. The two-step time alignment procedure, together with the utterance-level data selection mechanism described in Section 2, enabled the use of the provided data for the acoustic model training step of the ASR system generation. Model testing with the development data, using an adapted version of the base LM, showed a significant reduction in WER compared to the base model results used in the experiments.

6. Acknowledgements
The research leading to the results presented in this paper has been (partially) funded by the EU H2020 research and innovation programme under grant number 769872.

7. References
[1] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. W. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] A. Graves, N. Jaitly, “Towards end-to-end speech
recognition with recurrent neural networks,” ICML'14
Proceedings of the 31st International Conference on
International Conference on Machine Learning,
Conference, June 21 - 26, Beijing, China, Proceedings,
2014, pp. II-1764-II-1772
[3] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber,
“Connectionist temporal classification: labelling
unsegmented sequence data with recurrent neural
networks,” in Proceedings of the 23rd international
conference on Machine learning. ACM, 2006, pp. 369–376.
[4] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, “Sequence-
discriminative training of deep neural networks.” in
Proceedings of INTERSPEECH, 2013, pp. 2345–2349.
[5] A. L. Zorrilla, N. Dugan, M. I. Torres, C. Glackin, G.
Chollet and N. Cannings, “Some ASR experiments using
Deep Neural Networks on Spanish databases,” IberSpeech
2016 Conference, November 23 – 25, Lisbon, Portugal,
Proceedings, 2016, pp. 149-158
[6] SRILM Toolkit,
http://www.speech.sri.com/projects/srilm
[7] Asunción Moreno, Dolors Poch, Antonio Bonafonte, Eduardo Lleida, Joaquim Llisterri, José B. Mariño, and Climent Nadeu, “Albayzin speech database: design of the phonetic corpus,” in EUROSPEECH. 1993, ISCA.
[8] José Miguel Benedí, Eduardo Lleida, Amparo Varona, María José Castro, Isabel Galiano, Raquel Justo, Iñigo López de Letona, and Antonio Miguel, “Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: Dihana,” in Fifth LREC, 2006, pp. 1636–1639.
[9] Luis J. Rodríguez and M. Inés Torres, “Spontaneous speech events in two speech databases of human-computer and human-human dialogs in Spanish,” Language and Speech, vol. 49, no. 3, pp. 333–366, 2006.
[10] Henk van den Heuvel, Khalid Choukri, Christian Gollan,
Asuncion Moreno, Djamel Mostefa: "TC-STAR: New
language resources for ASR and SLT purposes” LREC 2006
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O.
Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P.
Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi
Speech Recognition Toolkit, IEEE 2011 Workshop on
Automatic Speech Recognition and Understanding, Hawaii,
IEEE Signal Processing Society, 2011.
[12] M. Karafiat, L. Burget, P. Matejka, O. Glembek, and J.
Cernocky, “iVector-based discriminative adaptation for
automatic speech recognition,” in 2011 IEEE Workshop on
Automatic Speech Recognition & Understanding. IEEE,
Dec. 2011, pp. 152–157
[13] Speech Recognition Scoring Toolkit (SCTK), National
Institute of Standards and Technology, US Department of
Commerce.
[14] N. Dugan, Forced alignment Python script, Intelligent Voice
LTD, https://github.com/IntelligentVoice/Aligner
[15] Kaldi aspire recipe: https://github.com/kaldi-
asr/kaldi/tree/master/egs/aspire
[16] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay
neural network architecture for efficient modeling of long
temporal contexts,” in Proceedings of INTERSPEECH,
2015.
[17] V. Peddinti, G. Chen, V. Manohar, T. Ko, D. Povey, and S.
Khudanpur, “JHU ASpIRE system: Robust LVCSR with


GTM-UVIGO System for Albayzin 2018 Speech-to-Text Evaluation


Laura Docio-Fernandez, Carmen Garcia-Mateo

atlanTTic Research Center, Multimedia Technologies Group, University of Vigo, Spain


ldocio@gts.uvigo.es, carmen.garcia@uvigo.es

Abstract

This paper describes the Speech-to-Text system developed by the Multimedia Technologies Group (GTM) of the atlanTTic research center at the University of Vigo for the Albayzin Speech-to-Text Challenge (S2T) organized in the IberSpeech 2018 conference. The large vocabulary automatic speech recognition system is built using the Kaldi toolkit. It uses a hybrid Deep Neural Network - Hidden Markov Model (DNN-HMM) for acoustic modeling, and a rescoring of the trigram-based word lattices obtained in a first decoding stage with either a fourgram language model or a language model based on a recurrent neural network. The system was evaluated only on the open set training condition.
Index Terms: automatic speech recognition, deep neural networks, language model, speech activity detection

1. Introduction

Automatic Speech Recognition (ASR) is essential in many applications, such as dictation, transcription and captioning apps, speech-controlled interfaces, search engines for large multimedia archives, speech-to-speech translation, etc.

These days, with increases in computing power, the use of deep neural networks has spread to many fields, including ASR, and their use has greatly improved the performance of automatic speech recognition. Current ASR systems are predominantly based on acoustic hybrid Deep Neural Network-Hidden Markov Models (DNN-HMMs) [1] and the n-gram language model [2] [3]. However, in recent years, Neural Network Language Models (NNLM) have begun to be applied [4] [5] [6]. In these, words are embedded in a continuous space, in an attempt to map the semantic and grammatical information present in the training data and, in this way, to achieve better generalization than n-gram models. The depth of the created network (the number of hidden layers), together with the ability to model a large number of context-dependent states, results in a reduction in Word Error Rate (WER). The type of neural network most often used in language modelling is the recurrent neural network (RNNLM). The recurrent connections present in these networks allow the modelling of long-range dependencies that improve the results obtained by n-gram models. In more recent work, recurrent network topologies such as LSTM (Long Short-Term Memory) [7] have also been applied [8][9][10].

In this paper, we present a summary of the GTM efforts in developing speech-to-text technology for the Spanish language and its evaluation in the IberSPEECH-RTVE Speech to Text Transcription challenge. The submitted ASR system was evaluated on the open set training condition.

The paper is organized as follows: in Section 2 the ASR system is described. Section 3 presents the data used to train the acoustic and language models. Section 4 describes the submitted systems' specific characteristics, and finally Section 5 offers some final conclusions.

2. The GTM-UVIGO ASR system

This section describes the main blocks that comprise the ASR system.

2.1. Speech activity detection

The speech activity detection approach developed in the proposed system has four main stages. First, a voice activity detection (VAD) based on Gaussian mixture models (GMMs) is applied to the audio signal in order to discard the non-speech intervals. Next, a two-step audio segmentation approach, based on the Bayesian information criterion (BIC) [11], is carried out. Once the audio segmentation output is obtained, those segments that are classified as music by a logistic regression classifier are discarded; this classifier relies on the i-vector paradigm for audio representation, as done in [12][13].

The above stages use as acoustic features 19 Mel-frequency cepstral coefficients (MFCCs) plus energy, and a cepstral mean subtraction using a sliding window of 300 ms is applied.

2.2. Acoustic modeling

The acoustic models use a hybrid DNN-HMM modeling strategy with a neural network based on Dan Povey's implementation in Kaldi [14]. This implementation uses a multi-spliced TDNN (Time Delay Neural Network) feed-forward architecture to model long-term temporal dependencies and short-term voice characteristics. The inputs to the network are 40 Mel-frequency cepstral coefficients extracted in the Feature Extraction block with a sampling frequency of 16 kHz. In each frame, we aggregate a 100-dimensional iVector to the 40-dimensional MFCC input.

The topology of this network consists of an input layer followed by 5 hidden layers with 1024 neurons with RELU activation function. Asymmetric input contexts were used, with more context in the past, which reduces the latency of the neural network in on-line decoding and also seems to be more efficient from a WER perspective. Asymmetric contexts of 13 frames in the past and 9 frames in the future were used. Figure 1 shows the topology used and Table 1 the layerwise context specification corresponding to this TDNN.

Table 1: Context specification of TDNN in Figure 1

    Layer   Input context
    1       [-2,+2]
    2       [-1,2]
    3       [-3,3]
    4       [-7,2]
    5       {0}
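To make the layer-wise splicing of Table 1 concrete, the following minimal sketch (not part of the submitted system; it only restates the contexts listed above) accumulates the per-layer input contexts to obtain the overall temporal context seen by the network:

    # Minimal sketch: accumulate the per-layer splice contexts of Table 1
    # to get the overall temporal context of the TDNN.
    contexts = [(-2, 2), (-1, 2), (-3, 3), (-7, 2), (0, 0)]  # layers 1-5

    left = sum(lo for lo, _ in contexts)
    right = sum(hi for _, hi in contexts)
    print(f"overall input context: [{left}, +{right}] frames")
    # prints [-13, +9], matching the asymmetric 13-past / 9-future context
    # described in Section 2.2.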

Figure 1: TDNN used in acoustic models [14]

2.3. Language modeling

In terms of the language models, when working with n-gram LMs the model was trained using the SRI Language Modeling Toolkit. N-gram models of order 3 and 4 were used, that is, trigrams and fourgrams. The modified Kneser-Ney discounting of Chen and Goodman has also been applied, together with a weight interpolation with lower orders [15].

For training the RNNLMs, the Kaldi RNNLM [16] software was also used. The neural network language model is based on an RNN with 5 hidden layers and 800 neurons, where TDNN layers with RELU activation function and LSTM layers are combined. The training is performed using Stochastic Gradient Descent (SGD) over several epochs (in our case, 20 epochs). All RNNLM models have been trained with the same material as in the case of the n-gram statistical models. Figure 2 shows the topology of the network used.

Figure 2: RNNLM topology used in the ASR system.

2.4. Recognition process

The recognition process is developed using the Kaldi toolkit [16]. The ASR system is based on two decoding stages to obtain the text transcription of the input speech signal. In the first decoding stage a lattice is obtained. This lattice is created using a 3-gram language model. In the second decoding stage, a language model rescoring is applied on this lattice. Figure 3 shows a block diagram of the recognition process.

Figure 3: GTM-UVIGO ASR System
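As a rough illustration of the two-stage decoding described above (a simplified n-best re-ranking sketch, not the actual Kaldi lattice-rescoring implementation used by the authors; the hypothesis tuples, scorer and weight are placeholders), each first-pass hypothesis keeps its acoustic score while its language model score is replaced by that of the stronger 4-gram or RNN language model:

    # Simplified sketch of second-pass LM rescoring on an n-best list.
    def rescore(nbest, new_lm_score, lm_weight=0.8):
        """nbest: list of (words, acoustic_score, first_pass_lm_score) tuples."""
        best_total, best_words = float("-inf"), None
        for words, am_score, _old_lm in nbest:
            # keep the acoustic score, swap in the 4-gram / RNNLM score
            total = am_score + lm_weight * new_lm_score(words)
            if total > best_total:
                best_total, best_words = total, words
        return best_words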
3. Data resources

As stated in Section 1, the submitted ASR system was evaluated on the open set training condition. It uses acoustic models trained with data not in the RTVE2018DB, and language models trained with both text data in the RTVE2018DB and external text data.

Next, the data resources used for training the models are described.

3.1. Audio corpora

The data used for acoustic model training came from the following corpora:
- 2006 TC-STAR speech recognition evaluation campaign [17]: 79 hours of speech in Spanish.
- Galician broadcast news database Transcrigal [18]: 30 hours of speech in Galician.
It must be noted that all the non-speech parts, as well as the speech parts corresponding to transcriptions with pronunciation errors, incomplete sentences and short speech utterances, were discarded, so in the end the acoustic training material consisted of approximately 109 hours (79 hours in Spanish and 30 hours in Galician).

3.2. Text corpora

The following text corpora were used to train the LMs.
- A text corpus of approximately 90M words composed of material from several sources: transcriptions of the European and Spanish Parliaments from the TC-STAR database, subtitles, books, newspapers, online courses and the transcriptions of the Mavir sessions included in the development set of the Albayzin 2016 Spoken Term Detection Evaluation (MAVIR was a project funded by the Madrid region that coordinated several research groups and companies working on information retrieval, http://www.mavir.net). The vocabulary size of this corpus is approximately 250K words.
- The RTVE subtitles provided by the organizers of the evaluation. This text comprises approximately 60M words and its vocabulary size is approximately 173K words.
Table 2 shows the main characteristics of these text resources.

Table 2: Characteristics of the text resources

                          Training text size    Vocabulary size
    TC-STAR and Others    90M                   250K
    RTVE subtitles        60M                   173K

From these text corpora, three LMs have been trained:
- 3-gram LM: a trigram language model obtained by linear interpolation of two single trigram models. Each single model was trained using one of the above text resources.
- 4-gram LM: a fourgram language model obtained by linear interpolation of two single fourgram models. Each single model was trained using one of the above text resources.
- RNNLM: an RNN language model trained with all the text described above.
The lexicon size of these models was approximately 312K words. The phonetic transcription of the lexicon words was automatically generated using the phonetic transcriber that forms part of the Cotovía GTM-UVIGO Text-to-Speech system [19].
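The linear interpolation used to combine the two corpus-specific n-gram models can be pictured with the short sketch below (a hedged illustration only; the interpolation weight and probability values are placeholders, and the actual models were estimated and combined with SRILM):

    # Illustrative linear interpolation of two LM probabilities
    # for the same word given the same history.
    def interpolate(p_lm1, p_lm2, weight=0.5):
        return weight * p_lm1 + (1.0 - weight) * p_lm2

    p_combined = interpolate(p_lm1=0.012, p_lm2=0.020, weight=0.6)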
4. Recognition results

This section presents the performance of the GTM-UVIGO ASR system on the development data provided by the organizers of the competition. Two systems that differ in the language model used in the rescoring stage were evaluated. The primary system uses the 4-gram LM and the contrastive system uses the RNNLM, both described in Sections 2 and 3. The results obtained on the development set with the primary and contrastive systems are shown in Table 3. The table shows the average Word Error Rate (WER) by TV show and also the global average WER.

Table 3: Average WER on Development set.

                  Primary               Contrastive
    TV show       Dev1       Dev2       Dev1       Dev2
    20H           12.13%     -          12.86%     -
    AP            22.82%     -          24.30%     -
    CA            53.95%     -          54.35%     -
    LM            30.81%     -          32.48%     -
    LN24H         27.22%     28.85%     28.48%     29.85%
    millennium    -          25.52%     -          24.78%
    Average       26.65%     25.30%     27.61%     26.47%
    Average       26.37%                27.37%

5. Conclusions and future work

In this paper, we have presented two ASR systems submitted to the IberSPEECH-RTVE Speech to Text Transcription challenge. The difference between these two systems lies in the language models used in the decoding stage. The primary system uses a 4-gram language model for rescoring and the contrastive system uses a language model based on recurrent neural networks.

As a future line of development we plan to improve the acoustic and language models of the ASR system, as well as to enrich their output with punctuation marks and correct capitalization.

6. Acknowledgements

This work has received financial support from the Spanish Ministerio de Economía y Competitividad through project 'TraceThem' (TEC2015-65345-P), from the Xunta de Galicia (Agrupación Estratéxica Consolidada de Galicia accreditation 2016-2019) and the European Union (European Regional Development Fund ERDF).

7. References

[1] G. Hinton, L. Deng, D. Yu, and Y. Wang. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 82-97.
[2] J. T. Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language, vol. 15, no. 4, pp. 403-434.
[3] D. Jurafsky and J. H. Martin. 2008. Speech and Language Processing: An Introduction to Language Processing, Computational Linguistics, and Speech Recognition.
[4] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky. 2011. RNNLM - recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pp. 196-201.
[5] E. Arisoy, T. N. Sainath, B. Kingsbury, and B. Ramabhadran. 2012. Deep Neural Network Language Models. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pp. 20-28, Stroudsburg, PA, USA. Association for Computational Linguistics.
[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.
[7] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
[8] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur. 2018. A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition. In ICASSP.
[9] M. Sundermeyer, Z. Tuske, R. Schluter, and H. Ney. 2014. Lattice decoding and rescoring with long-span neural network language models. In Fifteenth Annual Conference of the International Speech Communication Association.
[10] X. Chen, X. Liu, A. Ragni, Y. Wang, and M. Gales. 2017. Future word contexts in neural network language models. arXiv preprint arXiv:1708.05592.
[11] M. Cettolo and M. Vescovi. 2003. Efficient audio segmentation algorithms based on the BIC. In Proceedings of ICASSP, vol. VI, pp. 537-540.
[12] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. 2014. GTM-UVigo system for Albayzin 2014 audio segmentation evaluation. In Iberspeech 2014: VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop.
[13] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. 2016. GTM-UVigo System for Albayzin 2016 Speaker Diarisation Evaluation. In Third International Conference Iberspeech (IberSpeech 2016).

[14] V. Peddinti, D. Povey, and S. Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of INTERSPEECH 2015.
[15] A. Stolcke. 2002. SRILM - An extensible language modeling toolkit. In Proceedings of the International Conference on Statistical Language Processing, Denver, Colorado.
[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. 2011. The Kaldi Speech Recognition Toolkit. In ASRU.
[17] L. Docío, A. Cardenal, and C. García. 2006. TC-STAR 2006 automatic speech recognition evaluation: The UVigo system. In Proc. of TC-STAR Workshop on Speech-to-Speech Translation, ELRA, Paris, France.
[18] C. García, J. Tirado, L. Docío, and A. Cardenal. 2004. Transcrigal: A bilingual system for automatic indexing of broadcast news. In IV International Conference on Language Resources and Evaluation.
[19] E. Rodríguez Banga, C. García Mateo, F. J. Méndez Pazó, M. González, and C. Magariños Iglesias. 2012. Cotovía: an open source TTS for Galician and Spanish. In IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop.


Topic coherence analysis for the classification of Alzheimer’s disease


Anna Pompili1,2 , Alberto Abad1,2 , David Martins de Matos1,2 , Isabel Pavão Martins3,4

1 L2F - Spoken Language Systems Laboratory - INESC-ID Lisboa
2 Instituto Superior Técnico, Universidade de Lisboa, Portugal
3 Laboratório de Estudos de Linguagem, Faculty of Medicine, University of Lisbon, Portugal
4 Instituto de Medicina Molecular, Lisbon, Portugal
anna@l2f.inesc-id.pt

Abstract

Language impairment in Alzheimer's disease is characterized by a decline in the semantic and pragmatic levels of language processing that manifests since the early stages of the disease. While semantic deficits have been widely investigated using linguistic features, pragmatic deficits are still mostly unexplored. In this work, we present an approach to automatically classify Alzheimer's disease using a set of pragmatic features extracted from a discourse production task. Following the clinical practice, we consider an image representing a closed domain as a discourse's elicitation form. Then, we model the elicited speech as a graph that encodes a hierarchy of topics. To do so, the proposed method relies on the integration of various NLP techniques: syntactic parsing for sentence segmentation into clauses, coreference resolution for capturing dependencies among clauses, and word embeddings for identifying semantic relations among topics. According to the experimental results, pragmatic features are able to provide promising results distinguishing individuals with Alzheimer's disease, comparable to solutions based on other types of linguistic features.
Index Terms: natural language processing, topic coherence, Alzheimer's disease

1. Introduction

In 2006 the prevalence of Alzheimer's disease (AD) was 26.6 million people worldwide. Due to the increase of average lifespan, it is expected that this figure will quadruple by 2050, affecting 1 in 85 persons worldwide [1]. No treatments stop or reverse the progression of the disease, though some may temporarily improve the symptoms. AD is currently diagnosed through an analysis of the patient history and through neuropsychological tests assessing cognitive decline in different domains (memory, reasoning, language, and visuospatial abilities). In fact, although the prominent symptom of the disease is memory impairment, language problems are also prevalent and existing literature confirms they are an important factor [2, 3, 4, 5, 6].

Impairments in language abilities are usually the result of a decline either in the semantic or pragmatic levels of language processing. Semantic processing is related with the content of language and involves words and their meaning. Deficits in this domain are typically characterized by difficulties in word finding, word comprehension, semantic paraphasia, and by the use of a reduced vocabulary. Pragmatic processing, on the other hand, is concerned with the inappropriate use of language in social situations. Deficits in this domain may include difficulties in understanding questions, in following conversations, and in identifying the key points of a story, getting lost in the details. It is likely that the semantic and pragmatic levels are interdependent, and that semantic deficits in word finding may contribute to pragmatic deficits that lead, for example, to the problem of maintaining the topic of a conversation [7].

In recent years, there has been a growing interest from the research community in the computational analysis of language impairment in AD. Overall, existing studies targeted the automatic assessment of lexical, syntactical, and semantic deficits through an extensive amount of linguistic features [8, 9, 10]. More recently, semantic changes have also been investigated through vector space model representations [11, 12]. On the other hand, to our knowledge, there are no works facing language impairments at a higher level of processing, considering macro-linguistic aspects of discourse production such as cohesion and coherence. While cohesion expresses the semantic relationship between elements, coherence is related to the conceptual organization of speech, and may be analyzed through the study of local, global, and topic coherence.

In this work, we investigate the possibility of automatically discriminating AD by exploring a novel approach based on the analysis of topic coherence. To this end, we model discourse transcripts into graphs encoding a hierarchy of topics on which we compute a relatively small set of pragmatic features. In the following, in Section 2, topic coherence analysis is briefly introduced, followed by an overview of the current state of the art for AD classification. Then, in Sections 3 and 4, we present the dataset used in this study and a description of our methodology. Finally, the features related with topic coherence and the classification results are reported in Sections 5 and 6, respectively. Conclusions are summarized in Section 7.

2. Related work

2.1. Introduction to topic coherence analysis

The notion of topic and subtopic was introduced in 1991 in the work of Mentis & Prutting [13], whose target was an analysis of topic introduction and maintenance. A topic was defined as a clause identifying the question of immediate concern, while a subtopic is an elaboration or expansion of one aspect of the main topic.

Several years later, Brady et al. [14] analyzed topic coherence and topic maintenance in individuals with right hemisphere brain damage. This work extends the one of Mentis & Prutting [13] with the inclusion of the notion of sub-subtopic and sub-sub-subtopic. Topic and sub-divisional structures were further categorized as new, related, or reintroduced.

In a following study, Mackenzie et al. [15] used discourse samples elicited through a picture description task to determine the influences of age, education, and gender on the concepts and topic coherence of 225 healthy adults. Results confirmed education level as a highly important variable affecting the performance of healthy adults.
More recently, Miranda [16] investigated the influence of education in the macro-linguistic dimension of discourse evaluation, considering concepts analysis, local, global and topic coherence, and cohesion. The study was performed on a population of 87 healthy, elderly Portuguese participants. Results corroborated the ones obtained by Mackenzie et al. [15], confirming the effect of literacy in this type of analysis.

2.2. Computational work for AD classification

From a computational point of view, language impairment in AD has been extensively assessed through the analysis of linguistic and acoustic features. Among the various works, we mention the study of Fraser et al. [8], where the authors considered more than 350 features to capture lexical, syntactic, grammatical, and semantic phenomena. Using a selection of 35 features the authors achieved state-of-the-art classification accuracy of over 81% in distinguishing individuals with AD.

Yancheva et al. [11] presented a generalizable method to automatically generate and evaluate the information content conveyed by the description of the Cookie Theft picture. The authors created two cluster models, one for each group, from which they extracted different semantic features. Classification accuracy results achieved an F-score of 0.74. By combining semantic features with the set of lexicosyntactic features used in the work of Fraser et al. [8] the F-score improves to 0.80.

To the extent of our knowledge, the first work approaching coherence and cohesion computationally is the study of dos Santos et al. [17], although the target of the authors is the detection of Mild Cognitive Impairment (MCI). To this purpose, they model discourse transcripts with a complex network enriched with word embeddings. Classification is performed using topological metrics of the network and linguistic features, among which referential cohesion. Accuracy varies among 52%, 65%, and 74%, depending on the dataset used.

3. The Cookie Theft corpus

Data used in this work are obtained from the DementiaBank database (https://dementia.talkbank.org), which is part of the larger TalkBank project [18, 19]. The collection was gathered in the context of a yearly, longitudinal study; demographic data, together with the education level, are provided. Participants included elderly controls, people with MCI and different types of dementia. Among other assessments, participants were required to provide the description of the Cookie Theft picture, shown in Figure 1. Each speech sample was recorded and then manually transcribed at the word level following the TalkBank CHAT (Codes for the Human Analysis of Transcripts) protocol [20]. Data are in English.

For the purposes of this study, only participants with a diagnosis of AD were selected, resulting in 234 speech samples from 147 patients. Control participants were also included, resulting in 241 speech samples from 98 speakers.

Figure 1: The Cookie Theft picture, from the Boston Diagnostic Aphasia Examination [21].

4. Modeling discourse transcripts as a hierarchy of topics

The topics used during discourse production may be subject to an internal, structural organization in order to achieve an information hierarchy. This organizational structure allows a gradual organization of information that is essential for effective communication [22]. In fact, being important for both the speaker and the listener, this type of organization highlights the key concepts and indicates the degrees of importance and relevance within the discourse. Mackenzie et al. [15] in their work provided an example of a topic hierarchy based on the Cookie Theft picture description task, which was later extended in the study of Miranda [16]. To better understand the problem at hand, an excerpt of this hierarchy is also reported in Figure 2.

Figure 2: An excerpt of a topic hierarchy for the Cookie Theft picture found in the work of Miranda [16] (text was translated from Portuguese).

The number of topics that can be described observing the Cookie Theft picture is somehow limited to the concepts that are explicitly represented in the scene, and to those that can be inferred from them (e.g., climate). Taking this into account, the problem of building a topic hierarchy from a transcript can be modeled with a semi-supervised approach in which a predefined set of topic clusters is used to guide the assignment of a new topic to a level in the hierarchy.

Both for the creation of the topic clusters and for the analysis of a new discourse sample, a multistage approach is used to prepare, enhance, and transform the original transcriptions into a representation suitable for the subsequent analysis. Initially the transcriptions are preprocessed, then syntactic information is used to separate sentences into clauses and to identify coreferential expressions. Finally, we compute the vector representation of each clause by averaging the embeddings extracted for each word. To build the graph representing the topic hierarchy we develop an algorithm based on the cosine similarity that first evaluates the membership of a clause in the topic clusters, and then assigns the clause to a node in the hierarchy. We account for new and repeated topics. Each stage of this process is better described in the following sections.

4.1. Preprocessing

The Cookie Theft corpus provides the textual transcriptions of the participants' speech samples together with the morphological analysis and an extensive set of manual annotations (i.e., disfluencies, pauses, repetitions, and other more complex events). Among these, retracing and reformulation are used to indicate abandoned sentences where the speaker starts to say something, but then stops. While in the former the speaker may maintain the same idea changing the syntax, the latter involves a complete restatement of the idea. In order to prepare the transcriptions for the next stage of the pipeline, all the annotations were removed, and in the case of a retrace or a reformulation, also the text marked as being substituted was ignored. Additionally, disfluencies were disregarded and contractions were expanded to their canonical form. At this stage of processing, stopwords were not removed. Once the preprocessing phase is concluded, Part-of-Speech (POS) tags are automatically generated using the lexicalized probabilistic parser of Stanford University [23].
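A minimal sketch of this cleanup step is given below (not the authors' script; the regular expressions stand in for CHAT-style retrace/annotation markers and the contraction table is a tiny illustrative example):

    # Hedged sketch of the Section 4.1 cleanup: strip annotation markers,
    # drop retraced material, and expand a few contractions.
    import re

    CONTRACTIONS = {"it's": "it is", "she's": "she is", "don't": "do not"}

    def preprocess(utterance):
        # drop retraced/reformulated material marked as <...> [/] or [//]
        utterance = re.sub(r"<[^>]*>\s*\[//?\]", " ", utterance)
        # drop any remaining bracketed annotation codes
        utterance = re.sub(r"\[[^\]]*\]", " ", utterance)
        # expand contractions to their canonical form
        words = [CONTRACTIONS.get(w.lower(), w) for w in utterance.split()]
        return " ".join(words)

    print(preprocess("the mother <is washing> [//] is drying the dishes"))
    # -> "the mother is drying the dishes"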
4.2. Clause segmentation

The next step requires facing the problem of identifying dependent and independent clauses. In fact, while the Cookie Theft corpus already provides a segmentation of the input speech into sentences, this is not sufficient for the purposes of this work. Complex, compound or complex-compound sentences may contain references to multiple topics. The following excerpt shows an example of this problem, a complex sentence composed of a dependent and an independent clause: /and the mother is washing dishes while the water is running over in the sink on the floor/.

A possible way to cope with the separation of different sentence types is by using syntactic parse trees. Thus, in a similar way to the work of Feng et al. [24], POS tags are used for the identification of dependent and independent clauses. For the former, the tag SBAR is used, while for the latter, the proposed solution checks the sequence of nodes along the tree to verify if the tag S or the tags [NP VP] appear in the sequence (see http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html).

4.3. Coreference analysis

The analysis of coreference proves to be particularly useful in higher-level NLP applications that involve language understanding, such as extended discourse [25]. Strictly related with the notions of anaphora and cataphora, coreference resolution goes beyond the relation of dependence implied by these concepts. It allows identifying when two or more expressions refer to the same entity in a text. In this work, the analysis of coreference has been performed with the Stanford coreference resolution system [26], using the results of the segmentation performed in the previous step. During the process of building the hierarchy, the coreference information is used to guide the assignment of a subtopic to the corresponding level in the hierarchy. To this purpose, we constrain the results provided by the coreference system to those mentions whose referent is the subject of the sentence. We are not interested in considering other coreferential expressions because a subtopic, being a specialization of a topic, is typically referred to the subject of the sentence.

4.4. Sentence embeddings

In the last step of the pipeline, discourse transcripts are transformed into a representation suitable to compare and measure differences between sentences. In particular, the transformed transcripts should be robust to syntactic and lexical differences and should provide the capability to capture semantic regularities among sentences. To this purpose, we rely on a pre-trained model of word vector representations containing 2 million word vectors, in 300 dimensions, trained with fastText on Common Crawl [27]. In the process of converting a sentence into its vector space representation, we first perform a selection of four lexical items (nouns, pronouns, verbs, adjectives), then for each word we extract the corresponding word vector, and finally we compute the average over the whole sentence.

4.5. Topic hierarchy analysis

To create a topic hierarchy from a transcript, we follow a methodology that is partly inspired by current clinical practice. Thus, in modelling the problem we do not want to impose a predefined order or structure in the way topics and subtopics may be presented, as this, of course, will depend on how the discourse is organized. However, we can take advantage of the closed-domain nature of the task to define a reduced number of clusters of broad topics that will help to guide the construction of the hierarchy and the identification of off-topic clauses.

4.5.1. Topic clusters definition

As mentioned, the proposed solution relies on the supervised creation of a predefined number of clusters of broad topics. Each cluster contains a representative set of sentences that are related with the topic of the cluster. 10 clusters were defined: main scene, mother, boy, girl, children, garden, climate, not-related, incomplete, and no-content. The cluster not-related was used to model those sentences in which the participant is not performing the task (e.g., questions directed to the interviewer). The clusters incomplete and no-content are instead used to explicitly model sentences that may be characteristic of a language impairment. The former contains fragments of text that do not represent a complete sentence (e.g., /overflowing sink/), the latter identifies those expressions that do not add semantic information about the image (e.g., /fortunately there is nothing happening out there/, /what is going on/). To build the clusters, 30% of the data from both the AD and the control group is used. Each sentence has been manually annotated with the corresponding cluster label and clusters are simply modelled by the complete set of sentences belonging to them.
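Before turning to the hierarchy-building algorithm, the sketch below illustrates the sentence representation of Section 4.4 and the cluster-membership test just described (a hedged illustration; the word_vec lookup and the cluster dictionary are placeholders, not the authors' code):

    import numpy as np

    def sentence_embedding(words, word_vec):
        """Average the 300-d vectors of the selected lexical items."""
        vecs = [word_vec[w] for w in words if w in word_vec]
        return np.mean(vecs, axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def closest_cluster(sent_vec, clusters):
        """clusters: dict mapping a cluster label to a list of sentence embeddings."""
        best = max(
            ((label, cosine(sent_vec, ref))
             for label, refs in clusters.items() for ref in refs),
            key=lambda t: t[1],
        )
        return best[0]  # label of the most similar annotated sentence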
4.5.2. Topic hierarchy building algorithm

The algorithm to build the topic hierarchy relies on the cosine similarity between sentence embeddings. The first step consists in verifying to which topic cluster the current sentence belongs. This is achieved by computing the cosine similarity between the current sentence embedding and each sentence embedding in each topic cluster. The highest result determines the cluster for the new sentence. In the following step, we need to assign the current sentence embedding to a level in the current hierarchy. This implies establishing whether we are dealing with a new or a repeated topic and its level of specialization (i.e., subtopic, sub-subtopic, etc.). This is achieved by first identifying, in the current hierarchy, the sub-graph whose nodes belong to the same cluster as the current sentence (e.g., the sub-graph corresponding to the mother cluster). Then, we compute the cosine similarity between the current sentence and each node of this sub-graph. The new sentence is considered a son of the closest node if the similarity is lower than a threshold. Otherwise, it is considered a repeated topic. If there is no sub-graph, the sentence embedding is added as a new topic. If the new topic results to be a coreferential expression, this kind of information supersedes the cosine metric strategy, and the new topic is added directly as a son of its referent.
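A compact sketch of this step is given below, under simplifying assumptions (nodes stored as a list of dicts, a single arbitrary threshold value, and coreference handling omitted; this is an illustration, not the authors' implementation):

    import numpy as np

    THRESHOLD = 0.8  # illustrative; the paper does not report the exact value

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def add_sentence(hierarchy, sent_vec, cluster_label):
        """hierarchy: list of node dicts {vec, cluster, parent, repeated}."""
        same_cluster = [n for n in hierarchy if n["cluster"] == cluster_label]
        if not same_cluster:
            # no sub-graph for this cluster yet: add the sentence as a new topic
            hierarchy.append({"vec": sent_vec, "cluster": cluster_label,
                              "parent": None, "repeated": False})
            return
        closest = max(same_cluster, key=lambda n: cosine(sent_vec, n["vec"]))
        repeated = cosine(sent_vec, closest["vec"]) >= THRESHOLD
        # below the threshold: new subtopic under the closest node;
        # at or above it: counted as a repeated topic
        hierarchy.append({"vec": sent_vec, "cluster": cluster_label,
                          "parent": closest, "repeated": repeated})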
Although the algorithm developed resembles the analysis performed in standard clinical practice, the aim of this work is not the comparison of the automatic method with the manual one. Instead, our focus is on understanding whether pragmatic features related with topic coherence analysis may be relevant to discriminate AD. The type of features computed, as well as the results of the classification experiments, are described in the following sections.

5. Topic coherence features

Through the multistage approach and the final hierarchy of topics we identified sixteen measurements: (1-4) the number of topics, subtopics, sub-subtopics and sub-sub-subtopics introduced; (5-6) the proportion of dependent and independent clauses to the total number of sentences; (7) the total number of coreferential mentions; (8) the total number of topics, subtopics, sub-subtopics and sub-sub-subtopics repeated; (9-11) the number of sentences that were classified as not-related, incomplete, or no-content in the first step of the main algorithm; (12) the coefficient of variation (the ratio of the standard deviation to the mean) of the cosine similarity between two temporally consecutive topics; (13) the length of the longest path from the root node to all leaves; (14) the average number of outgoing edges of all nodes; (15) the total number of sentences; and (16) the ratio of dependent to independent clauses.
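A rough sketch of how a few of these graph-derived measurements could be read off the hierarchy built above is shown here (it reuses the node-list representation of the previous sketch and is only illustrative):

    def topic_features(hierarchy):
        """Compute a handful of the graph-derived measurements (illustrative)."""
        n_topics = sum(1 for n in hierarchy if n["parent"] is None)
        n_repeated = sum(1 for n in hierarchy if n["repeated"])

        def depth(node):
            d = 0
            while node["parent"] is not None:
                node, d = node["parent"], d + 1
            return d

        longest_path = max((depth(n) for n in hierarchy), default=0)
        edges = sum(1 for n in hierarchy if n["parent"] is not None)
        avg_out_degree = edges / len(hierarchy) if hierarchy else 0.0
        return {"topics": n_topics, "repeated": n_repeated,
                "longest_path": longest_path, "avg_out_degree": avg_out_degree}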
6. Results and discussion

Classification experiments were performed with a Random Forest classifier, using 70% of the remaining data of the Cookie Theft corpus, once 30% of the data was retained to model the topic hierarchy. A stratified k-fold cross-validation per-subject strategy was implemented, with k equal to 10.

Initial results, using the set of features described previously, provided an average accuracy of 74% in distinguishing AD patients from healthy controls. Then, in order to understand the importance of each feature, we implemented a forward feature selection method. This is an iterative approach in which the model is trained with a varying number of features. Starting with no features, at each iteration we test the accuracy of the model by adding, one at a time, each of the features that were not selected in a previous iteration. The accuracy is evaluated with a stratified 10-fold cross-validation. The feature that yields the best accuracy is retained for further processing. The results of this method are shown in Figure 3. With this approach, we identified the first six features as the most relevant in discriminating the disease.

Figure 3: Variation of the classification accuracy while increasing the number of features.

Our results, achieved with 70% of the data, provided an average accuracy of 77% and an average F-score of 77% in classifying AD. Interestingly, the number of topics was the first feature selected, providing, alone, an average accuracy of 67%. Comparing these results with the current state of the art, we acknowledge that Fraser et al. [8] achieved a higher accuracy (81%) using a set of lexicosyntactic features. On the other hand, we also recognized that these results are slightly better than the ones achieved by Yancheva et al. [11] (F-score 74%) using only a set of 12 semantic features. However, when the authors combine lexicosyntactic and semantic information, the F-score improves to 80%. These considerations are interesting for multiple reasons: on one side they confirm the relevance of pragmatic features related with topic coherence in the task of classifying AD; on the other hand, they also highlight that lexicosyntactic features are extremely important in characterizing the disease and should be used in a complementary way with other features.

7. Conclusions

In this work, we approached the problem of exploiting topic coherence analysis to automatically classify AD. To this purpose, we proposed an algorithm inspired by the type of assessment conducted by clinicians to construct the topic hierarchy of a picture description task, from which we extract a reduced set of pragmatic features for automatic classification. Initial experimental results show comparable AD classification performance to current state-of-the-art approaches using different types of consolidated linguistic features. As future work, we plan to integrate the proposed pragmatic features with lexicosyntactic features and to explore the extension of this kind of analysis to other types of discourse production tasks, including open-domain tasks.

8. Acknowledgments

The authors want to express their gratitude to Dr. Filipa Miranda for her precious advice and help provided during the development of this work. This work was supported by Portuguese national funds through Fundação para a Ciência e a Tecnologia (FCT), under Grant SFRH/BD/97187/2013 and Project with reference UID/CEC/50021/2013.

9. References

[1] R. Brookmeyer, E. Johnson, K. Ziegler-Graham, and H. M. Arrighi, "Forecasting the global burden of Alzheimer's disease," Alzheimer's & Dementia, vol. 3, no. 3, pp. 186-191, 2007.
[2] K. E. Forbes, A. Venneri, and M. F. Shanks, "Distinct patterns of spontaneous speech deterioration: an early predictor of Alzheimer's disease," Brain Cogn, vol. 48, no. 2-3, pp. 356-361, Mar-Apr 2002.
[3] J. Reilly, J. Troche, and M. Grossman, "Language processing in dementia," The handbook of Alzheimer's disease and other dementias, pp. 336-368, 2011.
[4] V. Taler and N. A. Phillips, "Language performance in Alzheimer's disease and mild cognitive impairment: a comparative review," J Clin Exp Neuropsychol, vol. 30, no. 5, pp. 501-556, Jul 2008.
[5] G. Oppenheim, "The earliest signs of Alzheimer's disease," J Geriatr Psychiatry Neurol, vol. 7, no. 2, pp. 116-120, Apr-Jun 1994.
[6] D. Kempler, "Language changes in dementia of the Alzheimer type," Dementia and communication, pp. 98-114, 1995.
[7] S. H. Ferris and M. R. Farlow, "Language impairment in Alzheimer's disease and benefits of acetylcholinesterase inhibitors," in Clinical Interventions in Aging, 2013.
[8] K. C. Fraser, J. A. Meltzer, and F. Rudzicz, "Linguistic features identify Alzheimer's disease in narrative speech," J Alzheimers Dis, vol. 49, no. 2, pp. 407-422, 2016.
[9] S. O. Orimaye, J. S.-M. Wong, and K. J. Golden, "Learning predictive linguistic features for Alzheimer's disease and related dementias using verbal utterances," in Proceedings of the 1st Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2014, pp. 78-87.
[10] M. Yancheva, K. C. Fraser, and F. Rudzicz, "Using linguistic features longitudinally to predict clinical scores for Alzheimer's disease and related dementias," in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, 2015, pp. 134-139.
[11] M. Yancheva and F. Rudzicz, "Vector-space topic models for detecting Alzheimer's disease," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 2337-2346.
[12] K. C. Fraser and G. Hirst, "Detecting semantic changes in Alzheimer's disease with vector space models," in Proceedings of the LREC 2016 Workshop: Resources and Processing of Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric Impairments (RaPID-2016), no. 128. Linköping University Electronic Press, 2016, pp. 1-8.
[13] M. Mentis and C. A. Prutting, "Analysis of topic as illustrated in a head-injured and a normal adult," Journal of Speech, Language, and Hearing Research, vol. 34, no. 3, pp. 583-595, 1991.
[14] M. Brady, C. Mackenzie, and L. Armstrong, "Topic use following right hemisphere brain damage during three semi-structured conversational discourse samples," Aphasiology, vol. 17, no. 9, pp. 881-904, 2003.
[15] C. Mackenzie, M. Brady, J. Norrie, and N. Poedjianto, "Picture description in neurologically normal adults: Concepts and topic coherence," Aphasiology, vol. 21, no. 3-4, pp. 340-354, 2007.
[16] F. Miranda, "Influência da escolaridade na dimensão macrolinguística do discurso," Master's thesis, Universidade Católica Portuguesa, Instituto de Ciências da Saúde, 2015.
[17] L. Santos, E. A. Corrêa Júnior, O. Oliveira Jr, D. Amancio, L. Mansur, and S. Aluísio, "Enriching complex networks with word embeddings for detecting mild cognitive impairment from speech transcripts," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2017, pp. 1284-1296. [Online]. Available: http://www.aclweb.org/anthology/P17-1118
[18] J. T. Becker, F. Boiler, O. L. Lopez, J. Saxton, and K. L. McGonigle, "The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis," Archives of Neurology, vol. 51, no. 6, pp. 585-594, 1994.
[19] B. MacWhinney, S. Bird, C. Cieri, and C. Martell, "TalkBank: Building an open unified multimodal database of communicative interaction," in 4th International Conference on Language Resources and Evaluation, 2004, pp. 525-528.
[20] B. MacWhinney, "The CHILDES project: Tools for analyzing talk, 3rd edition," Lawrence Erlbaum Associates, Mahwah, New Jersey, 2000.
[21] H. Goodglass, E. Kaplan, and B. Barresi, The Boston Diagnostic Aphasia Examination. Baltimore: Lippincott, Williams & Wilkins, 2001.
[22] H. Ulatowska and S. Chapman, Discourse Analysis and Applications: Studies in Adult Clinical Populations. Hillsdale: Lawrence Erlbaum Associates, 1994, ch. Discourse macrostructure in aphasia, pp. 29-46.
[23] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1. Association for Computational Linguistics, 2003, pp. 423-430.
[24] S. Feng, R. Banerjee, and Y. Choi, "Characterizing stylistic elements in syntactic structure," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012, pp. 1522-1533.
[25] S. Boytcheva, P. Dobrev, and G. Angelova, "CGExtract: Towards extraction of conceptual graphs from controlled English," Contributions to ICCS-2001, 9th International Conference on Conceptual Structures, 2001.
[26] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55-60. [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010
[27] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, "Advances in pre-training distributed word representations," in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.


Building a global dictionary for semantic technologies


Eszter Iklódi1 , Gábor Recski1 , Gábor Borbély2 , Maria Jose Castro-Bleda3
1 Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Budapest, Hungary
2 Department of Algebra, Budapest University of Technology and Economics, Budapest, Hungary
3 Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Valencia, Spain
eszter.iklodi@gmail.com, recski.gabor@aut.bme.hu, borbely@math.bme.hu,
mcastro@dsic.upv.es

Abstract

This paper proposes a novel method for finding linear mappings among word vectors for various languages. Compared to previous approaches, this method does not learn translation matrices between two specific languages, but between a given language and a shared, universal space. The system was trained in two different modes, first between two languages, and after that applying three languages at the same time. In the first case two different training data sets were applied: Dinu's English-Italian benchmark data [1], and English-Italian translation pairs extracted from the PanLex database [2]. In the second case only the PanLex database was used.

With the best setting, the system performs significantly better on English-Italian than the baseline system of Mikolov et al. [3], and it provides a performance comparable with the more sophisticated systems of Faruqui and Dyer [4] and Dinu et al. [1]. Exploiting the richness of the PanLex database, the proposed method makes it possible to learn linear mappings among an arbitrary number of languages.
Index Terms: semantics, word embeddings, multilingual embeddings, translation, artificial neural networks

1. Introduction

Computer-driven natural language processing plays an increasingly important role in our everyday life. In the current digital world, using natural language for human-machine communication has become a basic requirement. In order to meet this requirement, it is inevitable to analyze human languages semantically. Nowadays, state-of-the-art systems represent word meaning with high-dimensional vectors, known as word embeddings.

Current embedding models are learned from monolingual corpora, and are therefore language dependent. But one might ask if the structure of the different embeddings, i.e. different meaning representations, is universal among all human languages. Youn et al. [5] proposed a procedure for building graphs from concepts of different languages. They found that these graphs reflected a certain structure of meaning with respect to the languages they were built of. They concluded that the structural properties of these graphs are consistent across different language groups, and largely independent of geography, environment, and the presence or absence of literary traditions. Such findings led to a new research direction within the field of computational semantics, which focuses on the construction of universal meaning representations, most of the time in the form of cross-lingual word embedding models [6].

One way to create such models is to find mappings between embeddings of different languages [3, 7, 8]. Our work proposes a novel procedure for learning such mappings in the form of translation matrices that serve to map each language to a universal space.

Section 2 summarizes the progress made on learning translation matrices between word embeddings over the last five years. Section 3 discusses the proposed method in detail. Following that, Section 4 describes the experimental setup we used and reports the obtained results. Finally, Section 5 concludes with the advantages and disadvantages of the proposed model, and also discusses some improvements for future work.

2. Related work

In 2013, Mikolov et al. [3] published a simple two-step procedure for creating universal embeddings. In the first step they built monolingual models of languages using huge corpora, and in the second step a small bilingual dictionary was used to learn a linear projection between the languages. The optimization problem was the following:

    \min_{W} \sum_{i=1}^{n} \| W x_i - z_i \|^2    (1)

where W denotes the transformation matrix, and {x_i, z_i}, i = 1, ..., n, are the continuous vector representations of word translation pairs, with x_i being in the source language space and z_i in the target language space.

Faruqui and Dyer [4] proposed a procedure to obtain multilingual word embeddings by concatenating the two word vectors coming from the two languages, applying canonical correlation analysis (CCA). Xing et al. [9] found that bilingual translation can be largely improved by normalizing the embeddings and by restricting the transformation matrices to orthogonal ones. Dinu et al. [1] showed that the neighbourhoods of the mapped vectors are strongly polluted by hubs, which are vectors that tend to be near a high proportion of items. They proposed a method that computes hubness scores for target-space vectors and penalizes those vectors that are close to many words, i.e. hubs are down-ranked in the neighbouring lists. Lazaridou et al. [10] studied some theoretical and empirical properties of a general cross-space mapping function, and tested them on cross-linguistic (word translation) and cross-modal (image labelling) tasks. They also introduced the use of negative samples during the learning process. Amar et al. [11] proposed methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space.
English usually offers the largest corpora and biligual dictio- were obtained when applying the SVD only once, at the begin-
naries, they used the English embeddings to serve as the shared ning of the learning process.
embedding space. Artetxe et al. [12] built a generic framework
that generalizes previous works made on cross-linguistic em- 4. Experiments
beddings and they concluded that the best systems were the
ones with orthogonality constraint and a global pre-processing 4.1. Experimental setup
with length normalization and dimension-wise mean centering.
Smith et al. [7] also proved that translation matrices should be 4.1.1. Pre-trained word embeddings
orthogonal, for which they applied singular value decomposi- For pre-trained word embeddings we took the fastText embed-
tion (SVD) on the transformation matrices. Besides, they also dings published by Conneau et al. [8]. These embeddings were
introduced a novel “inverted softmax” method for identifying trained by applying their novel method where words are repre-
translation pairs. All these works listed above applied super- sented as a bag of character n-grams [13]. This model outper-
vised learning. However, in 2017 Conneau et al. [8] introduced formed Mikolov’s [14] CBOW and skipgram baseline systems
an unsupervised way for aligning monolingual word embedding that did not take any sub-word information into account. Con-
spaces between two languages without using any parallel cor- neau’s pre-trained word vectors trained on Wikipedia are avail-
pora. This unsupervised procedure holds the current state-of-the-art results on Dinu's benchmark word translation task. For a comparison of the different results, see Table 1 and Table 2.

3. Proposed method

We propose a method that learns linear mappings between word translation pairs in the form of translation matrices that map pre-trained word embeddings into a universal vector space. During training, the cosine similarity of word translation pairs, computed in the universal space, is maximized. The method is applicable to any number of languages. Since exactly one translation matrix is learned per language, independently of how many languages are used during training, the number of learned parameters remains linear in the number of languages.

Let L be a set of languages, and TP a set of translation pairs where each entry is a tuple of the form (w_1, w_2), where w_1 is a word in language L_1, w_2 is a word in language L_2, and both L_1 and L_2 are in L. Then, consider the following objective to optimize:

\frac{1}{|TP|} \sum_{L_1, L_2 \in L} \; \sum_{(w_1, w_2) \in TP} \mathrm{cos\_sim}(w_1 \cdot T_1,\, w_2 \cdot T_2) \quad (2)

where T_1 and T_2 are the translation matrices mapping L_1 and L_2 to the universal space. Since the equation is normalized by the number of translation pairs in the TP set, the optimal value of this function is 1. Off-the-shelf optimizers are designed to find local minima, so during training the loss function is multiplied by −1. Word vectors are always normalized, so cos_sim reduces to a simple dot product.

At test time, both source and target language words are first mapped into the universal space, and a look-up space is defined from the 200k most frequent mapped target-language words. The system is then evaluated with the Precision metric, more specifically with Precision @1, @5, and @10, where Precision @N denotes the percentage of cases in which the true translation of a source word is found among the N closest word vectors in the look-up space. The distance used when searching the look-up space is cos_sim.

Previous works, such as Mikolov et al. [7] or Conneau et al. [8], suggested restricting the transformation matrix to an orthogonal one. From an arbitrary transformation matrix T, an orthogonal T′ can be obtained by applying the SVD procedure. Our experiments showed that applying SVD to the transformation matrices makes learning significantly faster. Best results

able for 294 languages¹.

Some experiments were also run using the same embedding that was used by Dinu et al. [1] in their experiments. These word vectors were trained with word2vec, and the 200k most common words in both the English and Italian corpora were extracted. The English word vectors were trained on the WackyPedia/ukWaC and BNC corpora, while the Italian word vectors were trained on the WackyPedia/itWaC corpus. This word embedding will be referred to as the WaCky embedding.

4.1.2. Gold dictionaries

First, we ran the experiments on Dinu's English-Italian benchmark data [1]. It is an English-Italian gold dictionary split into a train and a test set, which was built from Europarl en-it² [15]. For the test set they used 1,500 English words split into 5 frequency bins, 300 randomly chosen in each bin. The bins are defined in terms of rank in the frequency-sorted lexicon: [1-5K], [5K-20K], [20K-50K], [50K-100K], and [100K-200K]. Some of these 1,500 English words have multiple Italian translations in the Europarl dictionary, so the resulting test set contains 1,869 word pairs altogether, with 1,500 different English and 1,849 different Italian words. For the training set the top 5k entries were extracted, and care was taken to avoid any overlap with test elements on the English side. On the Italian side, however, an overlap of 113 words is still present. In the end the train set contains 5k word pairs with 3,442 different English and 4,549 different Italian words.

Then, we built another gold dictionary similar to Dinu's, but this time the translation pairs were extracted from the PanLex [2] database. PanLex is a nonprofit organization that aims to build a multilingual lexical database from available dictionaries made by domain experts in all languages. A confidence value is assigned to each translation pair, which can be used for filtering the extracted data. These confidence values are in the range [1, 9], with 9 meaning high and 1 meaning low confidence. During the extraction process, translations with a confidence value below 7, and those for which no word vector was found in the fastText embedding, were dropped. Then, training and test sets were constructed following Dinu's steps, except that only those English words were taken for which a single Italian translation was present. Experiments showed that otherwise serious noise was introduced into the system, since in many cases one English word can have up to 10 different Italian translations.

¹ http://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
² http://opus.lingfil.uu.se/
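As a concrete illustration of the training objective in Equation (2), the following is a minimal sketch for the bilingual case, written with PyTorch (the paper does not state which framework was used); the tensor names, shapes, optimizer choice and random vectors are illustrative assumptions, not the authors' implementation.

import torch

# Illustrative sizes (assumptions): 300-dimensional embeddings mapped to a
# 300-dimensional universal space; one translation matrix per language.
DIM = 300
T_en = torch.nn.Parameter(torch.eye(DIM))
T_it = torch.nn.Parameter(torch.eye(DIM))
# Learning rate 0.1 and batch size 64 are the values reported in Section 4.2.1.
optimizer = torch.optim.SGD([T_en, T_it], lr=0.1)

def neg_eq2(batch_en, batch_it):
    """Negative of the Equation (2) objective for one batch of translation pairs.

    batch_en, batch_it: (batch, DIM) tensors of L2-normalized word vectors, where
    row i of batch_en is a translation of row i of batch_it.
    """
    mapped_en = batch_en @ T_en                        # map into the universal space
    mapped_it = batch_it @ T_it
    # Re-normalize so that cosine similarity reduces to a dot product.
    mapped_en = torch.nn.functional.normalize(mapped_en, dim=1)
    mapped_it = torch.nn.functional.normalize(mapped_it, dim=1)
    cos_sim = (mapped_en * mapped_it).sum(dim=1)       # one similarity per pair
    return -cos_sim.mean()                             # maximize => minimize the negative

# One training step on a random batch of 64 pairs.
en = torch.nn.functional.normalize(torch.randn(64, DIM), dim=1)
it = torch.nn.functional.normalize(torch.randn(64, DIM), dim=1)
optimizer.zero_grad()
loss = neg_eq2(en, it)
loss.backward()
optimizer.step()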

4.2. Results

4.2.1. Parameter adjustment using Dinu's data

First, parameter adjustment was performed using Dinu's data, which gave 0.1 as the best learning rate and 64 as the best batch size, where the batch size is the number of translation pairs used in one iteration. When SVD is applied only once at the beginning, the results obtained by our best system are significantly worse than the state-of-the-art results on this benchmark data, but they are comparable with, or even better than, some of the previous models discussed in Section 2. For comparison see Table 1 and Table 2.

Eng-Ita                            @1     @5     @10
Mikolov et al.                     0.338  0.483  0.539
Faruqui et al.                     0.361  0.527  0.581
Dinu et al.                        0.385  0.564  0.639
Smith et al. (2017)                0.431  0.607  0.651
Conneau et al. (2017) - WaCky      0.451  0.607  0.651
Conneau et al. (2017) - fastText   0.662  0.804  0.834
Proposed method - fastText         0.377  0.565  0.625
Proposed method - WaCky            0.220  0.333  0.373
Table 1. Comparing English-Italian results on Dinu's data.

Ita-Eng                            @1     @5     @10
Mikolov et al.                     0.249  0.410  0.474
Faruqui et al.                     0.310  0.499  0.570
Dinu et al.                        0.246  0.454  0.541
Smith et al. (2017)                0.380  0.585  0.636
Conneau et al. (2017) - WaCky      0.383  0.578  0.628
Conneau et al. (2017) - fastText   0.587  0.765  0.809
Proposed method - fastText         0.310  0.502  0.547
Proposed method - WaCky            0.103  0.163  0.190
Table 2. Comparing Italian-English results on Dinu's data.

4.2.2. Experiments with the PanLex data

Using the PanLex database, some experiments were made with different training set sizes. 3k training examples proved to be the best, as Table 3 shows.

         eng-ita                      ita-eng
Prec.    @1      @5      @10          @1      @5      @10
1k       0.1500  0.2847  0.3340       0.1391  0.2761  0.3256
3k       0.2127  0.3473  0.3933       0.2232  0.3650  0.4152
5k       0.1980  0.3193  0.3620       0.2212  0.3555  0.4030
10k      0.1613  0.2807  0.3227       0.1879  0.3012  0.3372
Table 3. Experiments with different training set sizes

4.2.3. Comparison of systems trained on Dinu's and PanLex data

In the next step, some experiments were made to determine which data is more apt for learning linear mappings between embeddings. In order to compare all the experiments objectively, subsets of the original test sets were created. These subsets do not contain any English word present either in the Dinu training set or in the PanLex training set. Table 4 summarizes the number of word pairs in the old and the new test sets. It should be noted that this reduction mainly affects the most common English words, and therefore worse scores are expected compared to the previous train-on-Dinu-test-on-Dinu or train-on-PanLex-test-on-PanLex top results. Scores on Dinu's test set are shown in Table 5 and on the PanLex data in Table 6. The obtained results show that training on the PanLex data cannot beat the system trained on Dinu's data, which performs better both on Dinu's and on the PanLex test sets. Not even combining the two training sets succeeds in achieving significantly better results, although on the PanLex test set it does improve the scores in the Italian-English direction.

test set   No. word pairs in old   No. word pairs in new
Dinu       1869                    1455
PanLex     1500                    1242
Table 4. Word reduction of the new test sets

4.2.4. Continuing the training with PanLex data

Another experiment was conducted to continue the baseline system trained on Dinu's data with the PanLex data. In other words, it is the same as initializing the translation matrices of the PanLex training process with the previously learned ones. The baseline system reaches its best performance between 2000 and 4000 epochs, depending on which precision value is regarded. Table 7 shows that on the English-Italian task there is no improvement at all, while on the Italian-English task the best setting achieves slightly better scores on the precision @1 and @10 values.

4.2.5. Experiments using three languages

Finally, a multilingual experiment was carried out where the system was trained on three languages - English, Italian, and Spanish - at the same time. During training the system learns three different translation matrices, one for the English-universal, one for the Italian-universal, and one for the Spanish-universal space mapping. For example, in order to learn the English-universal translation matrix, both the English-Italian and the English-Spanish dictionaries are used, according to Equation (2). Batches are homogeneous, but two consecutive batches always differ in the language origins of the contained data. That is, first an English-Italian batch is fed to the system, then an English-Spanish batch, after that an Italian-Spanish batch, and so on. First, bilingual models were trained in order to compare them later with the multilingual system. The results of the bilingual models are summarized in Table 8. Results are best on the Italian-Spanish task. Next, the system was trained using all three languages at the same time. During the training process the model was evaluated on the bilingual test datasets, of which the results are shown in Table 9. The obtained results show that no advantage was achieved by extending the number of languages, since the multilingual model performs worse than any of the pairwise bilingual models.

eng-ita ita-eng
Precision @1 @5 @10 @1 @5 @10
train:Dinu - test:old 0.3770 0.5647 0.6245 0.3103 0.5018 0.5474
train:Dinu - test:new 0.3560 0.5407 0.5978 0.2917 0.4792 0.5215
train:PanLex - test:new 0.1360 0.2309 0.2594 0.1361 0.2556 0.2965
train:Dinu+PanLex - test:new 0.2930 0.4349 0.4861 0.2910 0.4556 0.5090
Table 5. Comparing Dinu’s and PanLex data on Dinu’s test set

eng-ita ita-eng
Precision @1 @5 @10 @1 @5 @10
train:PanLex - test:old 0.1960 0.3087 0.3440 0.1838 0.3059 0.3443
train:PanLex - test:new 0.1812 0.2858 0.3196 0.1668 0.2835 0.3213
train:Dinu - test:new 0.2295 0.4171 0.4839 0.2227 0.3763 0.4199
train:Dinu+PanLex - test:new 0.2295 0.3712 0.4275 0.2498 0.4026 0.4495
Table 6. Comparing Dinu’s and PanLex data on the PanLex test set

eng-ita ita-eng
Precision @1 @5 @10 @1 @5 @10
original 0.3770 0.5647 0.6245 0.3103 0.5018 0.5474
cont from 2000 0.3426 0.5256 0.5802 0.3229 0.4882 0.5535
cont from 3000 0.3535 0.5416 0.5970 0.3229 0.4840 0.5465
cont from 4000 0.3510 0.5273 0.5911 0.3118 0.4701 0.5243
Table 7. Continuing the baseline system with the PanLex data.

L1-L2 L2-L1
Precision @1 @5 @10 @1 @5 @10
eng-ita 0.2080 0.3280 0.3687 0.2082 0.3386 0.3904
eng-spa 0.2840 0.4320 0.4800 0.2883 0.4331 0.4836
spa-ita 0.3920 0.5340 0.5813 0.3655 0.5291 0.5750
Table 8. Results of bilingual models trained pairwise on the three different languages.

L1-L2 L2-L1
Precision @1 @5 @10 @1 @5 @10
eng-ita 0.1573 0.2667 0.3127 0.1638 0.2942 0.3386
eng-spa 0.1947 0.2973 0.3447 0.2350 0.3538 0.4064
spa-ita 0.2520 0.3640 0.4160 0.2568 0.3723 0.4162
Table 9. Bilingual results of the multilingual model trained using three different languages at the same time.
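All scores in Tables 1-9 are Precision @N values, obtained by nearest-neighbour search in the look-up space of mapped target-language words, as described in Section 3. The following NumPy sketch shows that evaluation on toy data; array names and sizes are assumptions made for the example, not the authors' evaluation code.

import numpy as np

def precision_at_n(mapped_src, lookup, gold_indices, ks=(1, 5, 10)):
    """Precision @N for a batch of source words.

    mapped_src:   (Q, d) L2-normalized source vectors, already mapped to the universal space.
    lookup:       (V, d) L2-normalized look-up space (e.g. the 200k most frequent target words).
    gold_indices: (Q,) index in the look-up space of the correct translation of each query.
    """
    sims = mapped_src @ lookup.T                  # cosine similarity is a dot product here
    ranking = np.argsort(-sims, axis=1)           # best candidates first
    return {f"@{k}": (ranking[:, :k] == gold_indices[:, None]).any(axis=1).mean() for k in ks}

# Toy example: a 1000-word look-up space of 300-dimensional vectors and 10 queries.
rng = np.random.default_rng(0)
lookup = rng.normal(size=(1000, 300))
lookup /= np.linalg.norm(lookup, axis=1, keepdims=True)
queries = lookup[:10] + 0.01 * rng.normal(size=(10, 300))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
print(precision_at_n(queries, lookup, np.arange(10)))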

5. Conclusions and future work

This paper proposes a novel method for finding linear mappings between word embeddings in different languages. As a proof of concept, a framework was developed which enabled basic parameter adjustments and flexible configuration for initial experimentation.

An interesting finding was that the system learned much faster when an initial SVD was applied on the translation matrices. Results obtained with these settings on Dinu's data showed that the proposed model did learn from the data. The obtained precision scores, though, were far from the current state-of-the-art results on this benchmark data; they were comparable with the results of previous attempts. The proposed model performed much better using the fastText embeddings [8] than using Dinu's WaCky embeddings [1].

Thereafter, an English-Italian dataset was extracted from the PanLex database, from which training and test datasets were constructed roughly following the same steps that Dinu et al. [1] took. The system was trained and tested on both Dinu's and PanLex test sets, and in both cases the matrices trained on Dinu's data were the ones reaching higher scores. On the PanLex data, experiments with different training set sizes were executed, out of which the 3k training set gave the best results. Continuing the training of the matrices obtained by using Dinu's data with the PanLex dataset brought a slight improvement on the Italian-English scores, but the English-Italian scores only got worse.

Finally, the system was trained on three different languages at the same time. The obtained pairwise precision values proved to be worse than the results obtained when the system was trained in bilingual mode. However, these results are still promising considering that a completely new approach was implemented, and they showed that the system definitely learned from data which is available for a wide range of languages.

The approach is quite promising, but in order to reach state-of-the-art performance the system has to deal with some mathematical issues, for example dimension reduction in the universal space. Further experimentation in multilingual mode with an extended number of languages could also provide meaningful outputs. By involving expert linguistic knowledge, various sets of languages could be constructed using either only very close languages or, on the contrary, very distant languages.
Thanks to the PanLex database, bilingual dictionaries can easily
be extracted, which can, then, be directly used for multilingual
experiments.

6. Acknowledgements
This work is a collaboration of the Universitat Politècnica de
València (UPV) and the Budapest University of Technology and
Economics (BUTE). Work partially supported by the Spanish
MINECO and FEDER funds under project TIN2017-85854-C4-2-R.

7. References
[1] G. Dinu, A. Lazaridou, and M. Baroni, “Improving zero-shot
learning by mitigating the hubness problem,” arXiv preprint
arXiv:1412.6568, 2014.
[2] D. Kamholz, J. Pool, and S. M. Colowick, “Panlex: Building a
resource for panlingual lexical translation.” in LREC, 2014, pp.
3145–3150.
[3] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similari-
ties among languages for machine translation,” arXiv preprint
arXiv:1309.4168, 2013.
[4] M. Faruqui and C. Dyer, “Improving vector space word represen-
tations using multilingual correlation,” in Proceedings of the 14th
Conference of the European Chapter of the Association for Com-
putational Linguistics, 2014, pp. 462–471.
[5] H. Youn, L. Sutton, E. Smith, C. Moore, J. F. Wilkins, I. Mad-
dieson, W. Croft, and T. Bhattacharya, “On the universal struc-
ture of human lexical semantics,” Proceedings of the National
Academy of Sciences, vol. 113, no. 7, pp. 1766–1771, 2016.
[6] S. Ruder, I. Vulić, and A. Søgaard, “A survey of cross-lingual
word embedding models,” arXiv preprint arXiv:1706.04902,
2017.
[7] S. L. Smith, D. H. Turban, S. Hamblin, and N. Y. Hammerla, “Of-
fline bilingual word vectors, orthogonal transformations and the
inverted softmax,” arXiv preprint arXiv:1702.03859 (published at ICLR 2017), 2017.
[8] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and
H. Jégou, “Word translation without parallel data,” arXiv preprint
arXiv:1710.04087, 2017.
[9] C. Xing, D. Wang, C. Liu, and Y. Lin, “Normalized word embed-
ding and orthogonal transform for bilingual word translation,” in
Proceedings of the 2015 Conference of the North American Chap-
ter of the Association for Computational Linguistics: Human Lan-
guage Technologies, 2015, pp. 1006–1011.
[10] A. Lazaridou, G. Dinu, and M. Baroni, “Hubness and pollution:
Delving into cross-space mapping for zero-shot learning,” in Pro-
ceedings of the 53rd Annual Meeting of the Association for Com-
putational Linguistics and the 7th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers), vol. 1,
2015, pp. 270–280.
[11] W. Ammar, G. Mulcaire, Y. Tsvetkov, G. Lample, C. Dyer, and
N. A. Smith, “Massively multilingual word embeddings,” arXiv
preprint arXiv:1602.01925, 2016.
[12] M. Artetxe, G. Labaka, and E. Agirre, “Learning principled bilin-
gual mappings of word embeddings while preserving monolingual
invariance,” in Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, 2016, pp. 2289–2294.
[13] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enrich-
ing word vectors with subword information,” arXiv preprint
arXiv:1607.04606, 2016.
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient esti-
mation of word representations in vector space,” arXiv preprint
arXiv:1301.3781, 2013.
[15] J. Tiedemann, “Parallel data, tools and interfaces in opus.” in
LREC, 2012, pp. 2214–2218.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

TransDic, a public domain tool for the generation of phonetic dictionaries in standard and dialectal Spanish and Catalan

Juan-María Garrido¹, Marta Codina², Kimber Fodge²

¹ Phonetics Lab, National Distance Education University, Spain
² Department of Translation and Language Sciences, Pompeu Fabra University, Spain

jmgarrido@flog.uned.es

Abstract

This paper presents TransDic, a free distribution tool for the phonetic transcription of word lists in Spanish and Catalan which allows the generation of phonetic transcription variants, a feature that can be useful for some technological applications, such as speech recognition. It allows transcription not only in standard Spanish and Catalan, but also in several dialects of these two languages spoken in Spain. Its general structure, input, output and main functionalities are presented, and the procedure followed to define and implement the transcription rules in the tool is described. Finally, the results of an evaluation carried out for both languages are presented, which show that TransDic performs correctly the transcription tasks that it was developed for.

Index Terms: Automatic phonetic transcription, Speech recognition, Dialects, Spanish, Catalan, Tools

1. Introduction

Phonetised dictionaries are language resources that are useful for several purposes in Speech Technologies, but probably their most frequent use is in automatic speech recognition. Phonetised dictionaries included in speech recognition systems are expected to contain not a unique transcription for each entry in the dictionary, but several transcription variants representing alternative pronunciations of the word in the language, as it has been shown that including these variants in the dictionary improves speech recognition accuracy (see for example [1], among many other works). These variants may represent dialectal, sociolectal or even individual variations of the pronunciation of the word.

Two main approaches to the generation of these dictionaries have been attempted: data-driven and knowledge-based [2]. Data-driven techniques obtain the transcription variants, as well as their relative frequency, from the analysis of a phonetised corpus, which has to be large and representative enough to derive a reliable set of pronunciation variants. This analysis has frequently been carried out using machine learning techniques ([3,4,5,6,7], among others). Despite its evident advantages, the main drawback of this approach is that it is very dependent on the contents of the corpus used, in which rare words of the language will probably not be represented. Knowledge-based approaches make use of explicit linguistic knowledge, for example in the form of transcription rules implemented in an automatic phonetic transcriber. In this case, the automatic transcriber generates one or several transcription variants for each word, depending on the number of phonetic processes that can be applied to it. The main advantage of this approach is that it ensures full coverage of the considered pronunciation variants, but it also has disadvantages, such as the fact that an excessive number of variants may be generated for some words, which may decrease the performance of the speech recognizer ([8]). Also, the probability of each transcription variant should be obtained by rule as well, a much more difficult task in the current state of linguistic knowledge.

Other possible technological applications of phonetised dictionaries, such as the phonetisation of classical language dictionaries, do not require transcription variants. In this case, only a single transcription, corresponding to the standard pronunciation, is needed.

Most existing phonetic transcribers for Spanish and Catalan have been developed for a specific system or application, usually commercial, and they are not public domain ([9], for example). Most non-commercial transcribers available for these two languages ([10,11]) have limitations to their use (for example, they do not allow the transcription of items that are not words, they only handle one phonetic transcription alphabet, IPA or SAMPA, or they only allow transcription through a web interface), which makes the task of phonetising word lists difficult. Also, they usually do not allow transcription in dialects other than the standard one. Saga [12], for Spanish, and Segre [13], for Catalan, do allow the generation of both standard and dialectal phonetic transcriptions, but they do not allow the simultaneous generation of alternative transcriptions for a single item. TransText [14], a phonetic transcriber for Spanish and Catalan, has the same limitation.

This paper presents TransDic, a free distribution tool for the phonetic transcription of word lists not only in standard Spanish and Catalan, but also in several dialects of these two languages spoken in Spain. Its main novel feature is that it allows the generation of dictionaries either with one single transcription per entry or with several phonetic transcription variants, which makes it suitable for the generation of dictionaries for speech recognition systems. Also, it has been designed to generate only 'relevant' dialectal variants, that is, those which are general enough in the corresponding geographical area. It has been developed from TexAFon [15,16], a complete rule-based linguistic processing system for text-to-speech that includes a phonetic transcriber, and which has been improved to include new features, such as the generation of phonetic variants for a single input item and transcription in several non-standard dialects.

The next sections present the general structure of TransDic and its main functionalities, and explain the procedure followed to define and implement the transcription rules in the tool. The results of an evaluation carried out for both languages are also presented.

291 10.21437/IberSPEECH.2018-61
2. General description

TransDic is a multiplatform tool that can be used in Linux, Mac OS or Windows. It is a command-line tool, which requires a set of arguments to be specified for its execution (for example, the language or dialect, or the phonetic alphabet for the transcription). It can therefore be used both as an independent tool and integrated in other tools or procedures.

The TransDic internal structure includes three levels, as illustrated in Figure 1:
• The tool itself, TransDic.
• The processing core, shared by other applications (such as TransText, for the phonetic transcription of texts, or texafon, a full text-processing tool for text-to-speech applications), which includes the letter-to-sound module and other text processing modules, some of them (the text processing module, for example) also used by TransDic.
• A set of language/dialect modules, including language- or dialect-dependent dictionaries and rules. Every dialect is then considered as an 'autonomous' language, with its own rules and dictionaries.

Figure 1: TransDic structure.

The transcription procedure in TransDic involves three main stages, which are similar to the ones in other existing transcription tools:
1) Text preprocessing
2) Lexical stress prediction
3) Phonetic transcription

Unlike other available phonetic transcription tools, TransDic is also able to transcribe non-standard tokens (such as symbols, dates, times or other figures), which are converted to readable Spanish items by its text processing module.

TransDic input must be a UTF-8 text file containing the list of tokens (one line per token) to be transcribed. Arguments allow to specify, for example, the dialect in which the transcription has to be produced:
• Catalan: Standard (ca), Ribagorza (ca_ri), Pallars (ca_pa), Tortosa (ca_to), Central Western (ca_ac), Northern Valencian (ca_vs), Central Valencian (ca_vc), Southern Valencian (ca_vm) and Alicante Valencian (ca_al)
• Spanish: Peninsular Standard (es), Western Andalucía (es_aoc), Eastern Andalucía (es_aor), Extremadura North (es_exn), Extremadura South (es_exs), Canarias (es_can), Castilla-La Mancha (es_clm), Madrid (es_mad) and Murcia (es_mur)

Other arguments also allow to specify the phonetic alphabet to be used (IPA or SAMPA), whether the transcription may include pronunciation variants and syllable marks, or the format of the output dictionary.

It is important to note that the definition of the pronunciation variants to be considered in transcription is not specified directly using arguments, as in Saga, for example. In this case the selection of a given dialect determines the phonetic phenomena and pronunciation variants that will be taken into account.

The output of TransDic is another UTF-8 text file containing the phonetised dictionary, which can be generated in two different formats: default (Figure 2) or HTK (Figure 3).

boxejar  b o k . s e d . ʒ ˈa   b o k . s e j . ʒ ˈa   b o k . s e d . ʒ ˈa ɾ   b o k . s e j . ʒ ˈa ɾ
boxers  b o k . s ˈe s   b o k . s ˈe ɾ s
branca  b ɾ ˈa n . k a
braçalet  b ɾ a . s a . l ˈɛ t
brioix  b ɾ i . ˈɔ j s
brisa  b ɾ ˈi . z a

Figure 2: Sample of default output dictionary generated by TransDic.

estas [estas] ˈe s t a s
sea [sea] s ˈe a
tenía [tenía] t e n ˈi a
nunca [nunca] n ˈu n k a
poder [poder] p o D ˈe r
aquí [aquí] a k ˈi

Figure 3: Sample of output dictionary in HTK format generated by TransDic.

3. Development of the dialect transcription modules

Earlier versions of TexAFon [15, 16] included only language modules for standard Catalan and Spanish. The development of TransDic has been possible after the development of a set of new language modules for the TexAFon processing core, covering several Spanish and Catalan dialects spoken in Spain. Each language/dialect module includes a pronunciation exception dictionary and the transcription rules, written in Python, for the corresponding dialect, and they have been developed from the corresponding standard version. The detailed description of the development procedure of these rules exceeds the scope of this paper, but it can be summarised in the following steps for each dialect:
1. Definition of the phonetic phenomena characterising its pronunciation
2. Implementation of the phonetic rules covering the selected phenomena
3. Evaluation of the transcription and fixing of errors
These three steps are described briefly in the following subsections. A detailed description of the development procedure for the different dialect modules in Spanish and Catalan can be found in [17] and [18], respectively.
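The two output layouts shown in Figures 2 and 3 can be pictured with the short sketch below. It is not TransDic code: the line layouts are inferred from the figures, and the example entries reuse the 'casas' and 'llover' variants discussed later in the paper.

# A rough sketch (not TransDic code) of writing a phonetised dictionary in the two
# layouts shown in Figures 2 and 3: the default layout lists a word followed by all of
# its variants on one line, while the HTK-style layout repeats the output symbol in
# brackets and gives one variant per line. The entries below are illustrative.
entries = {
    "casas": ["k ˈa s a h", "k ˈa s a"],
    "llover": ["ʎ o β ˈe ɾ", "j o β ˈe ɾ", "ʎ o β ˈe", "j o β ˈe"],
}

with open("dict_default.txt", "w", encoding="utf-8") as default_out, \
     open("dict_htk.txt", "w", encoding="utf-8") as htk_out:
    for word, variants in entries.items():
        default_out.write(word + " " + "\t".join(variants) + "\n")   # default format
        for variant in variants:                                     # HTK-style format
            htk_out.write(f"{word} [{word}] {variant}\n")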

3.1. Phonetic characterization of dialects

Existing literature on dialectological studies of Spanish spoken in Spain ([19,20,21,22], among many others) and Catalan ([23], a recent work, also among many others) usually describes dialectal pronunciations in a very detailed way, paying attention to both general and local phenomena. For this work, however, only those phenomena which are general enough in the geographical area of the dialect were considered. And within these general phenomena, a distinction was made between 'primary' (the most frequent in their corresponding geographical area) and 'secondary' (not as frequent as primary ones, but general enough to be taken into account in a general description of the dialect in question), in order to keep the number of pronunciation variants within a reasonable number. Only geographical variants were considered, not social, stylistic nor individual ones.

The goal of this phase was then to define a set of 'primary' and 'secondary' pronunciations for each considered dialect, and to establish the relations between them (that is, which secondary pronunciations are alternative pronunciations to the primary ones). This task was carried out through an extensive literature review for both languages. It was not an easy task at all from a linguistic point of view, as the information provided in the literature is frequently incomplete, with limited information about the frequency or extension of the described phenomena.

The detailed description of all the defined dialect sets is out of the scope of this paper, so only one is presented here in some detail, the one for Canarias Spanish, based mainly on the description provided in [19,21,24,25]. Table 1 presents the subset of primary pronunciations which are different from standard Spanish in this dialect, the subset of secondary pronunciations and the relation between both subsets (that is, which secondary realisations are pronunciation variants of the primary ones). As can be observed in this table, the relation between primary and secondary pronunciations can follow different patterns:
• One primary pronunciation has no secondary pronunciations associated, which means that no alternative realisations are considered. This is the case, for example, of the seseo.
• One primary pronunciation has one (or more) secondary pronunciation(s) associated. This means that all pronunciations, primary and secondary, are possible in the same context, although the primary one is considered more frequent than the secondary one(s). For example, elision of syllable-final orthographical <s> (secondary pronunciation) is a possible alternative realisation to the pronunciation as [h], considered primary (that is, most frequent) in Canarias Spanish.
• One secondary pronunciation has no primary pronunciation associated. This means that it is an alternative to a standard pronunciation, which is also a primary pronunciation in that dialect. This is the case of yeísmo, the realisation of orthographical <ll> as [j], which according to the literature has been considered less frequent than the realisation as [ʎ], the default pronunciation in standard Spanish.
These lists of phonetic phenomena were used for the implementation phase, explained in the following subsection.

Table 1: Primary and secondary phenomena considered for Canarias Spanish.
Primary: Pronunciation of orthographical <c,z> as [s] (seseo): <caza> → [ˈkasa]. Secondary: (none)
Primary: Pronunciation of orthographical <s> at the end of a syllable as [h] (aspiración): <los> → [loh]. Secondary: Elision of syllable-final orthographical <s>: <los> → [lo]
Primary: Pronunciation of orthographical <g, j> as [h] (aspiración): <ajo> → [ˈaho]. Secondary: (none)
Primary: Elision (no pronunciation) of orthographical <d> in words ending with <-ado>: <cansado> → [kanˈsao]. Secondary: Pronunciation of intervocalic <d> as [ð], as in standard Spanish: <cansado> → [kanˈsaðo]
Primary: Elision of intervocalic <d> different from those of <-ado> words: <comido> → [koˈmio]. Secondary: (none)
Primary: Elision of word-final <r,l>: <comer> → [koˈme], <Raquel> → [raˈke]. Secondary: Pronunciation of word-final <r,l> as [r,l], as in standard Spanish: <comer> → [koˈmeɾ], <Raquel> → [raˈkel]
Primary: Elision of word-final <d>: <corred> → [koˈre]. Secondary: Pronunciation of word-final <d> as [d], as in standard Spanish: <corred> → [koˈred]
Primary: Pronunciation of orthographical <ll> as [j] (yeísmo): <calle> → [ˈkaje]. Secondary: (none)

3.2. Implementation

The implementation of the new dialect modules was done in all cases taking as starting point the rules and dictionaries for the corresponding standard dialect and then making the necessary modifications to include the defined primary and secondary phenomena. Some changes in the language-independent core of TexAFon were also made to allow the generation of several pronunciation variants for a single input token.

To implement primary phenomena, new context-dependent Python rules were developed to replace the standard ones in those cases in which the primary pronunciation was different from the standard. Figure 4 presents an example of a rule for a primary pronunciation in Andalusian Spanish. In some cases, the inclusion of these rules also led to modifying the exception dictionary, to make some entries coherent with the new rule.

The implementation of the secondary phenomena was done in a second phase and involved the modification of the primary context-dependent rules to allow the generation of secondary transcriptions for the same context. Figure 5 illustrates the result of the modification of the same rule presented in Figure 4 to include deletion of <s> as a secondary variant.

Finally, if the user has specified with the corresponding argument that transcription variants should be generated, TransDic produces all transcription variants for the input word. These transcription variants are derived from the character-by-character pronunciation variants generated by the transcription rules: the language-independent letter-to-sound
transcription rules: the language-independent letter-to-sound

module takes the obtained pronunciations for each character and combines them to generate the word transcription variants. So, for example, for the word 'casas' two different transcription variants would be generated ([kˈasah], [kˈasa]) using the previously described rules, whereas in the case of 'llover' the output variants would be four ([ʎoβˈeɾ], [joβˈeɾ], [ʎoβˈe], [joβˈe]), as the input word includes two characters with transcription variants, which are combined to create the different alternative word transcriptions.

if ch == "s" and nch == "NIL":
    salida.append(["hh",0,False])
    return salida

Figure 4: Python implementation of a phonetic transcription rule for the pronunciation of orthographical <s> as [h] (aspiración) at the end of a word in Andalucía Spanish (Fodge, 2014).

if ch == "s" and nch == "NIL":
    salida.append(["hh",0,False])
    # Update deletion of word final syllable final s
    salida.append(["",0,False])
    return salida

Figure 5: Python implementation of a secondary pronunciation transcription rule for <s> deletion (in bold) in Andalucía Spanish (Fodge, 2014).
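The combination step described above can be pictured as follows: given the per-character candidate pronunciations produced by rules like those in Figures 4 and 5, the word-level variants are their Cartesian product. The sketch below is only an illustration of the idea, not the actual TexAFon/TransDic letter-to-sound module.

from itertools import product

def combine_variants(char_pronunciations):
    """Combine per-character pronunciation variants into word-level variants."""
    return [" ".join(symbol for symbol in combo if symbol)   # drop empty symbols (elisions)
            for combo in product(*char_pronunciations)]

# 'casas': only the final <s> has two candidate realisations ([h] or elision),
# so two word-level variants are produced.
print(combine_variants([["k"], ["ˈa"], ["s"], ["a"], ["h", ""]]))
# ['k ˈa s a h', 'k ˈa s a']

# 'llover': <ll> has two candidates ([ʎ]/[j]) and the final <r> has two ([ɾ]/elision),
# giving four word-level variants.
print(combine_variants([["ʎ", "j"], ["o"], ["β"], ["ˈe"], ["ɾ", ""]]))
# ['ʎ o β ˈe ɾ', 'ʎ o β ˈe', 'j o β ˈe ɾ', 'j o β ˈe']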
3.3. Evaluation

The procedure to evaluate the transcription produced by the new dialect modules was similar for Spanish and Catalan: lists of isolated words, representative of the implemented phenomena (307 words for Spanish; 99 for Catalan), were processed using each dialect module to obtain the corresponding output transcription. These output transcriptions were then revised manually to detect possible errors. Both evaluations were carried out using a different tool of the TexAFon package, but the evaluated rules were the same ones used in TransDic [17,18].

In the case of Spanish, the transcription performance was perfect: no errors were detected. Some errors were found, however, in the case of the Catalan dialects. An error measure was computed in this case, using the procedure described in [26]: the sum of all phone substitutions (Sub), deletions (Del) and insertions (Ins) divided by the total number of phones in the reference transcription (N). Table 2 presents the obtained error values.

Table 2: Error measures obtained for the Catalan dialects.
Dialect              Value
Ribagorza            8
Pallarés             11
Tortosa              9
Central area         8
North Valencia       9
Central Valencia     10
South Valencia       10
Alicante Valencian   9

The number of generated variants was also evaluated. Phonetised dictionaries were generated with TransDic for all dialects using two reference word lists in Spanish (the 1,000 most frequent words in the CREA corpus [27]) and Catalan (a cleaned version of the CesCa corpus [28], 1,648 words), and the mean number of variants per entry was computed. Tables 3 and 4 present the results.

Table 3: Mean number of variants per entry obtained for the Spanish dialects.
Dialect              Mean number of variants
Standard             1.021
Western Andalucía    1.76
Eastern Andalucía    1.354
Extremadura North    1.354
Extremadura South    1.76
Canarias             1.468
Castilla-La Mancha   1.207
Madrid               1.207
Murcia               1.713

Table 4: Mean number of variants per entry obtained for the Catalan dialects.
Dialect              Mean number of variants
Standard             1
Ribagorza            1.006
Pallarés             1
Tortosa              1
Central area         1
North Valencia       1.269
Central Valencia     1.003
South Valencia       1.003
Alicante Valencia    1.003

The results of these two evaluations indicate that TransDic performs reasonably well both in terms of transcription quality and of the number of generated variants. Spanish modules provide a more accurate transcription than Catalan ones, but they tend to generate more variants per input entry than the Catalan modules. In any case, the mean number of variants per entry is always below 2 in both languages.

4. Conclusions

This paper has presented TransDic, a tool for the generation of phonetised dictionaries for Catalan and Spanish. Its most innovative features are that it allows transcription in several Spanish dialects spoken in Spain (Saga [12], for example, was developed considering mainly the American Spanish dialects) and that it allows the creation of phonetic dictionaries containing a reasonable number of pronunciation variants. The knowledge-based approach used in TransDic, based on a careful selection of the phonetic phenomena considered for transcription, does not lead to an overgeneration of variants, a classical problem in this kind of approach. Finally, another interesting feature of TransDic is that it is available for free download from https://sites.google.com/site/juanmariagarrido/research/resources/tools/transdic. TransText [14], a phonetisation tool which allows the transcription of texts in the same dialects as TransDic, is also available for download from https://sites.google.com/site/juanmariagarrido/research/resources/tools/transtext.

Some expected improvements for the tool in the future are the inclusion of new types of rules for variants (for example, rules for informal pronunciations), and a deeper evaluation of the output by native speakers of each dialect.

5. References

[1] H. Al-haj, R. Hsiao, I. Lane, and A.W. Black, "Pronunciation modeling for dialectal arabic speech recognition", IEEE Workshop on Automatic Speech Recognition & Understanding ASRU 2009, pp. 525–528, 2009.
[2] J. Zheng, Pronunciation Variation Modeling for Automatic Speech Recognition, Ph.D. Thesis, University of Colorado, Boulder, 2014.
[3] P. Taylor, "Hidden Markov Models for Grapheme to Phoneme Conversion", Proceedings of the European Conference on Speech Communication and Technology, Lisboa, Portugal, September 2005, pp. 1973-1976, 2005.
[4] M. Bisani and H. Ney, "Investigations on joint-multigram models for grapheme-to-phoneme conversion", Proceedings of the 7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, pp. 105-108, 2002.
[5] S. Deligne, F. Yvon, and F. Bimbot, "Variable-length sequence matching for phonetic transcription using joint multigrams", Proceedings of the Fourth European Conference on Speech Communication and Technology, EUROSPEECH 1995, Madrid, Spain, September 18-21, pp. 2243-2246, 1995.
[6] A. Laurent, P. Deleglise, and S. Meignier, "Grapheme to phoneme conversion using an smt system", Proceedings of the 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, pp. 716-719, 2009.
[7] M. Gerosa and M. Federico, "Coping with out-of-vocabulary words: open versus huge vocabulary ASR", Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2009, pp. 4313-4316, 2009.
[8] T. Holter and T. Svendsen, "Maximum likelihood modeling of pronunciation variation", Speech Communication, vol. 29, no. 2-4, pp. 177–191, 1999.
[9] P. Bonaventura, F. Giuliani, J. M. Garrido, and I. Ortín, "Grapheme-to-Phoneme Transcription Rules for Spanish, with Application to Automatic Speech Recognition and Synthesis", Proceedings of the Workshop 'Partially Automated Techniques Transcribing Naturally Occurring Continuous Speech', Université de Montréal, Montreal, Quebec, Canada, pp. 33-39, 1998.
[10] X. López, Transcriptor fonético automático del español, 2004. http://www.aucel.com/pln/transbase.html
[11] Molino de Ideas, Transcriptor fonético, 2012. http://www.fonemolabs.com/transcriptor.html
[12] TALP-UPC, SAGA - Phonetic transcription software for all Spanish variants, 2017. https://github.com/TALP-UPC/saga
[13] P. Pachès, C. de la Mota, M. Riera, M. P. Perea, A. Febrer, M. Estruch, J. M. Garrido, M. J. Machuca, A. Ríos, J. Llisterri, I. Esquerra, J. Hernando, J. Padrell, and C. Nadeu, "Segre: An automatic tool for grapheme-to-allophone transcription in Catalan", Proceedings of the Workshop on Developing Language Resources for Minority Languages: Reusability and Strategic Priorities, LREC, pp. 52-61, 2000.
[14] J. M. Garrido, M. Codina, and K. Fodge, "TransText, un transcriptor fonético automático de libre distribución para español y catalán", Actas del Workshop "Subsidia: herramientas y recursos para las ciencias del habla", in press.
[15] J. M. Garrido, Y. Laplaza, M. Marquina, C. Schoenfelder, and S. Rustullet, "TexAFon: a multilingual text processing tool for text-to-speech applications", Proceedings of IberSpeech 2012, Madrid, Spain, pp. 281-289, 2012.
[16] J. M. Garrido, Y. Laplaza, B. Kolz, and M. Cornudella, "TexAFon 2.0: A text processing tool for the generation of expressive speech in TTS applications", Proceedings of LREC 2014, Ninth International Conference on Language Resources and Evaluation, Reykjavik (Iceland), pp. 3494-3500, 2014.
[17] K. Fodge, Introducing Spanish dialects in a linguistic processing module for improved ASR and novel speech synthesis capabilities, Master Thesis, Barcelona: Pompeu Fabra University, 2014.
[18] M. Codina, Automatic Phonetic Transcription of dialectal variance in Catalan, Master Thesis, Barcelona: Pompeu Fabra University, 2016.
[19] P. García Mouton, Lenguas y dialectos de España, Madrid: Arco Libros, 1994.
[20] M. Alvar (Director), Manual de dialectología hispánica. El español de España. Barcelona: Ariel, 1999.
[21] F. Moreno Fernández, La lengua española en su geografía. Madrid: Arco Libros, 2009.
[22] J. A. Samper Padilla, "Sociophonological Variation and Change in Spain", in M. Díaz Campos (Ed.), The Handbook of Hispanic Sociolinguistics. West Sussex: Wiley-Blackwell, pp. 98–117, 2011.
[23] J. Veny and M. Massanell, Dialectologia catalana. Aproximació pràctica als parlars catalans. Barcelona: Universitat de Barcelona, 2015.
[24] M. M. Azevedo, Introducción a la lingüística española, Upper Saddle River: Prentice Hall, 2009.
[25] J. I. Hualde, A. Olarrea, and A. M. Escobar, Introducción a la lingüística hispánica. Cambridge: Cambridge University Press, 2001.
[26] C. Van Bael, L. Boves, H. Van den Heuvel, and H. Strik, "Automatic Phonetic Transcription of Large Speech Corpora", Computer Speech & Language, vol. 21, no. 4, pp. 652-668, 2007.
[27] Real Academia Española, Banco de datos (CREA). Corpus de referencia del español actual. http://corpus.rae.es/lfrecuencias.html
[28] A. Llauradó, M. A. Martí, and L. Tolchinsky, "Corpus CesCa: Compiling a corpus of written Catalan produced by school children", International Journal of Corpus Linguistics, vol. 17, no. 3, pp. 428–441, 2012.

IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

Wide Residual Networks 1D for Automatic Text Punctuation


Jorge Llombart, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

ViVoLAB, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
jllombg@unizar.es, amiguel@unizar.es, ortega@unizar.es, lleida@unizar.es

Abstract

Documentation and analysis of multimedia resources usually requires a large pipeline with many stages. It is common to obtain texts without punctuation at some point, although later steps might need accurate punctuation, like the ones related to natural language processing. This paper is focused on the task of recovering pause punctuation from a text without prosodic or acoustic information. We propose the use of Wide Residual Networks to predict which words should be followed by a comma or stop in a text from which punctuation has been removed. Wide Residual Networks are a well-known technique in image processing, but they are not commonly used in other areas such as speech or natural language processing. We propose the use of Wide Residual Networks because they show great stability and the ability to work with long and short contextual dependencies in deep structures. Unlike in image processing, we use 1-dimensional convolutions, because in text processing we only focus on the temporal dimension. Moreover, this architecture allows us to work with past and future context. This paper compares this architecture with Long-Short Term Memory cells, which are commonly used in this task, and also combines the two architectures to get better results than each of them separately.

Index Terms: Text Punctuation, Wide Residual Network, Recurrent Neural Networks, Natural Language Processing

1. Introduction

Nowadays, there is a great amount of multimedia resources from which we want to extract more and more information. This usually requires a large pipeline with many stages where, at some point, accurate punctuation is needed to accomplish later steps like machine translation, summarization, sentiment analysis, etc. This task is usually performed after some automatic speech recognition system, and it usually uses some prosodic information like pause times. Some systems are based on a capitalization and punctuation system [1], on Long-Short Term Memory cells (LSTM) [2] or on Bidirectional Long-Short Term Memory cells (BLSTM) [3], but it is not always possible to access these features. Other works do not use prosodic features. Some use Conditional Random Fields (CRF) [4], or a combination with LSTM [5]. There are approaches with Convolutional Neural Networks (CNN) [6], or distilling from an ensemble of different neural networks [7]. There are also works focused on character-level neural networks [8], which also use architectures with Convolutional Highway, Residual networks and BLSTM.

In image classification, convolutional neural networks have achieved very good results. In those tasks convolutional networks have gained depth, even hundreds of layers. This was made possible by Residual Networks [9]. They use very deep and thin structures, but they involve a great amount of computation. Wide Residual Networks (WRN) were designed to obtain better results with less depth and, therefore, less computation, by increasing the width [10, 11].

Figure 1: Input sequence and its tag correspondence for an input sequence of T = 9 when the text only has 8 words. (The figure shows the example input XIn "en la fotografía muy simplona vemos a tom" plus one padding element, with labels YTrue "Space", "Space", "Comma", "Space", "Comma", "Space", "Space", "Period" and the padding label, at positions t0-t8.)

Wide Residual Networks are also used in automatic speech recognition, obtaining good results [12, 13]. In those architectures, Wide Residual Networks use two-dimensional convolutions, considering the cepstral input as an image, but in this study we consider that in speech and natural language processing the important dimension is the temporal dimension, and correlation along other dimensions is lower, so we use 1D convolutions.

This paper is divided in four sections. The first is the introduction. In the second section we describe the punctuation task and the neural network models used to perform this task. In section three we explain how the experiments are done. And in the last section we present some conclusions and future work.

2. Text punctuation modelling

In this paper, we focus on recovering punctuation from Spanish texts where punctuation is missing. Our experiments are designed to recover periods and commas, and we simulate the task by deleting them from a collection of texts. In Spanish there are several different ways to express the end of a sentence: closing exclamation, closing interrogation, full stop, period, etc. In order to simplify the classification we consider that every kind of sentence end is marked as a period. So the set of labels in our experiments is L = {"Space", "Period", "Comma", "Pad"}, where we label each input word of our experiments as:
• word without punctuation mark: "Space"
• word followed by any kind of period: "Period"
• word followed by comma: "Comma"
• padding: "PAD".

This experiment uses a fixed dictionary O = {0, ..., D+1}, with D the vocabulary size plus two extra symbols. There are two special words: the first one is o_0 = "PAD", which is reserved to fill input sequences when necessary, and the second is o_1 = "UNK", which is reserved for those words not included in O, usually called out-of-vocabulary words.

Text punctuation is a classification task in which we want to assign a label from L to each element of an input sequence X = {x_0, ..., x_t, ..., x_T}, with T the maximum number of elements in the input sequence. Each x_t corresponds to one word o_d ∈ O. Our neural network provides as output a sequence of labels Y = {y_0, ..., y_t, ..., y_T}, ∀ y_t ∈ L. In Figure 1 we represent an input vector XIn with the training label vector YTrue that corresponds to it.
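As an illustration of this labelling scheme, the sketch below turns a punctuated sentence into the word sequence XIn and label sequence YTrue of Figure 1; it is a simplified stand-in for the authors' preprocessing, and the fixed sequence length is an arbitrary choice made here.

def text_to_examples(text, seq_len):
    """Turn a punctuated text into a word sequence and its Space/Period/Comma/PAD labels."""
    words, labels = [], []
    for token in text.split():
        stripped = token.rstrip(".,!?")
        if token.endswith((".", "!", "?")):      # every sentence end is mapped to "Period"
            label = "Period"
        elif token.endswith(","):
            label = "Comma"
        else:
            label = "Space"
        words.append(stripped.lower())
        labels.append(label)
    # Pad (or truncate) to the fixed sequence length T used by the network.
    pad = seq_len - len(words)
    words = (words + ["PAD"] * pad)[:seq_len]
    labels = (labels + ["PAD"] * pad)[:seq_len]
    return words, labels

print(text_to_examples("en la fotografía, muy simplona, vemos a tom.", seq_len=9))
# (['en', 'la', 'fotografía', 'muy', 'simplona', 'vemos', 'a', 'tom', 'PAD'],
#  ['Space', 'Space', 'Comma', 'Space', 'Comma', 'Space', 'Space', 'Period', 'PAD'])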

296 10.21437/IberSPEECH.2018-62
2.1. Wide Residual Networks 1D

WRNs are usually used in image tasks where two spatial dimensions provide the context to work with. In automatic speech recognition we can use spectrograms as an image to adopt this two-dimensional schema [12, 13]. However, in text processing we work only with the temporal dimension, because word relations are only meaningful along this dimension. The WRN is a structure based on 2D-convolutional networks. In order to adapt this structure to our environment, we work with 1D-convolutional layers. In the image context, convolutional layers are characterized by the number of output channels, C, but in the text context we consider the channels as a temporal sequence of vectors, where each element of these vectors is one of the C filter responses for that moment of the sequence.

A WRN consists of several blocks, which in this paper we call Wide Residual Blocks (WRB). In each WRB we increase the number of channels in the output; we widen the convolutional output. These WRBs are in turn built from several Residual Blocks (RB). These blocks are the basic elements of WRNs, and they have two paths. The first path consists of two convolutional layers with batch normalization [14] and ReLU non-linearity [15]. The second path is a residual connection between the block input and output. In this work we use the sum of the block input and the convolutional path as the residual connection. When the number of channels of the two paths differs, we adjust the number of channels with a convolutional layer in the residual path. In Figure 2a we show the complete structure of the WRNs that we use in this work. We characterize the WRB by its widen factor, which is the factor we apply to the number of channels at the block input to gain width. In Figure 2a we describe convolutional layers by their kernel dimension. This kernel dimension is the number of words that the convolutional filter takes to compute each element in the output. In this work the kernel dimension of the first convolution layer is 5. Convolutional layers in WRBs use a kernel of dimension 3. There is a special case when we use a convolutional layer with a kernel dimension of one. This kind of convolutional layer is sometimes called a 1x1 convolution, position-independent convolution or position-wise fully connected layer. It is like a fully connected layer with all its weights shared through time. In other words, each element of the input is evaluated with the same weights along the whole sequence. For these layers we also describe the input and output dimension of each sequence element as [input dimension × output dimension].

Usually in classification we adopt the many-to-one structure, where we use an input sequence to classify the central element of the sequence. In convolutional networks for classification, we usually use pooling and reduction operations in order to classify this central element. A problem of this architecture is that, to classify two contiguous elements, we need to process two sequences that differ only in one element. This means that we recompute the same element in the sequence T − 1 times. In order to reduce this massive excess of computation in long context sequences we use a many-to-many paradigm. In our experiments we select unique, very long sequences from the text as the context, and we compute the convolutions over the whole sequence only once. For this, our WRNs do not use any pooling strategy, the stride is always 1, and we use the appropriate padding in order to fix the edge effects in convolutional layers. Therefore, at the output of the WRBs, XWRN in Figure 2a, we have the same sequence length as the input, T. In Figure 2b the shape of the WRN output sequence is represented. N is the number of sequence examples evaluated at the same time in a forward pass, also called a mini-batch. Each sequence has T elements, where each element is a vector x_{WRN,n,t}, n ∈ N, t ∈ T, with C_W the convolutional filter responses. In Figure 2b we show that each x_{WRN,n,t} is evaluated through the position-independent convolution layer and a Softmax non-linearity to obtain the final classification.

To train the network we use the cross-entropy as cost function J(x, y). In the case of the many-to-one structure we compute the mini-batch error as:

J_{many\text{-}to\text{-}one} = \frac{1}{N} \sum_{n=0}^{N-1} J(x_n, y_n) \quad (1)

where each example sequence of T elements corresponds to one of the N examples in the mini-batch. In our case we implement the many-to-many structure. We also use the cross-entropy loss function J(x, y), but the mini-batch cost is computed for each of the T elements of each of the N examples of the mini-batch as:

J_{many\text{-}to\text{-}many} = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{T} \sum_{t=0}^{T-1} J(x_{n,t}, y_{n,t}) \quad (2)

2.2. Long-Short Term Memory Cells

In this study we compare the WRN with LSTM and BLSTM structures. In these cases we use the same input XIn as described in Section 2. In Figure 3 we show that the input sequences are processed by an LSTM or BLSTM, obtaining a sequence with the same length as the input. In the case of an LSTM or a BLSTM, the sequence XIn passes through the embedding layer as in a WRN, obtaining a sequence of the same length, but each element x_t of the sequence is a vector of embedding dimension C, as represented in Figure 3. This sequence is evaluated by the LSTM or the BLSTM, obtaining a sequence where each element has dimension H or 2∗H respectively, with H the hidden dimension of the LSTM or BLSTM layers. With this structure we use the same many-to-many philosophy as in the WRN, and we use the same loss function described in Equation 2.

To obtain a combination of WRN and BLSTM we just substitute the position-independent convolution and the Softmax in the WRN by the BLSTM structure shown in Figure 3. Then we remove or add WRBs to the structure to get the different results discussed in Section 3.

3. Experiment and Results

For this experiment we have collected 1,181,413 articles from the electronic editions of 32 diverse Spanish magazines and newspapers, from January 2017 up to February 2018. Each article has 506 words on average. For our purposes all texts were lower-cased; numbers, dates, hours and Roman numerals were automatically transcribed and normalized; some units and abbreviations like Km, m, Kg, min, were changed by their transcription; all symbols were deleted; and final question and exclamation marks were substituted by dots. This data set was segmented into a train set (980,000 articles), a development set (101,000 articles) and a test set (100,413 articles), randomly chosen. The dictionary for this work is composed of all words in the training text which appear at least 50 times, in order to avoid misspelled and rare words. This means that all our experiments use a dictionary with 116,737 words including special delimiters.
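To make the block structure of Section 2.1 concrete, the following is a minimal PyTorch sketch of a 1D residual block of the kind described above: two kernel-3 convolutions with batch normalization and ReLU, summed with a residual path that uses a kernel-1 convolution when the channel counts differ. It is a reconstruction from the text (the exact layer ordering inside the blocks is our assumption), not the authors' code.

import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """Kernel-3 conv -> BN -> ReLU, twice, summed with a (possibly projected) residual."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv_path = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )
        # 1x1 ("position independent") convolution to match channel counts on the residual path.
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv1d(in_channels, out_channels, kernel_size=1))

    def forward(self, x):            # x: (batch N, channels, sequence length T)
        return self.conv_path(x) + self.residual(x)

# A toy stack: embeddings of dimension 1024, a first kernel-5 convolution with 512 channels,
# then one block widened by a factor of 8 (8 * 16 = 128 channels), as in the experiments.
block = nn.Sequential(
    nn.Conv1d(1024, 512, kernel_size=5, padding=2),
    ResidualBlock1D(512, 8 * 16),
)
out = block(torch.randn(4, 1024, 100))   # stride 1 and padding keep the length: (4, 128, 100)
print(out.shape)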

(a) Wide Residual Network. (b) Linear classification block in a WRN.
Figure 2: Diagram of the Wide Residual Network architecture. The left figure dis-aggregates the different blocks in the WRN, and the right figure shows the dimensions and schema of the linear classification block in the WRN.

Figure 3: BLSTM Classification Architecture.

In all of the architectures tested in this work we use as first layer an embedding layer with input dimension the number of words in the dictionary, and 1024 as output dimension. The first convolutional layer in the structures has 512 output channels, a stride of one and a padding of two in order to cover edge effects. In the wide residual blocks we use 8 as the widen factor. The dimension of each WRB is the widen factor times 16 for the first block, the widen factor times 32 in the second block, the widen factor times 64 in the third block and the widen factor times 128 in the fourth block. The LSTM and BLSTM architectures are composed of two layers of 1024 cells. All the architectures were trained during 500 iterations where, for this experiment, we consider one training iteration to be the processing of 3,000 random articles from the training set. All trainings were accomplished by the back-propagation procedure with the Adam optimizer [16] configured with default hyper-parameters in the Pytorch software [17].

In order to understand the behaviour of each architecture, we have studied the response of the output when we modify the input. We want to get an idea of the contribution of each word of the input to the classification at a particular output time. For this we have cancelled sequentially each word of the input, by forcing the output of the embedding layer for this word to zero, and representing the value of the network output after a forward pass of this modified input. We assume that the words that contribute most to a correct classification are those that, when cancelled, disturb the classification most. In Figure 4 we can see the output values for the "Period" class before the Softmax non-linearity of the LSTM, BLSTM and WRN for an output time where a "Period" should be predicted. The output for the period class is shown as each word of the input is sequentially cancelled. One of the first things we can look at is the temporal distance of the first and last cancelled word that causes a significant perturbation in the "Period" classification. The LSTM only shows perturbations for words before the moment at which we want the classification. We can see that this architecture only uses past information for classification. This behavior may lead to worse performance, because future words should be useful in the punctuation task: we cannot say that a sentence has ended if we do not know when the text changes sense, subject or action. In the BLSTM architecture we can see that it uses words from the past and from the future. Even the words that cause a major perturbation come from the future context. The case of the WRN is the same as the BLSTM, but it uses a smaller number of words for the classification, because the perturbations are closer to the classification time.

For evaluation purposes we use three measures, precision, recall and F1-measure, for each class of interest:

\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}} \quad (3)

\mathrm{Recall} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}} \quad (4)

F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (5)

For these measures only the relevant cases are taken into account, as seen in equations 3, 4 and 5, which only consider the True Positives, False Positives and False Negatives for each class of interest.

For this paper we start by considering as baselines an LSTM and a BLSTM. In Table 1 we show that the BLSTM provides better results

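As a concrete reading of Equations 3 to 5, here is a small helper that computes the per-class measures from aligned sequences of predicted and reference labels (plain Python; the label names are only examples).

# Per-class precision, recall and F1 as in Equations 3-5: only the true positives,
# false positives and false negatives of the class of interest are counted.
def per_class_prf(predicted, reference, label="Period"):
    tp = sum(1 for p, r in zip(predicted, reference) if p == label and r == label)
    fp = sum(1 for p, r in zip(predicted, reference) if p == label and r != label)
    fn = sum(1 for p, r in zip(predicted, reference) if p != label and r == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: per_class_prf(predicted_tags, reference_tags, label="Comma")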
298
[Figure 4 plot: vertical axis "Period class network output"; horizontal axis "Canceled word in input sequence".]
Figure 4: Variations of the "Period" class output caused by the cancellation of each input word, from the perspective of a true Period label after the word "ambiente", indicated by the vertical line. The grey shadow marks the context that affects the classification. From top to bottom we represent the same output for the LSTM, BLSTM and WRN architectures.

Table 1: Precision, Recall and F1-measure percentage for "Period" and "Comma" classes along the different architectures.

Architecture          Period: Precision  Recall  F1        Comma: Precision  Recall  F1
LSTM 73.78 82.18 77.75 87.74 81.36 84.43
BLSTM 87.12 88.96 88.03 91.28 89.78 90.52
Convolutional 1WB 84.05 87.47 85.73 89.76 86.88 88.30
Convolutional 2WB 86.22 89.11 87.64 91.04 88.60 89.80
Convolutional 3WB 89.66 83.75 86.60 88.01 92.51 90.20
Convolutional 4WB 73.79 96.22 83.53 95.92 72.22 82.40
WRN 1WRB 83.10 88.66 85.79 90.67 85.94 88.24
WRN 2WRB 86.20 88.91 87.53 91.13 88.89 90.00
WRN 3WRB 87.41 89.02 88.21 91.17 89.84 90.50
WRN 4WRB 76.30 87.79 81.64 89.23 78.76 83.67
WRN 1WRB + BLSTM 89.33 92.36 90.82 93.86 91.38 92.60
WRN 2WRB + BLSTM 89.28 90.74 90.00 92.62 91.43 92.02
WRN 3WRB + BLSTM 86.93 89.92 88.40 81.81 89.32 90.55

The BLSTM achieves an F1 of 88.03 for the "Period" class and 90.53 for "Comma". These improvements come from the use of past and future context in the BLSTM.
Residual connections usually work better when the network is very deep, but they also help in shallower networks. In order to see how residual connections help the classification, we trained convolutional networks with exactly the same architecture as the WRNs but without residual connections. Table 1 shows that, as we increase the number of blocks in the convolutional networks without residual connections, they start to obtain worse results than the same configuration in the WRNs. With only one block they perform equally, but with three blocks the network without residual connections performs worse. We achieve the best result with a WRN using three blocks, obtaining an F1 of 88.21 for "Period" and 90.50 for "Comma", the same performance as the BLSTM baseline.
In the last experiments we concatenate the output of a WRN with the input of the BLSTM, without the embedding layer. These two architectures are compatible because both are designed to work with sequences, processing them element by element, so no transformation of the inputs or outputs is needed. In Table 1 we present the results of concatenating a WRN of one, two or three blocks with a BLSTM that has the same architecture as the baseline. All those configurations perform better than either of the two architectures alone. The WRN with one block gives the best result, with an F1 of 90.82 for "Period" and 92.60 for "Comma".
4. Conclusions and future work
In this work, we have shown the use of Wide Residual Networks for punctuation recovery. We show that architectures that use past and future context obtain better results. WRNs perform similarly to BLSTMs in our experiments, so other aspects of the architecture should be taken into account, such as training time or the resources consumed. The best results are obtained when WRNs and BLSTMs are concatenated. This suggests using WRNs as feature extractors; we then need structures more powerful than a linear classifier to perform the final classification. In problems that require sequential processing, BLSTM structures that use past and future context are the best option.
For future work, it would be interesting to differentiate between question marks and periods. In Spanish, this is a challenging task since we need to predict not only the closing question mark but also the opening question mark, a symbol not used in other languages such as English. Future work will also pay attention to architecture improvements such as hyper-parameter configuration or the inclusion of attention mechanisms.

5. Acknowledgements
This work is supported by the Spanish Government and the European Union under project TIN2017-85854-C4-1-R, and by Gobierno de Aragón / FEDER (research group T36 17R). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

299
6. References
[1] F. Batista, H. Moniz, I. Trancoso, and N. Mamede, “Bilingual ex-
periments on automatic recovery of capitalization and punctuation
of automatic speech transcripts,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 2, pp. 474–485,
2012.
[2] O. Tilk and T. Alumäe, “Lstm for punctuation restoration in
speech transcripts,” in Sixteenth annual conference of the inter-
national speech communication association, 2015.
[3] ——, “Bidirectional recurrent neural network with attention
mechanism for punctuation restoration.” in Interspeech, 2016, pp.
3047–3051.
[4] W. Lu and H. T. Ng, “Better punctuation prediction with dynamic
conditional random fields,” in Proceedings of the 2010 conference
on empirical methods in natural language processing. Associa-
tion for Computational Linguistics, 2010, pp. 177–186.
[5] K. Xu, L. Xie, and K. Yao, “Investigating lstm for punctuation
prediction,” in Chinese Spoken Language Processing (ISCSLP),
2016 10th International Symposium on. IEEE, 2016, pp. 1–5.
[6] X. Che, C. Wang, H. Yang, and C. Meinel, “Punctuation predic-
tion for unsegmented transcript based on word vector.” in LREC,
2016.
[7] J. Yi, J. Tao, Z. Wen, and Y. Li, “Distilling knowledge from an en-
semble of models for punctuation prediction,” Proc. Interspeech
2017, pp. 2779–2783, 2017.
[8] W. Gale and S. Parthasarathy, “Experiments in character-level
neural network models for punctuation,” in Proceedings Inter-
speech, 2017, pp. 2794–2798.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 770–778.
[10] S. Zagoruyko and N. Komodakis, “Wide residual networks,”
arXiv preprint arXiv:1605.07146, 2016.
[11] H. Li, Z. Xu, G. Taylor, and T. Goldstein, “Visualizing the loss
landscape of neural nets,” arXiv preprint arXiv:1712.09913, 2017.
[12] H. K. Vydana and A. K. Vuppala, “Residual neural networks
for speech recognition,” in Signal Processing Conference (EU-
SIPCO), 2017 25th European. IEEE, 2017, pp. 543–547.
[13] J. Heymann, L. Drude, and R. Haeb-Umbach, “Wide residual BLSTM
network with discriminative speaker adaptation for robust speech
recognition,” in Proceedings of the 4th International Workshop on
Speech Processing in Everyday Environments (CHiME16), 2016,
pp. 12–17.
[14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” arXiv
preprint arXiv:1502.03167, 2015.
[15] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural
networks,” in Proceedings of the fourteenth international confer-
ence on artificial intelligence and statistics, 2011, pp. 315–323.
[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” arXiv preprint arXiv:1412.6980, 2014.
[17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito,
Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differ-
entiation in pytorch,” in NIPS-W, 2017.

300
IberSPEECH 2018
21-23 November 2018, Barcelona, Spain

End-to-End Multi-Level Dialog Act Recognition


Eugénio Ribeiro 1,2, Ricardo Ribeiro 1,3, and David Martins de Matos 1,2

1 L2F – Spoken Language Systems Laboratory – INESC-ID Lisboa, Portugal
2 Instituto Superior Técnico, Universidade de Lisboa, Portugal
3 Instituto Universitário de Lisboa (ISCTE-IUL), Portugal
eugenio.ribeiro@l2f.inesc-id.pt
Abstract
The three-level dialog act annotation scheme of the DIHANA corpus poses a multi-level classification problem in which the bottom levels allow multiple or no labels for a single segment. We approach automatic dialog act recognition on the three levels using an end-to-end approach, in order to implicitly capture relations between them. Our deep neural network classifier uses a combination of word- and character-based segment representation approaches, together with a summary of the dialog history and information concerning speaker changes. We show that it is important to specialize the generic segment representation in order to capture the most relevant information for each level. On the other hand, the summary of the dialog history should combine information from the three levels to capture dependencies between them. Furthermore, the labels generated for each level help in the prediction of those of the lower levels. Overall, we achieve results which surpass those of our previous approach using the hierarchical combination of three independent per-level classifiers. Furthermore, the results even surpass the results achieved on the simplified version of the problem approached by previous studies, which neglected the multi-label nature of the bottom levels and only considered the label combinations present in the corpus.
Index Terms: dialog act recognition, DIHANA corpus, multi-level classification, multi-label classification

1. Introduction
Dialog act recognition is an important task in the context of Natural Language Understanding (NLU) since dialog acts reveal the intention behind the uttered words, allowing the application of specialized interpretation approaches. Consequently, it has been widely explored over the years on multiple corpora with different characteristics [1]. In this sense, the distinguishing aspect of the DIHANA corpus [2], which features interactions in Spanish between humans and a train information dialog system, is its three-level annotation scheme. While the top level refers to the generic task-independent dialog act, the others complement it with task-specific information. Additionally, while each segment has a single top-level label, it may have multiple or no labels on the other levels. Thus, the DIHANA corpus poses both multi-level and multi-label classification problems. However, most previous studies on this corpus approached the task as a single-label classification problem in which the label of a segment was the combination of all its labels. Contrarily, in [3] we explored each level independently, approaching the bottom levels as multi-label classification problems, and then combined the best classifiers for each level hierarchically. Among other conclusions, in that study we have shown that there are dependencies between the multiple levels which the independent classifiers cannot capture unless the information is explicitly provided. Thus, in this paper we approach the problem using an end-to-end classifier to predict the labels of the three levels in parallel, so that the relations between the levels are captured implicitly. Additionally, we explore approaches on segment and context information representation which have recently been proved successful on the dialog act recognition task and were not used in our previous study on the DIHANA corpus.
In the remainder of the paper we start by providing an overview of previous work on dialog act recognition on the DIHANA corpus and how it can be improved, in Section 2. Then, in Section 3, we describe our experimental setup, including the corpus, the network variations, and the training and evaluation approaches. The results of our experiments are presented and discussed in Section 4. Finally, Section 5 states the most important conclusions of this study.

2. Related Work
Automatic dialog act recognition is a task that has been widely explored using multiple classical machine learning approaches, from Hidden Markov Models (HMMs) to Support Vector Machines (SVMs) [1]. However, recently, most approaches on the task take advantage of Deep Neural Network (DNN) architectures to capture different aspects of the dialog [4, 5, 6, 7, 8, 9].
Similarly to the studies on English data, the first studies on the DIHANA corpus employed HMMs using both prosodic [10] and textual [11] features. The study using prosodic features focused on the prediction of the generic top level labels, while the study using textual features considered the combination of the multiple levels. Additionally, the latter study, as well as a more recent one [12], explored the recognition of dialog acts on unsegmented turns using n-gram transducers. However, in those cases, the focus was on the segmentation process. The results of the HMM-based approaches were surpassed in a study that applied SVMs [13] to a feature set consisting of word n-grams, the presence of wh-words and punctuation, and context information from up to three preceding segments in the form of the same features. All of these studies neglected the multi-label nature of the bottom levels of the dialog act annotation scheme of the DIHANA corpus and approached a simplified single-label problem in which the label set consisted of the label combinations present in the corpus. However, this approach limits the possible combinations to those existing in the dataset and neglects the distinguishing characteristics of each individual label.
Contrarily to those studies, in [3] we explored each level independently, approaching the bottom levels as multi-label classification problems, and then combined the best classifiers for each level hierarchically. In that study we compared those that were the two top performing DNN-based approaches on segment representation for dialog act recognition on the Switchboard Dialog Act Corpus [14], which is the most explored
301 10.21437/IberSPEECH.2018-63
corpus for the task. One of those approaches uses a stack of Long Short Term Memory (LSTM) units to capture long distance relations between tokens [7], while the other uses multiple parallel Convolutional Neural Networks (CNNs) with different context window sizes to capture different functional patterns [8]. We have shown that the CNN-based approach leads to better results on every level of the DIHANA corpus. However, while wider context windows are better for predicting the generic dialog acts of the top level, the task-specific bottom levels are more accurately predicted when using narrower windows. Furthermore, recently, we have shown that the performance can be improved by using a Recurrent Convolutional Neural Network (RCNN)-based segment representation approach that is able to capture long distance relations and discards the need for selecting specific window sizes for convolution [9]. Additionally, we have shown that a character-based segment representation approach achieves similar or better results than an equivalent word-based approach on the Switchboard corpus and the top level of the DIHANA corpus and that the information captured by both approaches is complementary [15].
In [3] we have also shown that context information concerning the dialog history and the classification of the upper levels is relevant for the task. Concerning the dialog history, we have explored the use of information from up to three preceding segments in the form of their classifications. On the first two levels, similarly to what happened in previous studies on the influence of context on dialog act recognition [16, 8], we have observed that the first preceding segment is the most important and that the influence decays with distance. On the other hand, since the bottom level refers to information that is explicitly referred to in the segment, it is not influenced by information from the preceding segments, at least at the same level. Recently, we have shown that the representation of information from the preceding segments used in previous studies does not take the sequentiality of those segments into account and that the whole dialog history can be summarized in order to capture that information as well as relations with more distant segments [9]. Additionally, in this paper, we further explore the relations between levels by using an end-to-end approach to predict the labels of the three levels in parallel and capture those relations implicitly.

3. Experimental Setup
This section presents our experimental setup, starting with a description of the corpus, followed by an overview of the aspects addressed by our experiments and the used network architecture and a description of the training and evaluation approaches.

3.1. Dataset
The DIHANA corpus [2] consists of 900 dialogs between 225 human speakers and a Wizard of Oz telephonic train information system. There are 6,280 user turns and 9,133 system turns, with a vocabulary size of 823 words. The turns were manually transcribed, segmented, and annotated with dialog acts [17]. The total number of annotated segments is 23,547, with 9,715 corresponding to user segments and 13,832 to system segments.
The dialog act annotations are hierarchically decomposed in three levels [18]. The top level, Level 1, represents the generic intention of the segment, while the others refer to task-specific information. There are 11 Level 1 labels, out of which two are exclusive to user segments and four to system segments. Overall, the most common label is Question, covering 27% of the segments, followed by the Answer and Confirmation labels, covering 18% and 15%, respectively. This is consistent with the information-transfer nature of the dialogs.
Although they share most labels, the two task-specific levels focus on different information. While Level 2 is related to the information that is implicitly focused in the segment, Level 3 is related to the kind of information that is explicitly referred to in the segment. There are 10 labels common to both levels and three additional ones on Level 3. The most common Level 2 labels are Departure Time, Fare, and Day, which are present in 32%, 14%, and 8% of the segments, respectively. On the other hand, the Level 3 label distribution is more balanced, with the most common labels, Destination, Day, and Origin, being present in 16%, 16%, and 13% of the segments, respectively.
While a segment has a single Level 1 label, it may have multiple or no labels in the other levels. In this sense, only 63% of the segments have Level 2 labels, and that percentage is even lower, 52%, when considering Level 3 labels. This is mainly due to the fact that Level 1 labels concerning dialog structuring or communication problems cannot be paired with any labels in the remaining levels.

3.2. End-to-End Neural Network Architecture
In our experiments, we incrementally built the architecture of our network by assessing the performance of different approaches for each step. However, due to space constraints, we are not able to show individual figures for all of those approaches. Thus, in Figure 1 we show the architecture of the final network and use it to refer to the alternatives we explored.
At the top are our two complementary segment representation approaches. On the left is the word-based approach, which captures information concerning both word sequences and functional patterns using the adaptation of the RCNN by Lai et al. [19] which we introduced in [9]. In our adaptation we replaced the simple Recurrent Neural Networks (RNNs) used to capture the context surrounding each token by Gated Recurrent Units (GRUs), in order to capture relations with more distant tokens. To represent each word, in our experiments we used 200-dimensional Word2Vec [20] embeddings trained on the Spanish Billion Word Corpus [21]. On the right is the character-based approach we introduced in [15], which uses three parallel CNNs with different window sizes to capture relevant patterns concerning affixes, lemmas, and inter-word relations. We performed experiments using each approach individually, as well as their combination, which is depicted in Figure 1.
The representation of the segment can then be combined with context information concerning the dialog history and speaker changes. We provide the latter in the form of a flag stating whether the speaker changed in relation to the previous segment, as in [8, 9]. To provide information from the preceding segments we use the approach we introduced in [9], which summarizes the dialog history by passing the sequence of dialog act labels through a GRU. We performed experiments using a single summary combining information concerning the three levels, as well as using per-level summaries which summarize the sequence of preceding labels of each level individually.
To predict the dialog act labels for the segment, the combined representation is passed through two dense layers. While the first reduces its dimensionality and identifies the most relevant information present in that representation, the second generates the labels. In terms of the dimensionality reduction layer, we experimented using a single layer that captures the most relevant information that is generic to the three levels, as well as per-level dimensionality reduction layers, which capture the most relevant information for each level, as depicted in Figure 1.
302
[Figure 1 diagram: the word-based (RCNN) and character-based (CNNs with w = 3, 5, 7) segment representations are concatenated with the dialog history summary (GRU) and the speaker change flag; per-level dense reduction layers feed the Level 1 (softmax), Level 2 (sigmoid) and Level 3 (sigmoid) outputs.]
Figure 1: The architecture of our network. wi refers to the embedding representation of the i-th word, while ci refers to the embedding representation of the i-th character. The circles represent the concatenation of the different inputs.

Since the first level poses a single-label classification problem, the output layer uses the softmax activation and the categorical cross entropy loss function. On the other hand, since the other levels pose multi-label classification problems, the corresponding output layers use the sigmoid activation and the binary cross entropy loss function which, given the possibility of multiple labels, is actually the Hamming loss function [22]. In both cases, for performance reasons, we use the Adam optimizer [23].
In [3], we have shown that the prediction of dialog act labels of a certain level is improved when information concerning the upper levels is available. Thus, as shown in Figure 1, we also performed experiments that considered the output from the upper levels in the dimensionality reduction layers.
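The following Keras sketch shows a network with this overall shape. It is a simplified, hypothetical rendering rather than the authors' code: the word branch is reduced to a bidirectional GRU instead of the full RCNN adaptation, and all dimensions except the 200-dimensional word embeddings, the 823-word vocabulary and the label counts (11, 10 and 13) are arbitrary choices.

# Schematic sketch of the multi-level architecture (simplified, illustrative sizes).
from tensorflow.keras import layers, Model

MAX_WORDS, MAX_CHARS, HIST_LEN = 50, 200, 10
VOCAB, CHARS, N_L1, N_L2, N_L3 = 823, 60, 11, 10, 13

words = layers.Input((MAX_WORDS,), name="words")
chars = layers.Input((MAX_CHARS,), name="chars")
history = layers.Input((HIST_LEN, N_L1 + N_L2 + N_L3), name="history")
speaker = layers.Input((1,), name="speaker_change")

# Word-based segment representation (stand-in for the RCNN adaptation).
w = layers.Embedding(VOCAB, 200)(words)
w = layers.Bidirectional(layers.GRU(128))(w)

# Character-based segment representation: three parallel CNNs with different windows.
c = layers.Embedding(CHARS, 32)(chars)
c = layers.concatenate([layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu")(c))
                        for k in (3, 5, 7)])

# Summary of the dialog history (sequence of preceding label vectors) and speaker flag.
h = layers.GRU(64)(history)
combined = layers.concatenate([w, c, h, speaker])

# Per-level dimensionality reduction; each level also sees the outputs of the upper levels.
r1 = layers.Dense(128, activation="relu")(combined)
l1 = layers.Dense(N_L1, activation="softmax", name="level1")(r1)
r2 = layers.Dense(128, activation="relu")(layers.concatenate([combined, l1]))
l2 = layers.Dense(N_L2, activation="sigmoid", name="level2")(r2)
r3 = layers.Dense(128, activation="relu")(layers.concatenate([combined, l1, l2]))
l3 = layers.Dense(N_L3, activation="sigmoid", name="level3")(r3)

model = Model([words, chars, history, speaker], [l1, l2, l3])
model.compile(optimizer="adam",
              loss={"level1": "categorical_crossentropy",
                    "level2": "binary_crossentropy",
                    "level3": "binary_crossentropy"})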
3.3. Training and Evaluation
To implement our networks we used Keras [24] with the TensorFlow [25] backend. We used mini-batching with batches of size 512 and the training phase stopped after 10 epochs without improvement. The results presented in the next section refer to the average (µ) and standard deviation (σ) of the results obtained over 10 runs. On each run, we performed 5-fold cross-validation using the folds defined in the first experiments on the DIHANA corpus [10, 11]. In terms of evaluation metrics we use accuracy. This metric is penalizing for the multi-label classification scenarios of Level 2 and Level 3, since it does not account for partial correctness [26]. However, due to space constraints and since we are focusing on the combined prediction of the three levels, we do not report results for specialized metrics.
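To make the strictness of this metric explicit for the multi-label levels, the following small NumPy sketch computes exact-match accuracy, in which a segment only counts as correct if its whole predicted label set matches the reference; the 0.5 decision threshold is an assumption.

# Exact-match accuracy over binary multi-label outputs: no credit for partially
# correct label sets.
import numpy as np

def exact_match_accuracy(probabilities, reference, threshold=0.5):
    predictions = (np.asarray(probabilities) >= threshold).astype(int)
    reference = np.asarray(reference).astype(int)
    return float(np.mean(np.all(predictions == reference, axis=1)))

# Example: exact_match_accuracy(level2_probabilities, level2_gold_labels)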
4. Results
In this section we present the results of our experiments on each level, as well as on the combination of the three levels. Both the results on Level 1 and on the combination of the three levels can be directly compared with those reported in [3]. However, that is not the case for the remaining levels. In [3], since we explored each level independently and the annotation scheme does not allow segments with a Level 1 label concerning dialog structuring or communication problems to have labels in the remaining levels, we did not consider those segments when training and evaluating the Level 2 and Level 3 classifiers. Contrarily, since in this study we use a single classifier to predict the labels for all levels, those segments are also considered.
We started by exploring the word- and character-based segment representation approaches, as well as their combination. Thus, in these experiments, we did not provide context information to the network and we used a single dimensionality reduction layer for the three levels. In Table 1 we can see that, as we have previously shown in [15], the character-based approach leads to better results than the word-based one on Level 1. Additionally, the results achieved by the word-based approach are above those reported in [15], which confirms that the word-level RCNN-based segment representation approach leads to better results than the CNN-based one we used in [15] and [3]. On the other hand, the slight performance decrease of the character-based approach can be explained by the combined prediction of the three levels, which does not allow the classifier to specialize in predicting Level 1 labels. On the remaining levels, the character-level approach still performs better. However, the difference is smaller than on Level 1, which is explained by the more prominent relation of the labels of these levels with specific words. Furthermore, the combination of both approaches leads to the best results on every level.

Table 1: Accuracy (%) results according to the segment representation approach.

Approach            Level 1 (µ σ)    Level 2 (µ σ)    Level 3 (µ σ)    All (µ σ)
Word-Based          92.18 .13        79.36 .18        79.35 .20        75.82 .17
Character-Based     95.31 .07        81.25 .47        81.24 .54        78.67 .51
Combined            95.64 .07        82.46 .20        82.44 .21        79.88 .17

By using per-level dimensionality reduction layers, the classifier is able to select the information that is most relevant for predicting the labels of each level. Thus, as shown in Table 2, this adaptation leads to improved results on the two bottom levels and on the combination of the three levels. However, the performance on Level 1 did not improve, which suggests that the combined segment representation captures more information concerning specific words in detriment of functional patterns relevant for the prediction of some Level 1 labels. Providing information concerning the output generated for the upper levels leads to further improvement, in line with that reported in [3] in spite of not using gold standard labels.
303
Table 2: Accuracy (%) results according to the dimensionality reduction approach.

Approach              Level 1 (µ σ)    Level 2 (µ σ)    Level 3 (µ σ)    All (µ σ)
Single Reduction      95.64 .07        82.46 .20        82.44 .21        79.88 .17
Per-Level Reduction   95.64 .05        83.21 .11        83.17 .18        80.23 .16
Output Waterfall      95.65 .05        83.29 .21        83.36 .19        80.49 .24

As stated in Section 2, context information from the preceding segments has been proved important in many studies on dialog act recognition. The results in Table 3 confirm this importance for all levels. However, there are different conclusions to draw depending on the level. Concerning the first level, we can see that considering the dialog history leads to an average accuracy improvement of 3.72 percentage points, which is above the 3.42 reported in [15]. Considering that the classifier fails to predict the correct Level 1 label for less than one percent of the segments, this is a relevant improvement which is explained by the representation of the dialog history in the form of a summary. In [3] we have shown that the dialog history is not relevant when only Level 3 is considered, since it refers to information that is explicitly referred to in the segment. However, in Table 3 we can see an average accuracy improvement of 12.86 percentage points on Level 3 when considering the dialog history. This is explained by the fact that the provided information concerns all levels and, as shown in [3], information from the preceding segments concerning the upper levels is relevant when predicting Level 3 labels. On the one hand, what is implicitly targeted at a given time is expected to be explicitly referred to in the future. Thus, there is a relation between the Level 2 labels of preceding segments and the Level 3 labels of the current one. On the other hand, the dialogs feature multiple question-answer pairs for which the labels on the lower levels are the same. Thus, when the Level 1 label of the preceding segment is Question, the Level 3 labels of that segment are typically present in the current segment as well. This relation between levels is further confirmed by the improved performance when using a single summary for the whole dialog history in comparison to when using independent per-level summaries.

Table 3: Accuracy (%) results when using context information from the preceding segments.

Approach             Level 1 (µ σ)    Level 2 (µ σ)    Level 3 (µ σ)    All (µ σ)
No Information       95.65 .05        83.29 .21        83.36 .19        80.49 .24
Single Summary       99.37 .01        96.15 .11        96.22 .15        95.53 .14
Per-Level Summary    99.34 .03        95.80 .11        95.87 .13        95.14 .14

As shown in previous studies, using information concerning speaker changes slightly improves the performance, up to 95.64% average accuracy on the combination of the three levels. More importantly, as discussed in [3], the system segments are scripted and, thus, are easier to predict than the user segments. Furthermore, a dialog system is aware of its own dialog acts and must only predict those of its conversational partners. As expected, the performance decreases if the classifier is trained and evaluated on user segments only. The average decrease on the combination of the three levels is 4.5 percentage points. However, on Level 1 it is of just .67 percentage points.
Since we use a single classifier to predict the labels for the three levels, there is no explicit restriction that segments with Level 1 labels concerning dialog structuring or communication problems cannot have labels in the remaining levels. However, if we post-process the results to enforce that restriction, the improvement on the combination of the three levels is of just .03 percentage points when considering all segments and .1 when considering user segments only. This shows that the network is able to learn that restriction based on the training examples.
Overall, the average accuracy of our best approach on the combination of the three levels is 95.67%. This result is 3.33 percentage points above the 92.34% we achieved in [3] using the hierarchical combination of independent classifiers for each level. Furthermore, it is even above the 93.98% achieved when considering the single-label simplification of the problem, which only considers the label combinations present in the corpus. This shows that the network is able to capture relevant relations between levels while still being able to identify the most important information for each level using the per-level dimensionality reduction layers.

5. Conclusions
In this paper we have presented our approach on dialog act recognition on the DIHANA corpus using an end-to-end classifier to predict the labels for the three levels defined in the annotation scheme. This way, the relations between levels are captured implicitly, contrarily to what happened in our previous approach on the task, which used independent per-level classifiers. Additionally, we have used approaches on segment and context information representation which have recently been proved more appropriate for the task.
First, we have shown that character-based segment representation also performs better than word-based representation on the multi-label classification problems and that the combination of both approaches surpasses each individual approach. In this sense, on the combination of all levels, the combined approach surpassed the word- and character-level approaches by around four and one percentage points, respectively.
Then, we have shown that it is important to have per-level dimensionality reduction layers in order to specialize the segment representation for each level. Additionally, the performance is improved when a cue for the hierarchical relation between the levels is provided by considering the output for the upper levels when predicting the labels for each level.
Furthermore, we have shown that the relation between levels is also important when providing context information concerning the dialog history, as a combined summary of the classifications of the preceding segments led to better results than three independent per-level summaries.
Finally, by providing information concerning speaker changes and forcing the segments with Level 1 labels concerning dialog structuring or communication problems to have no labels on the remaining levels, we achieved 95.67% accuracy on the combination of the three levels, which is over three percentage points above our previous approach and even surpasses the results achieved on the simplified single-label classification problem approached by previous studies.
As future work, we intend to explore how our approach can be adapted to perform automatic segmentation of the turns instead of relying on a priori segmentation and assess the impact on the overall dialog act recognition performance.

6. Acknowledgements
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.
304
7. References
[1] P. Král and C. Cerisara, "Dialogue Act Recognition Approaches," Computing and Informatics, vol. 29, no. 2, pp. 227–250, 2010.
[2] J.-M. Benedí, E. Lleida, A. Varona, M.-J. Castro, I. Galiano, R. Justo, I. L. de Letona, and A. Miguel, "Design and Acquisition of a Telephone Spontaneous Speech Dialogue Corpus in Spanish: DIHANA," in LREC, 2006, pp. 1636–1639.
[3] E. Ribeiro, R. Ribeiro, and D. M. de Matos, "Hierarchical Multi-Label Dialog Act Recognition on Spanish Data," Traitement Automatique des Langues (submitted to), vol. 59, no. 1, 2019.
[4] N. Kalchbrenner and P. Blunsom, "Recurrent Convolutional Neural Networks for Discourse Compositionality," in Workshop on Continuous Vector Space Models and their Compositionality, 2013, pp. 119–126.
[5] J. Y. Lee and F. Dernoncourt, "Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks," in NAACL-HLT, 2016, pp. 515–520.
[6] Y. Ji, G. Haffari, and J. Eisenstein, "A Latent Variable Recurrent Neural Network for Discourse Relation Language Models," in NAACL-HLT, 2016, pp. 332–342.
[7] H. Khanpour, N. Guntakandla, and R. Nielsen, "Dialogue Act Classification in Domain-Independent Conversations Using a Deep Recurrent Neural Network," in COLING, 2016, pp. 2012–2021.
[8] Y. Liu, K. Han, Z. Tan, and Y. Lei, "Using Context Information for Dialog Act Classification in DNN Framework," in EMNLP, 2017, pp. 2160–2168.
[9] E. Ribeiro, R. Ribeiro, and D. M. de Matos, "Deep Dialog Act Recognition using Multiple Token, Segment, and Context Information Representations," CoRR, vol. abs/1807.08587, 2018. [Online]. Available: http://arxiv.org/abs/1807.08587
[10] V. Tamarit and C.-D. Martínez-Hinarejos, "Dialog Act Labeling in the DIHANA Corpus using Prosody Information," in V Jornadas en Tecnología del Habla, 2008, pp. 183–186.
[11] C.-D. Martínez-Hinarejos, J.-M. Benedí, and R. Granell, "Statistical Framework for a Spanish Spoken Dialogue Corpus," Speech Communication, vol. 50, no. 11–12, pp. 992–1008, 2008.
[12] C. D. Martínez-Hinarejos, J.-M. Benedí, and V. Tamarit, "Unsegmented Dialogue Act Annotation and Decoding with N-Gram Transducers," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 198–211, 2015.
[13] B. Gambäck, F. Olsson, and O. Täckström, "Active Learning for Dialogue Act Classification," in INTERSPEECH, 2011, pp. 1329–1332.
[14] D. Jurafsky, E. Shriberg, and D. Biasca, "Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual," University of Colorado, Institute of Cognitive Science, Tech. Rep. Draft 13, 1997.
[15] E. Ribeiro, R. Ribeiro, and D. M. de Matos, "A Study on Dialog Act Recognition using Character-Level Tokenization," in AIMSA, 2018.
[16] ——, "The Influence of Context on Dialogue Act Recognition," CoRR, vol. abs/1506.00839, 2015. [Online]. Available: http://arxiv.org/abs/1506.00839
[17] N. Alcácer, J. M. Benedí, F. Blat, R. Granell, C. D. Martínez, and F. Torres, "Acquisition and Labelling of a Spontaneous Speech Dialogue Corpus," in SPECOM, 2005, pp. 583–586.
[18] C.-D. Martínez-Hinarejos, E. Sanchis, F. García-Granada, and P. Aibar, "A Labelling Proposal to Annotate Dialogues," in LREC, 2002, pp. 1566–1582.
[19] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent Convolutional Neural Networks for Text Classification," in AAAI, 2015, pp. 2267–2273.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in NIPS, 2013, pp. 3111–3119.
[21] C. Cardellino, "Spanish Billion Word Corpus and Embeddings," http://crscardellino.me/SBWCE/, 2016.
[22] J. Díez, O. Luaces, J. J. del Coz, and A. Bahamonde, "Optimizing Different Loss Functions in Multilabel Classifications," Progress in Artificial Intelligence, vol. 3, no. 2, pp. 107–118, 2015.
[23] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in ICLR, 2015.
[24] F. Chollet et al., "Keras: The Python Deep Learning Library," https://keras.io/, 2015.
[25] M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," https://www.tensorflow.org/, 2015.
[26] M. S. Sorower, "A Literature Survey on Algorithms for Multi-Label Learning," Oregon State University, Tech. Rep., 2010.
305
Index

Abad, Alberto, 281 Cunha, Conceicao, 137


Aguilar, Lourdes, 112 Císcar, Vicent Andreu, 257
Aksnes, Mari, 172
Alba-Castro, José Luis, 204 D’Haro, Luis Fernando, 55
Alvarez, Aitor, 102, 267 Darna-Sequeiros, Javier, 64
Alías Pujol, Francesc, 132 de Benito, Diego, 224
Antón, Josu, 68 de Córdoba, Ricardo, 55
Antón-Martín, Javier, 262 de Velasco, Mikel, 68
Arnela, Marc, 132 Dehak, Najim, 236
Arzelus, Haritz, 102, 267 Delgado, Héctor, 194, 211
Deroo, Olivier, 172
Barbany, Oriol, 30 Dias, Ana Larissa, 77
Barras, Claude, 194, 211 Docío-Fernández, Laura, 35, 82, 204, 240,
Batista, Cassio, 77 277
Benderitter, Ian, 72 Doetsch, Patrick, 257
Bernath, Conrad, 102, 267 Dominguez, Monica, 40
Bonafonte, Antonio, 20, 30, 117, 152, 170 Dugan, Nazim, 231, 272
Borbély, Gábor, 286
Bordel, Germán, 249 Escolano, Carlos, 60
Bredin, Hervé, 194, 211 Escudero-Mancebo, David, 97, 112, 157,
Burga, Alicia, 40 161
Esposito, Anna, 172
Cabello, Maria, 245 Espín, Juan Manuel, 159
Calvo de Lara, José R., 227 Eszter, Iklódi, 286
Calvo, Jose R., 254 Evans, Nicholas, 194, 211
Campbell, Edward L., 227
Cannings, Nigel, 231, 272 Farrús, Mireia, 20, 40
Cardeñoso-Payo, Valentín, 97, 112, 157, Fernández-Ruanova, Begoña, 172
161 Ferreiros, Javier, 55
Carrilero, Mikel, 68 Fischer, Volker, 216
Castan, Diego, 208 Fodge, Kimber, 291
Castro-Bleda, Maria Jose, 286 Fonollosa, José A. R., 60
Chollet, Gérard, 172, 231, 272 Font, Roberto, 159
Codina, Marta, 291 Frahm, Jens, 137
Cordasco, Gennaro, 172 Freixes, Marc, 132
Corrales-Astorgano, Mario, 112
Cross Vila, Laura, 60 Gallardo-Antolín, Ascensión, 15

307
García, Eneritz, 267 Lleida, Eduardo, 1, 6, 87, 220, 296
García-Mateo, Carmen, 35, 72, 277 Llombart, Jorge, 296
García-Perera, L. Paola, 236 Lozano-Diez, Alicia, 179, 224
Garrido, Juan-María, 291 López Otero, Paula, 82, 240
Ghahabi, Omid, 184, 216 López-Espejo, Iván, 142
Gilbert, James M., 163 López-Gonzalo, Eduardo, 262
Gimeno, Pablo, 87, 220
Giménez, Adrià, 257 Machuca, María J., 97
Glackin, Cornelius, 172, 231, 272 Martinez Hinarejos, Carlos David, 92, 127,
Golik, Pavel, 257 174, 267
Gomez Suárez, Javier, 166 Martins de Matos, David, 281, 301
Gomez, Angel M., 45, 142 Martins, Paula, 137
Gomez-Alanis, Alejandro, 45 Martín-Doñas, Juan M., 142
Gonzalez-Dominguez, Javier, 179 Martínez, Carlos David, 102
Gonzalez-Rodriguez, Joaquin, 179 Martínez-Castilla, Pastora, 112
González López, José Andrés, 45, 117, 163 Martínez-Villaronga, Adrià, 257
González-Ferreras, César, 112 Maurice, Benjamin, 194
Gordeeveva, Olga, 172 McLaren, Mitchell, 208
Granell, Emilio, 92, 174, 267 Miguel, Antonio, 1, 6, 87, 220, 296
Green, Phil D., 163 Mingote, Victoria, 1
Guasch, Oriol, 132 Montalvo, Ana R., 254
Guinaudeau, Camille, 194 Montenegro, César, 172
Gully, Amelia, 163 Moreno, Asuncion, 170
Morros, Josep Ramon, 199
Hernaez, Inma, 50, 107, 122, 166 Municio Martín, Jose Antonio, 166
Hernandez, Gabriel, 227 Murphy, Damian, 163
Hernando, Javier, 10, 199
Nandwana, Mahesh Kumar, 208
Hernández-Gómez, Luis A., 262
Navas, Eva, 50, 107, 122, 147, 166
Huang, Zili, 236
Ney, Hermann, 257
India Massana, Miquel Angel, 199
Odriozola, Igor, 50
Inglés-Romero, Juan Francisco, 159
Oliveira, Catarina, 137
Jauk, Igor, 170, 189 Ortega, Alfonso, 1, 6, 87, 220, 296
Jorge, Javier, 257
Palau, Ponç, 199
Joseph, Arun, 137
Parcheta, Zuzanna, 127
Juan, Alfons, 257
Pascual, Santiago, 30, 117, 152
Justo, Raquel, 68, 172
Patino, Jose, 194, 211
Khan, Umair, 10 Pavão Martins, Isabel, 281
Khosravani, Abbas, 231 Peinado, Antonio M., 45, 142
Kimura, Takuya, 97 Peláez-Moreno, Carmen, 15
Kyslitska, Daria, 172 Pereira, Victor, 170
Külebi, Baybars, 25 Perero-Codosero, Juan M., 262
Peñagarikano, Mikel, 249
Labrador, Beltran, 224 Piñeiro-Martín, Andrés, 35
Lindner, Fred, 172 Pompili, Anna, 281

308
Povey, Daniel, 236 Sayrol, Elisa, 199
Schlögl, Stephan, 172
R. Costa-Jussà, Marta, 60 Serrano, Luis, 50, 107, 122, 147
Raman, Sneha, 107, 122 Serrà, Joan, 117, 152
Ramirez, Jose M., 254 Silva, Samuel, 137
Ramirez, Pablo, 224 Silvestre-Cerdà, Joan Albert, 257
Ramos-Muguerza, Eduardo, 204 Socoró, Joan Claudi, 132
Recski, Gábor, 286
Reiner, Miriam, 172 T. Toledano, Doroteo, 64, 224, 245
Ribeiro, Eugénio, 301 Tapias Merino, Daniel, 262
Ribeiro, Ricardo, 301 Tarrés, Laia, 170
Rituerto-González, Esther, 15 Tavarez, David, 122, 147
Roble, Alejandro, 254 Teixeira, António, 137
Rodríguez-Fuentes, Luis J., 249 Tejedor, Javier, 245
Romero, Verónica, 92, 174 Tejedor-García, Cristian, 97, 157
Ríos, Antonio, 97 Tenorio-Laranga, Jofre, 172
Tilves Santiago, Darío, 72
S. Kornes, Maria, 172 Torres, M. Inés, 68, 172
Safari, Pooyan, 10
Sagastiberri, Itziar, 199 Varona, Amparo, 249
Salamea, Christian, 55 Vicente-Chicote, Cristina, 159
Sampaio Neto, Nelson, 77 Villalba, Jesús, 236
San-Segundo, Rubén, 55 Viñals, Ignacio, 6, 87, 220
Sanchez, Jon, 50
Sanchis, Albert, 257 Wanner, Leo, 40
Santana, Riberto, 172 Yin, Ruiqing, 194, 211
Sarasola, Xabier, 122, 147
Saratxaga, Ibon, 122, 147 Öktem, Alp, 20, 25

309
