<jitter rate="10"/> <articulation>
As modifications to display happiness, the pitch contour is assigned the so-called "wave model" (a fluent up-and-down contour between stressed syllables, see [4] for details) and the duration of the voiceless fricatives is lengthened by 40%. At the same time, the phonation and articulation parameters are altered according to the emotion model defined for sadness, i.e. jitter is added, the vocal effort is set to "soft" and the articulation target values are set to "undershoot".
To generate test samples for evaluation in a systematic fashion, each of Darwin's four "basic emotions" (joy, sadness, fear and anger) was combined with all other emotions and used as primary as well as secondary emotional state. As a reference we added neutral versions, but did not combine neutral with the emotional states. This resulted in 17 conditions (4 emotions by 4, plus neutral). The target phrases were taken from the Berlin emotional database EmoDB [5]; we used two short and two longer ones. All target phrases were synthesized with a male and a female Mbrola German voice (de6 and de7). The resulting number of samples was thus 136 (17 × 4 × 2).
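The combinatorics of the stimulus inventory described above can be enumerated as in the sketch below. Only the counts and the voice names (de6, de7) come from the text; the phrase placeholders and all variable names are illustrative assumptions.

```python
# Hypothetical sketch of the stimulus inventory; phrase names are placeholders.
from itertools import permutations

emotions = ["joy", "sadness", "fear", "anger"]

conditions = list(permutations(emotions, 2))            # 12 ordered pairs of distinct emotions
conditions += [(e, e) for e in emotions]                # 4 "full" single emotions
conditions += [("neutral", "neutral")]                  # neutral reference
assert len(conditions) == 17

phrases = ["phrase1", "phrase2", "phrase3", "phrase4"]  # 2 short + 2 long EmoDB phrases
voices = ["de6", "de7"]                                 # male and female Mbrola German voices

stimuli = [(c, p, v) for c in conditions for p in phrases for v in voices]
print(len(stimuli))                                     # 17 * 4 * 2 = 136
```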
4 Perception Experiment
In a forced-choice listening experiment, 32 listeners (16 males, 16 females, 20–39 years old, mean = 27.26, standard deviation = 3.75) assigned all stimuli to one of the four emotions or "neutral". A second rating was asked for as an "alternative" categorization. The "neutral" category was introduced as a default in case of uncertainty. The evaluation was done with the Speechalyzer toolkit [7]. For playback of the stimuli in randomized order, AKG K-601 headphones were used. A single session took about 40 min.
A validation of the full emotions (256 ratings per category) confirmed the synthesis quality for basic emotions, as all five synthesized categories are labeled on average with 52.4% as intended (see Table 1).

Table 1. Confusion matrix for the single basic emotions only. Primary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Neutr.  Sadn.   F1
Anger      .496*   .156    .117    .211    .020    .536
Fear       .223    .367*   .180    .133    .098    .411
Joy        .066    .180    .383*   .320    .051    .435
Neutral    .043    .039    .082    .582*   .254    .488
Sadness    .023    .043    .000    .141    .793*   .716
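As a quick arithmetic check, the 52.4% average quoted above is the mean of the diagonal of Table 1 (the per-category recall); a minimal verification:

```python
# Average of the diagonal cells of Table 1 (per-category recall).
recalls = {"anger": 0.496, "fear": 0.367, "joy": 0.383, "neutral": 0.582, "sadness": 0.793}
print(sum(recalls.values()) / len(recalls))   # 0.5242 -> 52.4%
```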
The intended complex emotions were categorized with a primary label 3072 times. Excluding all full single emotions, and thus also all primary ratings for "neutral", resulted in 2244 answers. The complex emotions as intended with set 1 (prosody) are recognized most frequently; however, anger is equally often confused with fear (Table 2). A similar confusion matrix for the second intended emotion (voice quality, articulation), however, shows no identification by the listeners except for anger (Table 3). The alternative ratings are dominantly "neutral", indicating difficulties in assigning two separate emotions to the stimuli (Tables 4 and 5).
The remaining data without any "neutral" responses, i.e. actually assigned to the four emotions in question, account for only 38% of the 3072 responses. Still, systematic results are visible (Table 6): within the limits of the responses that actually contain a secondary emotion, combinations of anger and fear as well as fear and sadness are dominantly classified regardless of the assignment of emotions to the feature sets. Joy combined with fear is most often correctly rated when joy is synthesized with prosodic information. In sum, fear was the best performing emotion to be combined with others. Interestingly, all confusions had one emotion in common, whereas the other was dominantly replaced with fear.
5 Discussion
The pure emotions were all recognized above chance. Results for the complex emotions indicate that the prosodic parameters significantly elicit the intended emotion, whereas the second bundle (voice quality and articulation precision) reveals mixed results, even for the primary rating. In particular, the secondary rating was dominantly "neutral". Nevertheless, when analyzing the pairs of non-neutral ratings, the intended complex emotions including fear work especially
well. Even the confusion patterns for the other targets show systematic effects in favor of fear, always retaining one of the intended emotions independently of the feature bundle. Therefore, these results most likely originate in the quality of the material and the evaluation method at the current state of synthesizing complex emotions, and cannot be taken to indicate invalidity of the concept of complex emotions.

Table 2. Confusion matrix for the emotions synthesized with prosody (Emotion Set 1). Primary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Sadness  F1
Anger      .375*   .337    .239    .049     .3866
Fear       .173    .518*   .202    .108     .4977
Joy        .248    .206    .427*   .119     .4340
Sadness    .184    .085    .031    .700*    .6976
Table 3. Confusion matrix for the emotions synthesized with voice quality and articulation (Emotion Set 2). Primary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Sadness  F1
Anger      .343*   .222    .215    .220     .346
Fear       .336*   .176    .238    .250     .163
Joy        .168    .325*   .199    .308     .214
Sadness    .130    .483*   .247    .140     .141
Whereas the results are promising, the ultimate aim to validly synthesize two emotions simultaneously was not fully reached. Apparently, some emotions dominate the perception (fear), and the salience or quality of synthesis does not seem to be equally distributed over the two feature bundles.

Table 4. Confusion matrix for the emotions synthesized with prosody (Emotion Set 1). Secondary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Neutr.  Sadn.   F1
Anger      .123    .196    .066    .511*   .104    .176
Fear       .136    .202    .097    .392*   .173    .277
Joy        .090    .194    .100    .498*   .117    .177
Sadness    .116    .211    .035    .525*   .112    .115
Table 5. Confusion matrix for the emotions synthesized with voice quality and articulation (Emotion Set 2). Secondary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Neutr.  Sadn.   F1
Anger      .151    .201    .082    .435*   .131    .206
Fear       .104    .234    .067    .475*   .120    .283
Joy        .108    .189    .072    .521*   .110    .136
Sadness    .106    .178    .081    .481*   .153    .172
Table 6. Confusion matrix for the complex emotions, separated for prosodic and non-prosodic feature order. Primary and secondary ratings pooled, in % divided by 100 (rows: intended combination, the first emotion synthesized with prosody and the second with voice quality/articulation; columns: pooled dual-rating pairs; highest value per row marked with *).

Complex emotion   Anger:Fear  Anger:Joy  Anger:Sadn.  Fear:Joy  Fear:Sadn.  Joy:Sadn.
Anger-Fear        .461*       .113       .174         .148      .087        .017
Fear-Anger        .424*       .094       .079         .180      .180        .043
Anger-Joy         .418*       .154       .088         .143      .164        .033
Joy-Anger         .308*       .288       .144         .115      .077        .067
Anger-Sadness     .420*       .037       .074         .247      .198        .025
Sadness-Anger     .067        .053       .400         .000      .413*       .067
Fear-Joy          .195        .076       .042         .288      .373*       .025
Joy-Fear          .181        .108       .072         .349*     .205        .084
Fear-Sadness      .227        .034       .034         .227      .445*       .034
Sadness-Fear      .070        .020       .320         .020      .480*       .090
Joy-Sadness       .108        .054       .068         .243      .324*       .203
Sadness-Joy       .057        .014       .200         .000      .629*       .100
From a methodological point of view, hiding the true aim while assessing two emotions per stimulus seemed to be difficult. However, asking for only one emotion and analyzing the frequencies of replies would require comparable perceptual salience of each emotion involved. Fortunately, judging from conversations with the participants and the high proportion of neutral second ratings, the cover story of asking for a first and an alternative impression worked. As an alternative, openly asking for a mixture of emotions risks inducing effects of social desirability; this might still allow for testing the quality of synthesizing stereotypical emotion combinations, but not for testing the validity of the complex emotions. Therefore, a more sophisticated evaluation paradigm involving social situations in which complex emotions do occur might be more meaningful.
6 Conclusions and Outlook
We described an approach to simultaneously simulate primary and secondary emotional expression in synthesized speech. The approach is based on the combination of different parameter sets within the open-source system "Emofilt", which utilizes the diphone synthesizer "Mbrola". The technique was evaluated in a perception experiment, which showed only partial success. The ultimate aim to validly synthesize two emotions simultaneously was not fully reached, but, as the results are promising, the synthesis quality, especially for voice quality and articulation, needs to be optimized in order to establish comparable strength and naturalness of the emotions over both feature bundles. Especially the simulation of articulation precision, which is done by replacing centralized phonemes with decentralized ones and vice versa [4], could be enhanced by using a different synthesis technique. Data-based synthesis (like diphone synthesis or non-uniform unit-selection synthesis) is not well suited for manipulations of articulation precision or voice quality; in this respect the simulation rules based on prosodic manipulation (set 1) were of course more effective. As unrestricted text-to-speech synthesis is not essential while this is still predominantly a research topic, one possibility would be to use articulatory synthesis, where the parameter sets can be modeled more elaborately by rules. After such optimizations have been tested for quality, an improved evaluation methodology should be applied to study the validity of complex emotions synthesized with "Emofilt".
The approach was successful for emotions that are neighbors in the emotional space spanned by the PAD dimensions pleasure, arousal and dominance. For example, the combinations of sadness and anger as well as fear and sadness share two of the three dimensions and were recognized by the majority of the judges. For future work, one possibility would be to try combinations of emotions that listeners can envisage more easily than a systematic variation, for example by embedding the test sentences into situations that are appropriate for the targeted emotion mix. It would also be interesting to investigate the acoustic manifestation of mixed emotions by analyzing natural data, for example the Vera am Mittag corpus [10]. As this corpus consists of real-life emotional expression occurring in a TV show, mixed emotions are very likely to occur. A set of clear representations would have to be identified in a new labeling process and then analysed for acoustic properties. The outcomes could then be synthesized to validate the findings in a more controlled environment.
References 1. Barra-Chicote, R., Yamagishi, J., King, S., Monero, J.M., Macias-Guarasa, J.: Analysis of statistical parametric and unit-selection speech synthesis systems applied to emotional speech. Speech Commun. 52(5), 394–404 (2010) 2. Berrios, R., Totterdell, P., Kellett, S.: Eliciting mixed emotions: a meta-analysis comparing models, types, and measures. Front. Psychol. 6, 428 (2015) 3. Burkhardt, F.: Simulation emotionaler Sprechweise mit Sprachsynthesesystemen. Shaker (2000) 4. Burkhardt, F.: Emofilt: the simulation of emotional speech by prosody transformation. In: Proceedings of Interspeech. Lisbon (2005) 5. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Proceedings of Interspeech. Lisbon (2005) 6. Burkhardt, F.: An affective spoken story teller. In: Proceedings of Interspeech. Florence (2011) 7. Burkhardt, F.: Fast labeling and transcription with the speechalyzer toolkit. In: Proceedings of LREC (Language Resources Evaluation Conference), Istanbul (2012) 8. Du, S., Tao, Y., Martinez, A.: Compound facial expressions of emotion. Proc. Natl. Acad. Sci. 111(15), E1454–62 (2014) 9. Dutoit, T., Pagel, V., Pierret, N., Bataille, F., Van der Vreken, O.: The MBROLA project: towards a set of high-quality speech synthesizers free of use for noncommercial purposes. In: Proceedings of ICSLP 1996, Philadelphia, vol. 3, pp. 1393–1396 (1996) 10. Grimm, M., Kroschel, K., Narayanan, S.: The Vera am Mittag German audio-visual emotional speech database. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hannover (2008) 11. Latorre, J., et al.: Speech factorization for HMM-TTS based on cluster adaptive training. In: Proceedings of Interspeech. Portland (2012) 12. Lee, Y., Rabiee, A., Lee, S.: Emotional end-to-end neural speech synthesizer. CoRR (2017) 13. Martin, J.C., Niewiadomski, R., Devillers, L., Buisine, S., Pelachaud, C.: Multimodal complex emotions: gesture expressivity and blended facial expressions. Int. J. Humanoid Rob. 3, 269–292 (2006) 14. Murray, I.R., Arnott, J.L.: Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. JASA 93(2), 1097–1107 (1993) 15. Schr¨ oder, M.: Emotional speech synthesis - a review. In: Proceedings of Eurospeech 2001, Aalborg, pp. 561–564 (2001) 16. Schr¨ oder, M., Trouvain, J.: The German text-to-speech synthesis system mary: a tool for research, development and teaching. Int. J. Speech Technol. 6, 365–377 (2003) 17. Tachibana, M., Yamagishi, J., Masuko, T., Kobayashi, T.: Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Trans. Inf. Syst. 88(11), 2484–2491 (2005) 18. Williams, P., Aaker, J.: Can mixed emotions peacefully coexist? J. Consum. Res. 28(4), 636–649 (2002)
An Approach to Automatic Summarization of Television Programs
Marco Canora, Fernando García-Granada, Emilio Sanchis, and Encarna Segarra
Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain
[email protected], {fgarcia,esanchis,esegarra}@dsic.upv.es
Abstract. In this paper we present an approach to document summarization based on unsupervised techniques. We study the adequacy of these techniques for documents in which many topics of different duration are present, in our case the transcriptions of Spanish TV programs. The paper compares a classical Latent Semantic Analysis approach to a new proposal based on Latent Dirichlet Allocation. We also study the application of the summarization process to the different segments obtained in a previous topic segmentation step. The topic segmentation is performed by considering distances between paragraphs, which are represented by means of continuous vectors obtained from the words contained in them. Experiments on TV programs of political and miscellaneous news have been performed.
Keywords: Document summarization · Latent Semantic Analysis · Latent Dirichlet Allocation

1 Introduction
Multimedia content summarization has become an important issue in recent years. Due to the great amount of information available on the web, it is necessary to have tools that help users digest these contents in an easy way. For this reason, summarization techniques are a current goal in Natural Language research [9,14]. Traditionally, summarization methods are classified into two categories: extractive and abstractive. Extractive approaches consist of detecting the most salient sentences, and the generated summary is composed of those sentences, while abstractive approaches try to be more similar to human summaries and generate new sentences that may not be in the original document. Although these latter approaches are, logically, a more ambitious challenge, recent works have shown promising expectations [3,11]. In the framework of extractive approaches, most systems are based on unsupervised learning models. This is the case of Latent Semantic Analysis (LSA) [7] or graph-based methods [4]. Other systems are based on supervised methods such as Recurrent Neural Networks [3], Conditional Random Fields (CRFs) [13], or Support Vector Machines (SVM) [5].
The organization of evaluation competitions has been an important help for the development of this area. This is the case of the DUC (https://duc.nist.gov/) and TAC (https://tac.nist.gov/) conferences, which have become a forum to compare the different approaches. To this end, evaluation corpora have been developed that can be used not only for test purposes but also for training models. Some of the most popular corpora in summarization tasks are the corpus used in DUC and the CNN/DailyMail corpus; the latter has been widely used for learning models in neural network approaches [3].
Other authors have explored summarization considering audio documents as input [6]. This task has the additional problem of dealing with different kinds of errors, such as speech recognition errors and errors in the punctuation of sentences. Moreover, some expressions that appear due to the characteristics of spontaneous speech must be specifically processed, since they might not be relevant for the summary.
In this work, we present an extractive approach to document summarization based on unsupervised techniques, in particular Latent Dirichlet Allocation (LDA) [2]. This approach can be considered topic-based, because topics can be automatically detected and used to determine the most salient sentences according to the topics that appear in the document. Another issue addressed in this work is the summarization of TV programs, in particular a news magazine. Some characteristics of this task pose specific challenges for summarization. Apart from the speech recognition problems, which are not considered in this work, the most interesting problem is that this kind of program has a very variable structure, and usually many topics of different duration are present. We have studied two strategies of summarization: in the first one, the transcription of the program is the input to the summarization system; in the second one, the program is first segmented, and the final summary is obtained from the concatenation of the summaries of each segment. We have performed experiments on Spanish TV programs in order to study the behavior of the proposed techniques.
The paper is organized as follows. In Sect. 2, the different methodologies developed are described. In Sect. 3, a description of the system architecture is presented. In Sect. 4, we show the characteristics of the corpus. In Sect. 5, we present the experimental results, and in Sect. 6, the conclusions and future works are presented.
2 System Description
Given a document, considered as a set of sentences, the objective of an extractive summarization technique is to assign to each sentence a weight that represents its relevance. From this ranked set of sentences the system selects the top ones in order to build the summary.
2.1 Latent Semantic Analysis
Many unsupervised summarization systems are based on LSA. This technique makes it possible to extract representative sentences for the topics automatically detected in the documents. This is done by applying the singular value decomposition to the word-by-sentence matrix: given the word-sentence matrix C, the singular value decomposition generates the matrices U, Σ and V^T,

    C = U Σ V^T,

where V^T represents the association of the underlying topics to the sentences. From this decomposition there are different ways of assigning weights to sentences and then selecting the ones to appear in the summary: some are based on the most salient sentence for each topic, others on a combination of the results of the matrix decomposition. We have chosen the Cross method, which permits extracting more than one sentence associated with the most important topics [12].
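As an illustration of this family of methods, the sketch below scores sentences from the SVD of a word-by-sentence count matrix. It is a generic approximation under assumed data, not the exact Cross method of [12]; the number of topics k and the toy matrix are arbitrary choices.

```python
# Minimal LSA-style sentence scoring sketch (not the actual Cross method).
import numpy as np

def lsa_sentence_scores(C: np.ndarray, k: int = 3) -> np.ndarray:
    """C has shape (n_words, n_sentences); returns one salience score per sentence."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Vt_k = Vt[:k, :]                              # topic-by-sentence rows for the k strongest topics
    return (s[:k, None] * np.abs(Vt_k)).sum(axis=0)   # weight topics by singular values, sum per sentence

C = np.random.rand(50, 10)                        # toy data: 50 word types, 10 sentences
scores = lsa_sentence_scores(C)
summary_idx = np.argsort(scores)[::-1][:2]        # indices of the 2 highest-scoring sentences
```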
2.2 Latent Dirichlet Allocation
Another way of discovering hidden topics in documents is the LDA approach. This methodology has been successfully used for topic identification, and can also be used for summarization purposes. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes a generative process for each document in a corpus, given the a-priori parameters α and β that characterize the probabilistic distributions: for each word in a document, a topic is chosen from the multinomial distribution of topics, and then a word is chosen from the multinomial distribution of words conditioned on the selected topic. In order to use LDA, it is necessary to compute the posterior distribution of the hidden variables given a document; one of the most popular approaches to do this is Gibbs sampling. Once the process is done for a fixed number of topics, two matrices are obtained: one represents the probability that a topic appears in a document, and the other represents the probability that a word belongs to a topic (the word-topic matrix). Once these matrices have been obtained, we use the word-topic matrix to assign a weight to each word in a sentence. From this information we obtain a sentence-topic matrix that is the input to an adaptation of the Cross method used in the LSA approach.
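A comparable sketch with scikit-learn shows how a word-topic matrix obtained from LDA can be turned into rough sentence weights. The weighting here is a simple stand-in for the authors' adaptation of the Cross method, and the toy sentences, the number of topics and all variable names are assumptions of this illustration.

```python
# Hedged sketch: LDA word-topic matrix -> sentence weights (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

sentences = ["el presidente visita valencia",
             "nueva exposicion de arte",
             "la exposicion abre manana"]

vec = CountVectorizer()
X = vec.fit_transform(sentences)                       # sentence-by-word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

word_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # approx. P(word | topic)
sent_topic = np.asarray(X @ word_topic.T)              # rough sentence-by-topic weights
scores = sent_topic.max(axis=1)                        # salience of each sentence
```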
2.3 Document Segmentation
Sometimes, as in our case, the documents to be summarized are long and heterogeneous, that is, they are composed of different sections, each one focused on a different subject. For this reason it can be convenient to split the document into different pieces, which is known as topic segmentation.
The approach that we have developed consists of obtaining vector representations of two consecutive paragraphs and defining a distance between these vectors to decide whether they belong to the same topic or to different topics. An overlapping sliding window of paragraphs across the document then provides the distances between pairs of consecutive paragraphs: at the end of each sentence we calculate the distance between the previous n sentences and the following n sentences. The length of the sliding window is determined experimentally. In order to represent the paragraphs, a semantics-based approach was used, in particular a Word2vec representation [10]. To do this, it was necessary to learn the Word2vec vectors from a large corpus; this was done from Wikipedia articles in Spanish. Once the word representations were obtained, each paragraph was represented by the sum of the vectors of the words contained in it. The measure used to determine the distance between consecutive paragraphs was the cosine distance.
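The sliding-window comparison can be sketched as follows. The window size n, the vector dimension and the pretrained w2v lookup (e.g. word vectors trained on Spanish Wikipedia, as a dict or a gensim KeyedVectors object) are assumptions of this illustration, not project code.

```python
# Sketch of boundary scoring: sum word vectors per window, compare windows with cosine distance.
import numpy as np

def window_vector(sentences, w2v, dim=300):
    v = np.zeros(dim)
    for sent in sentences:
        for word in sent.split():
            if word in w2v:
                v += w2v[word]
    return v

def boundary_scores(sentences, w2v, n=3):
    """Cosine distance between the n sentences before and after each candidate boundary."""
    scores = []
    for i in range(n, len(sentences) - n + 1):
        left = window_vector(sentences[i - n:i], w2v)
        right = window_vector(sentences[i:i + n], w2v)
        denom = np.linalg.norm(left) * np.linalg.norm(right) or 1.0   # guard against zero vectors
        scores.append(1.0 - float(left @ right) / denom)
    return scores
```

High-scoring positions are candidate topic boundaries.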
3 System Architecture
We have explored different approaches to the problem of summarization. Figure 1 shows the architecture of the first system: the documents are the input to the LSA or LDA process, and the matrix obtained is the input to the Cross method.
Fig. 1. Architecture of the system.
Figure 2 shows the architecture of the summarization system when a previous phase of topic segmentation is performed: first, the documents are segmented and each segment is summarized; then, these topic-dependent summaries are concatenated in order to generate the final summary.
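Schematically, this decoupled pipeline can be expressed as in the sketch below; segment_topics and summarize stand for the segmentation and the LSA/LDA summarizer described in Sect. 2 and are assumed callables, not actual project code.

```python
# Schematic sketch of the segment-then-summarize pipeline of Fig. 2.
def summarize_with_segmentation(sentences, segment_topics, summarize, ratio=0.2):
    """Summarize each detected topic segment, then concatenate the partial summaries."""
    summary = []
    for segment in segment_topics(sentences):      # list of lists of sentences
        n = max(1, round(len(segment) * ratio))    # keep roughly 20% of each segment
        summary.extend(summarize(segment, n))
    return summary
```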
4 Corpus Description
The corpus consists of seven Spanish TV news programs including some miscellaneous topics, such as music, gastronomy, culture, etc. We used the correct transcriptions of the speech, in particular the screenplay of the program presenter. It should be noted that the structure of these programs is very heterogeneous: sometimes a sequence of short news items of different topics, of one or two sentences each, is followed by a long sequence of sentences related to a single topic (for example a musical group that presents a new disc, even including interviews with the musicians). Some characteristics of the corpus are shown in Table 1. In order to evaluate the results, a reference summary amounting to 20% of the original document was manually built by an expert for each document.

Fig. 2. Architecture of the system with a previous topic segmentation phase.

Table 1. Corpus characteristics.

Total number of words                         27,881
Average number of words per TV program         3,983
Number of words of the shortest TV program     2,924
Number of words of the longest TV program      4,980

5 Experiments
Two series of experiments were carried out. The first one consisted of the application of both methodologies, LSA and LDA, to the set of documents; the second one was the application of the same methodologies with the previous topic segmentation process. We have used different ROUGE [8] measures to evaluate the summaries. The ROUGE metrics include: ROUGE-n, which measures the overlap of n-grams between the system and reference summaries; ROUGE-L, based on the Longest Common Subsequence (LCS); ROUGE-W, a weighted LCS-based statistic; ROUGE-S, a skip-bigram based co-occurrence statistic; and finally ROUGE-SU, a skip-bigram plus unigram-based
co-occurrence statistic. The most widely used in the literature are ROUGE-1, ROUGE-2 and ROUGE-L.
The results of applying LDA and LSA directly to the transcriptions of the programs are shown in Tables 2 and 3, respectively. The results show that both methods behave well and that there is no relevant difference between them. This can be explained by the fact that both approaches are based on the underlying topics of the documents, although each of them has its particular way of modeling the semantics of the document. Tables 4 and 5 show the results when a previous segmentation was done; the pk value [1] of the segmentation was 0.59. It should be noted that the systems with a previous segmentation do not outperform the direct application of the proposed methodologies to the whole document. This can be explained by the fact that the topic segmentation approach is based on a decoupled architecture, which is very sensitive to errors in the first phase of the process: the errors are transmitted to the following phases, in our case the summarization.

Table 2. Evaluation using LDA.

              Recall    Precision  F1
ROUGE-1       0.57134   0.59537    0.58298
ROUGE-2       0.28718   0.29915    0.29299
ROUGE-3       0.22941   0.23884    0.23399
ROUGE-4       0.21471   0.22352    0.21899
ROUGE-L       0.53478   0.55706    0.54558
ROUGE-W-1.2   0.13903   0.27932    0.18561
ROUGE-S*      0.29909   0.32546    0.31145
ROUGE-SU*     0.29976   0.32615    0.31213
Table 3. Evaluation using LSA.

              Recall    Precision  F1
ROUGE-1       0.58019   0.60525    0.59232
ROUGE-2       0.27962   0.29257    0.28588
ROUGE-3       0.20853   0.21844    0.21333
ROUGE-4       0.18838   0.19743    0.19275
ROUGE-L       0.52826   0.55124    0.53938
ROUGE-W-1.2   0.13183   0.26544    0.17612
ROUGE-S*      0.30823   0.33603    0.32124
ROUGE-SU*     0.30890   0.33672    0.32192
Table 4. Evaluation using LDA when a previous topic segmentation is done.

              Recall    Precision  F1
ROUGE-1       0.51899   0.54040    0.52937
ROUGE-2       0.22402   0.23387    0.22879
ROUGE-3       0.16544   0.17291    0.16905
ROUGE-4       0.15117   0.15808    0.15452
ROUGE-L       0.48046   0.50050    0.49017
ROUGE-W-1.2   0.12027   0.24191    0.16062
ROUGE-S*      0.25231   0.27433    0.26264
ROUGE-SU*     0.25297   0.27501    0.26331
Table 5. Evaluation using LSA when a previous topic segmentation is done.

              Recall    Precision  F1
ROUGE-1       0.51915   0.54059    0.52954
ROUGE-2       0.22379   0.23363    0.22856
ROUGE-3       0.16549   0.17296    0.16911
ROUGE-4       0.15133   0.15825    0.15468
ROUGE-L       0.48154   0.50137    0.49114
ROUGE-W-1.2   0.11991   0.24106    0.16012
ROUGE-S*      0.25372   0.27567    0.26401
ROUGE-SU*     0.25437   0.27635    0.26468

6 Conclusions
In this paper we have presented an approach to the summarization of Spanish TV programs. It is based on unsupervised methods and is especially oriented to documents with a heterogeneous structure, that is, documents that contain many topics with very different durations. Two approaches based on underlying topic detection have been explored: the first one applies the methods directly to the document, while the second one includes a previous phase of topic segmentation. The results show that both approaches provide good results and have a similar behavior. As future work, we will try to improve the segmentation-based approach by developing mechanisms to transmit more than one segmentation hypothesis to the summarization phase; this way, the errors generated by the first phase could be recovered during the summarization process. It could also be interesting to develop another way of combining the summaries of the detected segments, instead of a straightforward concatenation.
Acknowledgments. This work has been partially supported by the Spanish MINECO and FEDER funds under project AMIC: Affective Multimedia Analytics with Inclusive and Natural Communication (TIN2017-85854-C4-2-R).
References 1. Beeferman, D., Berger, A., Lafferty, J.: Statistical models for text segmentation. Mach. Learn. 34(1), 177–210 (1999) 2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003). http://dl.acm.org/citation.cfm?id=944919.944937 3. Cheng, J., Lapata, M.: Neural summarization by extracting sentences and words. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, 7–12 August 2016, Berlin, Volume 1: Long Papers (2016) 4. Erkan, G., Radev, D.R.: Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res. 22(1), 457–479 (2004) 5. Fuentes, M., Alfonseca, E., Rodr´ıguez, H.: Support vector machines for queryfocused summarization trained and evaluated on pyramid data. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL 2007, pp. 57–60. Association for Computational Linguistics, Stroudsburg (2007). http://dl.acm.org/citation.cfm?id=1557769.1557788 6. Furui, S., Kikuchi, T., Shinnaka, Y., Hori, C.: Speech-to-text and speech-to-speech summarization of spontaneous speech. IEEE Trans. Speech Audio Process. 12(4), 401–408 (2004) 7. Gong, Y., Liu, X.: Generic text summarization using relevance measure and latent semantic analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2001, pp. 19–25. ACM, New York (2001). https://doi.org/10.1145/383952.383955 8. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: MarieFrancine Moens, S.S. (ed.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona (2004) 9. Lloret, E., Palomar, M.: Text summarisation in progress: a literature review. Artif. Intell. Rev. 37(1), 1–41 (2012) 10. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013) 11. Nallapati, R., Zhai, F., Zhou, B.: Summarunner: a recurrent neural network based sequence model for extractive summarization of documents. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, , San Francisco, 4–9 February 2017, pp. 3075–3081 (2017) 12. Ozsoy, M.G., Cicekli, I., Alpaslan, F.N.: Text summarization of turkish texts using latent semantic analysis. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, pp. 869–876. Association for Computational Linguistics, Stroudsburg (2010). http://dl.acm.org/citation.cfm? id=1873781.1873879 13. Shen, D., Sun, J.T., Li, H., Yang, Q., Chen, Z.: Document summarization using conditional random fields. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI 2007, pp. 2862–2867 (2007) 14. Tur, G., De Mori, R.: Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. Wiley, New York (2011)
The Prosody of Discourse Markers alors and et in French: A Corpus-Based Study on Multiple Speaking Styles
George Christodoulides
Language Sciences and Metrology Unit, Université de Mons, Place du Parc 18, 7000 Mons, Belgium
[email protected]
Abstract. In this study, we investigate the prosodic characteristics of two French discourse markers (DMs), alors and et. Our study is based on an 8-h corpus covering 8 different speaking styles, with an average of 10 speakers per communicative situation. The tokens were classified depending on whether they are being used as discourse markers (DMs) or not; additionally, in the case of et used as a conjunction, the type of the co-ordinated syntactic elements was identified. An automated prosodic analysis of all occurrences was performed. Results show that the use of et as a DM was more prevalent in non-planned speech; silent pauses preceded occurrences of alors and et, both as DMs and as non-DMs; the difference in silent pause duration between the DM uses and the non-DM uses was not statistically significant for alors and was statistically significant for et; DMs did not systematically constitute a separate prosodic unit; and a strong prosodic boundary differentiates between the use of et as a DM or as a co-ordinating conjunction between verb phrases and subordinate clauses, and its other non-DM uses.
Keywords: Prosody · Discourse markers · Corpus linguistics · French

1 Introduction
Spoken language comprehension entails multiple tasks for the listener, such as segmenting the incoming stream of speech, lexical access, syntactic parsing, integration of information into some form of cognitive representation, and understanding of discourse relations. Prosody plays an important role in all these steps, by guiding the listener's comprehension (for a review, see [1,5,7]). The relationship between prosody and information structure, whether specific prosodic structures cue specific discourse relations, and whether prosody can facilitate the processing of discourse relations are research questions whose importance is increasingly recognised. Fraser defines discourse markers as "a class of lexical expressions drawn primarily from the syntactic classes of conjunctions, adverbs, and prepositional phrases [that] with certain exceptions, signal a relationship between the
interpretation of the segment they introduce, S2, and the prior segment, S1. They have a core meaning, which is procedural, not conceptual, and their more specific interpretation is 'negotiated' by the context, both linguistic and conceptual" [8]. Discourse markers aid in the segmentation of speech (similarly to punctuation marks in written language), and Schiffrin defines them as "sequentially dependent elements which bracket units of talk" [15]. In this study, we investigate the prosodic characteristics associated with the use of two discourse markers in French: alors (then) and et (and). Are there specific prosodic features that can distinguish between the use of these words as a discourse marker, and their use as an adverb or a conjunction (respectively)? When used as a conjunction, the token et may link (co-ordinate) two segments at different syntactic levels (e.g. two noun phrases, two adjectives). When used as a discourse marker, et may convey several discourse relations; this is also the case for alors [16]. In this study, we will investigate whether there are prosodic characteristics that distinguish between these uses, on the basis of the C-PhonoGenre corpus [10], an 8-h corpus covering 8 different speaking styles.
2 Related Work
Studies have attempted to investigate the phonetic and prosodic properties of discourse markers in speech, using both experimental and corpus-based approaches. For example, [11] confirm the importance of intonation in interpreting the Swedish DM men (but/and/so), and in choosing between its sentential interpretation and its interpretation as a DM. They show that when the token men is used as a discourse marker, it has a positive f0 reset, with a mean value of 13.8 ST when preceded by a glottalisation and of 5.7 ST without glottalisation, whereas in the case of sentential tokens the mean value of the f0 reset was 2.2 ST. In English, it has been claimed that DMs constitute a separate prosodic unit surrounded by brief pauses, and that this configuration helps distinguish between DMs and other uses of the same token. However, [12] show that DMs only form a separate intonation unit when opening/closing a conversation or when marking transitions from one topic to another. [12] postulate that the intonation of DMs depends on the speaker's perception of how important a particular marker is, and that therefore the relationship between the function of a DM, its prosodic characteristics and its position in the utterance is arbitrary. It has to be noted that studies on the subject are scarce, and therefore it is not yet possible to draw clear conclusions (also given the large number of different discourse markers, and the fact that few languages have been studied). The present study should be read in conjunction with [6], a speech elicitation experiment on the use of the DMs alors and et in French. In this experiment, twenty adult native speakers of French were asked to prepare and to read aloud 64 sequences consisting of a first segment, the discourse marker alors or et, and a second segment; all first segments were extracted from a speech corpus. The sequences were constructed in order to convey one of six predefined discourse relations. The prosodic characteristics of the resulting recorded utterances were analysed, and results suggest that the silent pause duration before the
DM, as well as the absolute duration of the DM itself are used by the speaker to differentiate between the core meaning of the DM and its less predictable meanings; and that DMs did not systematically constitute a separate prosodic unit. Our study will try to re-evaluate these findings by analysing the occurrences of the tokens alors and et in a corpus that better represents natural and contextualised language use.
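For reference, the semitone (ST) values quoted above express f0 resets as logarithmic ratios between two pitch values in Hz; the conversion below is the standard one (the example frequencies are arbitrary).

```python
# Standard conversion of an f0 ratio into semitones.
import math

def f0_reset_st(f0_after: float, f0_before: float) -> float:
    return 12.0 * math.log2(f0_after / f0_before)

f0_reset_st(220.0, 200.0)   # ~1.65 ST upward reset
```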
3 Corpus and Methodology
3.1 The C-PhonoGenre Corpus
The corpus used in this study is C-PhonoGenre [10], which was compiled to study situation-dependent speaking styles in French and the associated prosodic variation. It contains data from 8 speaking styles: instructional speech [DIDA]; spontaneous narration [NARR]; speeches during "Question Time" at the French parliament [PARL]; religious sermons [RELG]; radio press reviews [RPRW]; three kinds of sports commentary [SPOR]: rugby, basketball and football; presidential New Year's wishes [WISH]; and weather forecasts [MET]. The average sample duration per speaker is 5:30 min. Each speaking situation was further described by features on four dimensions (Audience, Media, Preparation, Interactivity), each taking the value 0, 1 or 2. The corpus composition is presented in Table 1.

Table 1. Composition of the C-PhonoGenre corpus.

Genre  Sub-genre(s)                         Nb   Dur (min)  Syll     Tokens
DIDA   Radio, TV, Lecture                   17   100        26304    18717
NARR   Narration                            10    44        11396     9546
PARL   Question, Answer                     10    20         5710     3613
RELG   Mass on the Internet, Sermon on TV    7    54         8726     6141
RPRW   Radio press review                   15    95        26359    17531
SPOR   Basket, Rugby/football                5    35         7601     5305
MET    Weather forecast                     10     9         2861     1947
WISH   Pres. New Year                       15    98        18614    12578
Total                                       89   455       107571    75378
For example, Media = 1 indicates speech directed to an individual or a small group, yet in front of a microphone or camera (indirect audience), and Preparation = 1 indicates semi-prepared speech, situated between spontaneous and read speech. In the case of parliamentary debates, a question is prepared, while the answer is semi-prepared. Interactivity indicates whether the main speaker may be interrupted. The values for each dimension and each speaking style in the C-PhonoGenre corpus are also indicated in Table 1.
Annotation Methodology and Feature Extraction
The C-PhonoGenre corpus has been manually transcribed orthographically, and a phonetic transcription and segmentation were obtained using EasyAlign [9]; the alignment was manually corrected. A single annotator added speech delivery information: (i) disfluencies, articulation and phonological phenomena (schwa, vowel lengthening whether or not associated with hesitation, creaky voice, liaison and elision); (ii) symbols to distinguish between complete silence, audible and less audible breaths, and mouth noises; (iii) indications of paralinguistic phenomena (laughter, coughing) and external sounds; (iv) overlapping segments and syntactic interruptions.
The C-PhonoGenre corpus has been processed using the annotation pipeline for French in Praaline [2]. The DisMo annotator [3] was applied to the entire corpus, providing part-of-speech and disfluency annotations. Pitch stylisation was performed using Prosogram [13]. An automatic annotation of prosodic prominence and prosodic boundaries was performed using Promise [4]. Features extracted using these plug-ins are stored in an SQL database, and include durations (of pauses, segments, syllables, etc.), pitch information (e.g. intonation contour descriptors), and symbolic annotations (e.g. prominences and boundaries). The database from Praaline was linked to the R statistical software [14] for analysis.
Finally, all occurrences of the tokens alors and et were identified using Praaline's concordancer, and they were manually annotated depending on whether the token is being used as a discourse marker (cf. the definition given in the Introduction). Additionally, in the case of et used as a conjunction, we annotated the type of the co-ordinated syntactic elements as follows (Table 2).

Table 2. Annotation scheme for et when used as a conjunction and not a DM.

Code      Co-ordinated elements    Example
np_np     Noun phrase              ses idées et ses valeurs
pp_pp     Prepositional phrase     dans l'hôpital et dans la médecine
adj_adj   Adjective/Complement     fort et cohérent
vp_vp     Verb phrase              consommons et rejetons
sub_sub   Subordinate clauses      qui se diront et qui se souviendront
num       Number                   vingt et un
other     Other cases
98
4 4.1
G. Christodoulides
Results and Discussion Discourse Markers and Speaking Style
In the following we will present the main results of the analysis of the corpus. There were 1944 occurrences of et and 177 occurrences of alors in all samples. In the case of alors, it was used as a discourse marker in 138 (77.9%) of the cases; in the conjunction alors que (while) in 35 of the cases and as an adverb in 4 cases. The distribution of the different uses of et, normalised by the number of tokens, by speaking style is given in Fig. 1. Genre Total tokens Conjunction np_np pp_pp vp_vp adj_adj sub_sub locution num other Discourse Marker Total
PARL DIDA RELG MET NARR RPRW SPOR WISH Total 3613 18717 6141 1947 9546 17531 5305 12578 75378 1.63% 1.06% 1.87% 1.64% 0.68% 1.19% 0.55% 2.50% 1.35% 0.69% 0.41% 0.47% 0.72% 0.08% 0.55% 0.25% 0.87% 0.49% 0.42% 0.24% 0.47% 0.41% 0.10% 0.25% 0.09% 0.79% 0.34% 0.14% 0.10% 0.67% 0.21% 0.14% 0.15% 0.04% 0.25% 0.19% 0.11% 0.07% 0.11% 0.21% 0.04% 0.11% 0.00% 0.29% 0.12% 0.08% 0.12% 0.10% 0.00% 0.13% 0.07% 0.09% 0.20% 0.11% 0.03% 0.06% 0.03% 0.05% 0.15% 0.03% 0.00% 0.04% 0.05% 0.03% 0.05% 0.02% 0.05% 0.04% 0.02% 0.08% 0.04% 0.04% 0.14% 0.01% 0.00% 0.00% 0.00% 0.03% 0.00% 0.03% 0.02% 1.00% 1.46% 0.70% 0.67% 2.46% 0.79% 2.21% 0.53% 1.22% 2.63% 2.52% 2.57% 2.31% 3.14% 1.98% 2.75% 3.03% 2.58%
Fig. 1. Distribution of different uses of et, by speaking style (normalised by the number of tokens).
We observe that in communicative situations where we have spontaneous, non-planned speech (e.g. NARR, SPOR) the majority of the occurrences of et were discourse markers, while in the more planned speaking styles (e.g. WISH, PARL, RELG), et is used primarily as a conjunction. 4.2
Temporal and Intonational Properties
We then examined the prosodic characteristics of the different uses of alors and et in our corpus. Figure 2 shows the distribution of the length of silent pauses before DM and non-DM uses of the two tokens. We observe that DM are often preceded by silent pauses; we observe that this is also the case for occurrences of et used as a conjunction between verb phrases and subordinate clauses. Furthermore, both DM and non-DM uses of the two tokens were almost never followed by a silent pause. Articulation rate did not significantly vary depending on the DM or non-DM use of the two tokens. A pitch reset is a prosodic signal for segmentation between the end of a discourse segment and a discourse marker introducing the next discourse segment. Figure 3a shows the pitch movement between the last syllable of the segment between the token alors or et, by its use (as a discourse marker or not). We observe that DM uses of alors tend to have a flat contour, but there is no other
The Prosody of Discourse Makers alors and et in French alors
et 1.00
Silent Pause Before (s)
1.00
Silent Pause Before (s)
99
0.75
0.50
0.25
● ● ●
●
0.75
●
● ●
● ● ●
●
●
● ● ● ● ● ● ● ●
● ● ●
0.50
● ● ●
● ● ● ● ● ● ●
0.25 ●
Sub−category
r
M D
C
O N
ot he
b
nu m
su
b_
O N
C
j
p O N
su
vp
_v
ad O N
C
C
C
O N
ad
j_
_p pp
O N C
O N
N
C
on
D
−D
M
M
np
_n
p
0.00
p
0.00
CON np_np
CON adj_adj
CON sub_sub
CON other
CON pp_pp
CON vp_vp
CON num
DM
Non−DM
Fig. 2. Pause duration before the token, for DM and non-DM uses of alors (left) and et (right).
significant use of prosodic cues to differentiate between DM and non-DM uses of alors and et. With respect to the duration of the two tokens, we do not observe a significant difference between DM and non-DM uses, as can be seen on Fig. 3b. 4.3
Prosodic Prominence and Boundaries
We have also examined the percentage of prosodically prominent syllables, and syllables carrying a prosodic boundary, immediately preceding the tokens alors and et. The results for prosodic prominence are shown in Fig. 4, and for prosodic boundaries in Fig. 5. We can observe that uses of alors as a DM are preceded by a strong prosodic boundary in 54% of the occurrences, compared to 38% of alors
et
4
alors
4
0.5
et 0.5
●
3
2
1
0
0.4
3
2
1
DM T2
Non−DM T1
Non−DM T2
Category
●
●
● ● ● ●
● ●
DM T2
Non−DM T1
Non−DM T2
DM
(a) Pitch movement between the end of S1 and the DM alors (left) or et (right).
0.3
● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0.2
0.1
0.0 DM T1
Non−DM
0.2
0.4
●
0.1
0 DM T1
0.3
● ● ● ● ● ●
Syllable duration (s)
●
Syllable duration (s)
Inter−syllabic movement (ST)
Inter−syllabic movement (ST)
● ●
0.0 DM T1
DM T2
Non−DM T1
Non−DM T2
Category
DM T1
Non−DM
DM T2
Non−DM T1
Non−DM T2
DM
(b) Duration of the DM, for disyllabic alors (left) and monosyllabic et (right).
Fig. 3. Duration and Pitch reset for DM and non-DM uses of the tokens. T1 and T2 are the first and second syllables of the target DM respectively.
100
G. Christodoulides alors
et 1.00
% prominent syllables
% prominent syllables
1.00
0.75
0.50
0.25
0.75
0.50
0.25
CON adj_adj
CON sub_sub
CON other
CON pp_pp
CON vp_vp
CON num
DM
DM
CON num
CON other
CON vp_vp
CON np_np
CON sub_sub
CON adj_adj
CON np_np
Non−DM
DM
Sub−category
CON pp_pp
0.00 0.00
Non−DM
Fig. 4. Prominent syllables (percentage) at the last syllable before the token alors (left) or et (right).
the occurrences (there is no significant difference for prominence though). We also observe that uses of et as a discourse marker are also preceded by a strong prosodic boundary in 48% of the cases. This finding would not be enough to distinguish between DM and non-DM uses of et, as a strong prosodic boundary is present in 46% of its uses as a conjunction between verb phrases and 40% of its uses as a conjunction between two subordinate clauses. alors
et 1.00
% boundary syllables
% boundary syllables
1.00
0.75
0.50
0.25
0.75
0.50
0.25
CON np_np
CON adj_adj
CON sub_sub
CON other
CON pp_pp
CON vp_vp
CON num
DM
Non−DM
Boundary
DM
CON num
CON other
CON sub_sub
CON vp_vp
CON adj_adj
Non−DM
DM
Sub−category
CON pp_pp
CON np_np
0.00 0.00
B2
Fig. 5. Prosodic boundaries (percentage) at the last syllable before the token alors (left) or et (right). B3 = major prosodic boundary and B2 = medium prosodic boundary.
5
Conclusion and Perspectives
In this study, we investigated the prosodic characteristics of alors and et, two words that are often used as discourse markers in French. We conducted a corpusbased study, based on an 8-h corpus covering 8 different speaking styles, and the results can be summarised as follows: – The use of et as a discourse marker was more prevalent in non-planned speech.
The Prosody of Discourse Makers alors and et in French
101
– Silent pauses preceded occurrences of alors and et, both as DMs and as nonDMs. The Mann-Whitney U non-parametric test shows that the difference between the preceding pause length in the DM uses vs in the non-DM uses was not statistically significant for alors and was statistically significant for et. In this respect, our corpus study only partly confirms the results of the speech elicitation experiment in [6]. – DMs did not systematically constitute a separate prosodic unit, and both DM and non-DM uses of the two tokens were almost never followed by a silent pause. However, in the case of et, a strong prosodic boundary differentiates its use as a discourse marker or as a co-ordinating conjunction between verb phrases and subordinate clauses, and its other non-DM uses. – There were no statistically significant differences in the articulation rate and in token duration, between the DM and non-DM use of alors and et. We plan to expand this study in two directions. First, an annotation of discourse relations expressed by the 138 uses of alors and the 922 uses of et as a discourse marker, in order to further investigate whether specific prosodic cues are linked to specific prosodic relations. Secondly, we plan to replicate this corpus study on a corpus with longer recordings, so that we can test the effects of individual variation (by examining more occurrences of each token produced by the same speaker). An application of the results of the present study is also envisaged. While prosodic cues seem not to be sufficient to distinguish between DM and non-DM uses of et, we would like to test whether the prosodic information identified as pertinent by the present study (i.e. preceding silent pause length and preceding prosodic boundary) can be used to improve the accuracy of statistical parsing of transcriptions. The prosody associated with the expression of discourse relations, or with the use of certain discourse markers, is highly variable. If such an association does indeed exist, for some specific discourse markers, or in some specific cases of discourse relations (e.g. for the purposes of disambiguation), studies on very large corpora will be needed before we are able to extract meaningful patterns from the data. This is because the prosody of an utterance is influenced by multiple factors, including several factors that are totally unrelated to discourse structure, and because the observed individual variation in the prosodic realisation of discourse relations is fairly high. While experimental studies may indicate relevant acoustic correlates, they are not enough and should be reviewed in light of corpus data, to avoid conclusions based on spurious correlations. More studies, on larger corpora and controlling for individual variation, are needed.
References 1. Carlson, K.: How prosody influences sentence comprehension. Lang. Linguist. Compass 3(5), 1188–1200 (2009) 2. Christodoulides, G.: Praaline: integrating tools for speech corpus research. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 26–31, Reykjavik, Iceland, pp. 31–34 (2014). http://www. praaline.org
3. Christodoulides, G., Avanzi, M., Goldman, J.P.: DisMo: a morphosyntactic, disfluency and multi-word unit annotator. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 26–31, Reykjavik, Iceland, pp. 3902–3907 (2014) 4. Christodoulides, G., Avanzi, M., Simon, A.C.: Automatic labelling of prosodic prominence, phrasing and disfluencies in French speech by simulating the perception of na¨ıve and expert listeners. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association, Interspeech 2017, 20–24 August 2017, Stockholm, pp. 3936–3940 (2017) 5. Cutler, A., Dahan, D., van Donselaar, W.: Prosody in the comprehension of spoken language: a literature review. Lang. Speech 40(2), 141–201 (1997) 6. Didirkov´ a, I., Christodoulides, G., Simon, A.C.: The prosody of discourse markers alors and et in French. A speech production study. In: Proceedings of Speech Prosody 2018, Poznan (2018) 7. F´ery, C.: Intonation and Prosodic Structure. Key Topics in Phonology. Cambridge University Press, Cambridge (2017) 8. Fraser, B.: What are discourse markers? J. Pragmat. 31, 931–952 (1999) 9. Goldman, J.P.: EasyAlign: an automatic phonetic alignment tool under Praat. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association, Interspeech 2011, 27–31 August 2011, Florence, pp. 3233–3236 (2011) 10. Goldman, J.P., Prsir, T., Christodoulides, G., Auchlin, A.: Speaking style prosodic variation: an 8-hour 9-style corpus study. In: Campbell, N., Gibbons, Hirst, D. (eds.) Proceedings of Speech Prosody 2014, pp. 105–109 (2014) 11. Horne, M., Hansson, P., Bruce, G., Frid, J., Filipsson, M.: Discourse markers and the segmentation of spontaneous speech: the case of Swedish men ‘but/and/so’. Working Papers, Lund University, Department of Linguistics, vol. 47, pp. 123–139 (1999) 12. Komar, S.: The interface between intonation and function of discourse markers in English. Engl. Lang. Overseas Perspect. Enq. (ELOPE) 4(1–2), 43 (2007). https:// doi.org/10.4312/elope.4.1-2.43-55 13. Mertens, P.: The Prosogram: semi-automatic transcription of prosody based on a tonal perception model. In: Proceedings of Speech Prosody 2004, 23–26 March 2004, Nara, pp. 549–552 (2004) 14. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2017). https://www.R-project.org/ 15. Schiffrin, D.: Discourse Markers. Cambridge University Press, Cambridge (1987) 16. Uygur-Distexhe, D.: Right peripheral discourse markers in SMS: the case of alors, donc and quoi. Papers from the Lancaster University Postgraduate Conference in Linguistics and Language Teaching (2010)
Choosing a Dialogue System's Modality in Order to Minimize User's Workload
Adam Chýlek, Luboš Šmídl, and Jakub Nedvěd
NTIS - New Technologies for Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
{chylek,smidl,nedvedj}@ntis.zcu.cz
Abstract. The communication during human-machine interaction often happens only as a secondary task that distracts the user's focus from a primary task. In our study, the primary task was driving a vehicle and the secondary task was an interaction with a dialogue system on a tablet device using touch and speech. In this paper we present the design and the analysis of a study that can be used to create an optimal strategy for a dialogue manager that takes several metrics into consideration. These include the type of the information we require from the user, the expected cognitive load on the user, the expected duration of a user's response and the expected error rate.
Keywords: Dialogue system · Choice of modality · Lane change test

1 Introduction
Multimodal dialogue systems start to play a role in cooperative robotics in industry and in interactive systems in our day-to-day lives. They also present a distraction from some tasks, such as checking your surroundings when walking, operating industrial machines or driving a car. We will focus our attention on secondary tasks that require touch or speech as their input modality. Our motivational use case is filling an electronic journey log while driving (e.g. logging the arrival at a destination or the offloading of a cargo). The electronic logging happens via a device with a touchscreen or using an automatic speech recognition system. The driver's main focus here should, of course, be on the driving, but we also want to make sure that the log is filled in a timely manner. This leaves us with the problem of correctly choosing the types of input that we want from the user and the correct modality that won't distract the user too much and that also won't cause too many problems with the actual dialogue (like error corrections, recognition timeouts, etc.). Since driving is the primary task in our use case, we have used the ISO 26022 standard [4] for the assessment of the impact of secondary tasks on a driver of a motor vehicle. This standard provides a lane change task in a simulated
environment, so we can safely test several workload-heavy tasks and later analyse the recorded data. The results of the analysis will allow us to create a situation-aware dialogue system that uses the right modality for the given situation.
2 Related Work
The lane change test (also often called the lane change task) is commonly used to assess the effect of visual-manual interaction on the primary task of driving [1,6,8]. Similarly to us, the authors of [6] evaluated several different styles of visual presentation on handheld devices, but a speech interface was not tested. In [10] a spoken interaction was compared to a visual interaction using questionnaires. The spoken interaction was preferred by the subjects and their perceived cognitive load was lower in that case. We can extend these findings by analysing in which situations the visual interaction would be beneficial, and by backing the findings with changes in performance on simulated tasks. Other researchers have focused on incremental dialogue processing [5], which allows the dialogue system to continually monitor the state of the environment and adjust the interactions with the user accordingly. The point of this study is not to decide whether the primary task is influenced by either of the modalities, as it has already been shown, e.g. in [2,3], that both modalities do in fact have an impact on driving. Our goal is to have the basis for a strategy that could minimize the impact on cognitive load individually for different types of input information and different requirements from the dialogue manager (the duration of the response, an expected error rate). Related to our concept of a dialogue is also a multimodal system that requires fusion of both modalities (as opposed to our use of only a single modality at a time). The analysis of modality choices with increasing load was done in [7]. The authors concluded that with increasing task difficulty, users started to prefer multimodal interactions over unimodal ones.
3 Experiment Design
Our experiment was designed to resemble the motivational example: a simulation of a car and a simulation of a dialogue system on a touchscreen device.
3.1 Hardware and Software Setup
The hardware part of the setup consisted of a PC, a 26″ LCD display with speakers, a gaming steering wheel with pedals and an Android tablet for the dialogue system (Fig. 1b). On the tablet, there was an offline automatic speech recognition (ASR, based on [9]) system that processed the spoken input on the device itself and an offline text-to-speech (TTS, [11]) system. The tablet presented a graphical user interface (GUI, Fig. 2) to the user for the touch interactions.
Fig. 1. Setup of the experiment: (a) software, (b) hardware.
The PC was running a simulation program called LCTSim (downloadable from https://isotc.iso.org/livelink/livelink?func=ll&objId=11560806) (Fig. 1a) that had been set up according to the ISO standard (the lane change test). The position of the vehicle on the track, as well as the steering wheel angle and speed, were recorded from the simulator at approximately 200 records per second. The events from the subsystem that handled the secondary task were recorded separately and were later merged with the simulation's log. The following types of events were used: a task was displayed, the user answered, the answer was correct or incorrect, the task timed out.
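To make the merging step concrete, the following is a minimal sketch (not the authors' code) of joining the two logs by timestamp; the record layout and field names ("t", "offset_m", "event") are hypothetical.

```python
import bisect

def merge_logs(sim_records, task_events):
    """sim_records: time-sorted simulator samples, e.g. {"t": 12.005, "offset_m": 0.31}.
    task_events: time-sorted secondary-task events, e.g. {"t": 12.4, "event": "task_shown"}.
    Attaches each event to the first simulator sample at or after its timestamp."""
    times = [r["t"] for r in sim_records]
    merged = [dict(r, event=None) for r in sim_records]
    for ev in task_events:
        i = bisect.bisect_left(times, ev["t"])
        if i < len(merged):
            merged[i]["event"] = ev["event"]
    return merged
```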
3.2 Primary Task
We have chosen a lane change test that conforms to the standard ISO 26022. This test consists of a 3 km three-lane straight road with equally spaced road signs. These road signs appear every 150 meters and indicate to which lane the participant should change. At most 18 lane changes were possible and the subject was expected to finish the scenario and the primary task before the track's end. The simulator limited the speed to 60 km/h and the participants were instructed not to slow down.
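As a quick sanity check on the timing of the primary task (our own arithmetic from the figures above, not stated in the paper):

```python
track_length_m = 3000          # 3 km straight road
sign_spacing_m = 150           # a road sign every 150 m
speed_ms = 60 / 3.6            # 60 km/h is roughly 16.7 m/s

print(track_length_m / speed_ms)   # about 180 s for one pass of the track
print(sign_spacing_m / speed_ms)   # about 9 s between consecutive signs
# so the 20 s limit per secondary task (Sect. 3.4) spans roughly two sign intervals
```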
3.3 Secondary Task
The secondary task was designed to represent an interaction with a dialogue system. It consisted of inputting several pieces of information one at a time using the available modalities. We have prepared the following templates for the GUI to test common types of input elements (Fig. 2): a short list that fits on a screen, a long list with a search field, a text input, a date input as a spinner, a time input as a spinner, a grid of images and a dialogue window with buttons. Several tasks were created based on these templates. These tasks allowed filling the information using either this GUI or an ASR and will be listed in Sect. 3.4. For each task, the user would see the objective on the screen as well as hear the same text synthesized using TTS. In order to mimic real-world conditions,
the users did not wear a microphone or headphones. The ASR used the tablet's built-in microphone and the TTS used the tablet's speaker. Another speaker was connected to the PC, and the simulation program emulated the sound of an engine. The entirety of the test (instructions, tasks, the TTS and the ASR) was in Czech.
3.4 Scenarios
The experiment was divided into 5 sessions. The participants first had to perform a training session. They drove on a track without any secondary tasks to get comfortable with the controls of the vehicle, with the appearance of the signs and with the primary task of changing lanes when instructed by the sign. They were instructed that changing the lanes as quickly and accurately as possible had the highest priority during the rest of the sessions. The subjects could start the next session at their own discretion. The order of the lane changes during the second session was different from the previous one. This session was also without any secondary task. This way we could obtain a reference drive (we recorded the participant's reactions to the signs without any workload from a secondary task). The rest of the sessions (3 to 5) continued on the same track (with the same order of lane changes) but now with secondary tasks that had the following restrictions: During the third session, the participants were forced to use only the GUI to fulfil the objective. After the last task was completed, the same track was loaded again from the start and a set of tasks for the fourth session started. This time the participant had to complete the tasks using only speech. The ASR had a constrained language model in order to recognize only the options that were presented (e.g. colours for the 1st task, numbers for the 2nd, etc.). After completing this set of tasks, the same track was loaded for the last time. The choice of the modality for the last set of tasks was up to the users. To complete the task they could use the GUI or the ASR without any restrictions. This also meant that when the ASR failed to recognize their commands they could use the GUI to complete the task and vice versa. The participant had 20 s to perform the given task. Otherwise, the next task was shown. If an incorrect input was made, the subject was notified and could try again until the task succeeded or timed out. The tasks were always shown in the same order but with different values to be filled each time during the test (to mitigate habituation). Throughout the paper, we will refer to them using their order of appearance. The tasks' objectives were as follows:
1. choose a colour from a grid
2. input a number into a text field
3. choose from two buttons
4. input a date using a text field
5. choose a picture from thumbnails in a grid
6. input a time using a text field
7. choose from a short list of items
8. choose from three buttons
9. choose from a long list of items with an active search field
10. input a date using the system's date input method (a spinner)
11. choose from a short list of items
12. input a time using the system's native time input method
These tasks were designed not only to test all the basic input types on a smart device but also to test whether the amount of information that is shown or that needs to be typed has any effect on the results. This is why a text field, an image grid, a list of items and buttons are included multiple times. Concretely, the 1st task was designed as an easier image-selection version of task 5. The text input in the 2nd task is an easier version of tasks 4 and 6. Task 3 is simpler than the 8th task. The lists of items in the 7th and the 11th task contained fewer items than in task 9. Also, the native date and time input methods (tasks 10 and 12) were supposed to be easier than typing into a text field (tasks 4 and 6).
Fig. 2. Example of different input types used for secondary task.
3.5 Participants
There were 20 participants between 21 and 62 years of age (mean age 32.7, standard deviation of 9.7 years). All participants were native Czech speakers familiar with driving a car and using a touchscreen device.
4 Results
For the purpose of our analysis, we chose as a reference a drive through the track without secondary tasks. It is also possible to create a theoretical “ideal” drive based on the position of the signs and a fixed distance needed for a lane change. The results using these references differed only slightly and after manual
comparison of the results of several sessions, we concluded that the ideal reference corresponded to reality less closely than the chosen reference drive. In the following paragraphs, we will have to distinguish two types of positions on the track. We define the position between the lanes as an offset from the centre of the middle lane ("offset" for short) and the position along the track as a "distance". Several metrics will be evaluated to measure the impact of the secondary tasks on the performance of the primary task. These metrics can later be used by a dialogue manager to create a strategy based on the expected impact. The mean of the differences between the offset of the reference drive and the drive with a secondary task (referred to simply as the "mean difference") is one of the metrics we assess. The duration of the task (until it was successfully finished or until it timed out) was chosen as another metric, and finally the error rate of the answers is the last metric. We have included the overall results regarding mean duration and mean difference for each modality in Table 1. We can see that if a simple strategy is needed, we can leave the choice of the modality to the user, as it offers the best overall performance. But this would require the dialogue that uses this strategy to have a similar composition to our scenarios. Because that would often not be the case, we will take a closer look at each individual type of task in the following sections.

Table 1. The overall statistics for each type of scenario. Mean difference from a reference pass of each participant and mean duration of a scenario (from the start until all the tasks of the scenario have been finished).

Modality             Touch    Voice    User's choice
Mean difference [m]  1.05     0.76     0.73
Mean duration [s]    132.3    127.98   123.3
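A minimal sketch of how the "mean difference" metric could be computed; the use of absolute differences and of linear interpolation onto the reference drive's distance axis is our assumption, as the paper does not spell these details out.

```python
import numpy as np

def mean_offset_difference(dist_ref, offset_ref, dist_task, offset_task):
    """Offsets in metres from the centre of the middle lane, distances in metres along the track."""
    # resample the task drive onto the distance axis of the reference drive
    offset_task_resampled = np.interp(dist_ref, dist_task, offset_task)
    return float(np.mean(np.abs(offset_task_resampled - offset_ref)))
```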
4.1 Comparing Mean Offset Differences
Although the overall results can be interesting on their own, we wanted to analyse each kind of input separately. We compared the mean difference for each given task across all the subjects. These results can be seen in Table 2. The task numbers refer to the order in which the tasks were shown to the user, as defined in Sect. 3.4. A smaller difference is better. From these results, we can see that using only touch for the interaction resulted in poor performance for tasks 3 to 12 (tasks 5 to 10 are significantly the worst with p < 0.05). This metric clearly does not favour using touch, with one exception: the 1st task. On one hand, using touch for the first task of selecting a colour was significantly better (p = 0.1) than using speech. On the other hand, choosing a more complex image from a grid (task 5) proved to be
Table 2. Mean offset from the reference drive (in meters) for each task based on modality. A standard deviation is in brackets; the best performing modality is the one with the lowest value, and ∗ marks a significant difference from the next best performing modality (p < 0.05).

Task            1      2      3      4      5      6      7      8      9      10     11     12
Voice           0.51   0.92   0.76   0.73   0.98   0.94   0.82   0.61∗  0.51   0.75   0.89   0.71
                (0.36) (0.83) (0.68) (0.50) (0.97) (0.95) (0.50) (0.52) (0.34) (0.51) (1.06) (0.63)
Touch           0.37   0.82   0.85   0.92   1.30   1.29   1.53   1.27   1.10   1.07   1.11   0.94
                (0.30) (0.57) (0.65) (0.45) (1.17) (1.00) (1.15) (1.54) (0.82) (0.90) (1.33) (0.42)
User's choice   0.40   0.71   0.73   0.90   0.77∗  0.67∗  0.89   0.78   0.61   0.59   0.95   0.82
                (0.25) (0.39) (0.67) (0.65) (0.65) (0.69) (0.99) (0.31) (0.32) (0.29) (0.95) (0.49)
more demanding. For real human-machine interaction we could argue that the use case would more often resemble the more complex fifth task than the first one. From this, we can say that forcing the user to use a touch interface does not look like a viable strategy for any of the input types. Leaving the choice of the modality up to the user proved to be marginally beneficial in 3 tasks and significantly better in 2 tasks. It was also never the worst performing setup. These input types had a common theme: short or simple methods of input. One might think that the user would willingly choose a modality that causes fewer problems during the primary task. But we can argue that some of the users must have chosen a modality that was not optimal; otherwise, the results for the user's choice of a modality would be similar to one of the forced modalities. This was clearly not the case, since the spoken input was marginally better in 4 cases and even significantly better than the rest in 1 case (most of these were input types that would require a lot of typing or visual searching). We can conclude that users may choose a modality that does not always result in the least amount of cognitive load. Comparing performance based on the amount of information presented (as discussed in Sect. 3.4) for inputs of the same type, we can see that presenting less information results in better performance. The same goes for typing, as inputs that required more typing increased the mean difference.
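The per-task significance markers are reported at p < 0.05, but the statistical test itself is not named in the paper; the following sketch assumes per-participant values and Welch's t-test purely for illustration.

```python
from statistics import mean
from scipy import stats

def significantly_better(best_vals, next_best_vals, alpha=0.05):
    """best_vals / next_best_vals: per-participant metric values for one task and modality."""
    _, p = stats.ttest_ind(best_vals, next_best_vals, equal_var=False)  # Welch's t-test
    return p < alpha and mean(best_vals) < mean(next_best_vals)
```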
4.2 Comparing Task Duration
We will now focus on another important aspect of an input in the secondary task - its duration. The results for each run are in Table 3 (shorter duration is better). Here we can see an interesting difference from the previous metric: using the touch interface is significantly faster in 5 cases, marginally in 1. These types of input where touch was faster had in common that they did not require many touch events (like typing or tapping a spinner). If the choice of a modality is left up to the user, it is with the exception of task 11 better than the worst performing modality. Using speech is significantly faster only in 1 task (filling a date into a text field), marginally in 2 tasks.
The worse performance of speech can be attributed partly to the lag of the ASR system, which has to process the input, and partly to the participants occasionally having to repeat the input several times because of ASR errors, as we will show later. We can again compare the tasks with elements of the same type that contain less information versus the ones with more information (e.g. the short list in task 7 versus the long list in task 9). The tasks with less information are completed faster when using touch. With speech, these differences are less pronounced.

Table 3. Mean duration (in seconds) of each task based on modality. A standard deviation is in brackets; the best performing modality is the one with the lowest value, and ∗ marks a significant difference from the next best performing modality (p < 0.05).

Task            1     2     3     4     5     6     7     8     9     10    11    12
Voice           7.2   9.8   8.3   12.2  8.9   10.0  7.6   8.45  8.6   13.3  7.8   13.0
                (3.4) (3.3) (2.7) (3.2) (3.0) (2.9) (0.8) (0.4) (0.8) (7.4) (0.3) (5.4)
Touch           3.0   8.4   4.0∗  16.0  7.2∗  13.3  6.1∗  4.7∗  10.6  13.8  5.1∗  18.5
                (0.7) (4.1) (1.1) (3.9) (4.8) (8.6) (1.9) (1.8) (3.6) (7.4) (1.6) (5.9)
User's choice   3.3   6.4   7.3   11.5  8.3   11.0  7.2   7.8   9.2   12.7  8.6   14.9
                (1.1) (1.3) (1.0) (2.0) (1.2) (2.6) (0.9) (1.0) (2.5) (4.6) (2.8) (6.4)
Table 4. Which modalities did the subject choose and what error rate the modality caused.

Task                  1     2     3     4     5     6     7     8     9     10    11    12
Voice input [%]       25    95    70    95    100   100   65    55    95    85    75    90
Touch input [%]       75    5     30    0     0     0     35    45    0     15    20    5
Input timed out [%]   0     0     0     5     0     0     0     0     5     0     5     5
Voice error rate [%]  64.3  13.6  12.5  17.4  4.8   20.0  7.1   21.4  26.9  43.3  28.6  55.0
Touch error rate [%]  16.7  0     0     0     0     0     0     0     0     0     0     0

4.3 Comparing Modality Choices and Error Rates
During the last session, the user was free to choose the modality. In this section, we will analyse which modality the subject preferred for which task. The detailed results are in Table 4. We can clearly see that using speech as the input method was preferred in most of the tasks, with the first task being the only exception. Interestingly, this theoretically very simple task of choosing a colour resulted in the highest error rate in both modalities. In contrast to this, the similar 5th task (choosing an image) had the lowest error rate. Reasons for this phenomenon could not be found. From the data, it is clear that touch input, although less prone to errors,
is not preferred by the users and they are willing to try and repeat the input several times using speech. This knowledge is important for a dialogue strategy where we expect the recovery from recognition errors to be difficult. Forcing the use of a touch interface instead of speech in these situations will result in lower error rates.

Table 5. Error rates when the user was forced to use one of the modalities.

Task                  1      2      3      4      5      6      7     8     9     10     11    12
Voice error rate [%]  25.93  40.00  20.83  25.00  16.67  9.09   4.76  0     4.76  40.00  4.76  35.48
Touch error rate [%]  0      9.52   4.76   14.29  0      16.67  9.09  0     0     17.39  0     10.53
Voice timed out [%]   0      10     5      10     0      0      0     0     0     5      0     0
Touch timed out [%]   0      5      0      70     0      25     0     0     10    5      0     15
The last metric we analysed was the error rate of the forced modalities. The detailed results are included in Table 5. The voice input was expected to produce errors under the conditions of the test. The 4th task (typing a date) involved a lot of interaction with a virtual keyboard and most of the users were unable to finish the task in time. From the perspective of a dialogue strategy, these data provide valuable insight into the expected error rate of a touch interface. Whenever the user is forced to use a keyboard we should expect increased error rates or longer response times. Choosing from a grid of images or buttons should be preferred.
5 Conclusion
The acquired data and the presented analysis allow us to create a strategy for a dialogue manager that either forces the user to use a certain modality or gives the user a free choice of modality. Such a strategy can be based on several factors that can be used to infer the expected impact on the primary task. For our purposes, this impact was measured as the mean offset from a reference drive without any secondary task, the error rate on the secondary task and the time needed to accomplish a task. The factors that the manager may take into account are the type of the input (e.g. a choice from a list, a date), the amount of presented data (e.g. a choice from two versus twenty images), the requirements on an expected error rate or a limit on the expected duration of the input. The strategy does not have to be based solely on the results of this study; for example, it can be further improved on the fly based on the interaction with the user. If a simple strategy is required, the best overall performance was achieved when the user had a choice of modality. In the near future, a dialogue manager that uses the data from the experiment as the basis of its strategy will be created and evaluated. This will also allow us to analyse whether the knowledge acquired using driving as a primary task is transferable to other primary tasks (e.g. operating a robotic hand).
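To illustrate the kind of strategy meant here, a small sketch follows; the mapping and numbers are only one possible reading of Tables 3 and 5 (forced-modality durations and error rates), not the authors' implementation, and the input-type names are hypothetical.

```python
# expected (duration [s], error rate) per input type and modality,
# taken from Tables 3 and 5 above for illustration only
EXPECTED = {
    "image_grid": {"touch": (3.0, 0.00), "voice": (7.2, 0.26)},
    "buttons":    {"touch": (4.0, 0.05), "voice": (8.3, 0.21)},
    "short_list": {"touch": (6.1, 0.09), "voice": (7.6, 0.05)},
    "long_list":  {"touch": (10.6, 0.00), "voice": (8.6, 0.05)},
    "date_text":  {"touch": (16.0, 0.14), "voice": (12.2, 0.25)},
}

def choose_modality(input_type, max_duration=None, max_error_rate=None):
    options = EXPECTED.get(input_type)
    if options is None:
        return "user_choice"                       # fall back to letting the user decide
    feasible = {m: (d, e) for m, (d, e) in options.items()
                if (max_duration is None or d <= max_duration)
                and (max_error_rate is None or e <= max_error_rate)}
    if not feasible:
        return "user_choice"
    return min(feasible, key=lambda m: feasible[m][0])   # fastest feasible modality
```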
Acknowledgments. This work was supported by the European Regional Development Fund under the project Robotics for Industry 4.0 (reg. no. CZ.02.1.01/0.0/0.0/ 15 003/0000470).
References 1. Benedetto, S., Pedrotti, M., Minin, L., Baccino, T., Re, A., Montanari, R.: Driver workload and eye blink duration. Transp. Res. Part F Traffic Psychol. Behav. 14(3), 199–208 (2011). https://doi.org/10.1016/j.trf.2010.12.001 2. Curin, J., et al.: Dictating and editing short texts while driving. In: Proceedings of the 3rd International Conference on Automotive User Interfaces and Interactive Vehicular Applications - AutomotiveUI 2011 p. 13 (2011). http://dl.acm.org/ citation.cfm?doid=2381416.2381418 3. He, J., et al.: Texting while driving: is speech-based text entry less risky than handheld text entry? Accid. Anal. Prev. 72, 287–295 (2014). https://doi.org/10. 1016/j.aap.2014.07.014 4. Road vehicles – Ergonomic aspects of transport information and control systems – Simulated lane change test to assess in-vehicle secondary task demand. Standard, International Organization for Standardization, Geneva, CH, September 2010 5. Kousidis, S., Kennington, C., Baumann, T., Buschmeier, H., Kopp, S., Schlangen, D.: A multimodal in-car dialogue system that tracks the driver’s attention. In: Proceedings of the 16th International Conference on Multimodal Interaction - ICMI 2014, pp. 26–33 (2014). http://dl.acm.org/citation.cfm?doid=2663204.2663244 6. Louveton, N., McCall, R., Koenig, V., Avanesov, T., Engel, T.: Driving while using a smartphone-based mobility application: evaluating the impact of three multi-choice user interfaces on visual-manual distraction. Appl. Ergon. 54, 196– 204 (2016). https://doi.org/10.1016/j.apergo.2015.11.012 7. Oviatt, S., Coulston, R., Lunsford, R.: When do we interact multimodally? Cognitive load and multimodal communication patterns. In: International Conference on Multimodal Interfaces, pp. 129–136 (2004). http://dl.acm.org/citation.cfm? id=1027957 8. Pitts, M.J., Skrypchuk, L., Wellings, T., Attridge, A., Williams, M.A.: Evaluating user response to in-car haptic feedback touchscreens using the lane change test. Adv. Hum. Comput. Interact. 2012 (2012). https://doi.org/10.1155/2012/598739 9. Praˇza ´k, A., Psutka, J.V., Hoidekr, J., Kanis, J., M¨ uller, L., Psutka, J.: Automatic online subtitling of the Czech parliament meetings. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 501–508. Springer, Heidelberg (2006). https://doi.org/10.1007/11846406 63 10. Silvervarg, A., et al.: Perceived usability and cognitive demand of secondary tasks in spoken versus visual-manual automotive interaction. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1171–1175 (2016). https://doi.org/10.21437/Interspeech. 2016-99 11. Tihelka, D., Stanislav, P.: ARTIC for assistive technologies: transformation to resource-limited hardware. In: WCECS 2011, vol. I, pp. 581–584 (2011)
A Free Synthetic Corpus for Speaker Diarization Research
Erik Edwards1(B), Michael Brenndoerfer2, Amanda Robinson1, Najmeh Sadoughi1, Greg P. Finley1, Maxim Korenevsky1, Nico Axtmann3, Mark Miller1, and David Suendermann-Oeft1
1 EMR.AI Inc., San Francisco, CA, USA
[email protected]
2 University of California Berkeley, Berkeley, CA, USA
3 DHBW, Karlsruhe, Germany
Abstract. A synthetic corpus of dialogs was constructed from the LibriSpeech corpus, and is made freely available for diarization research. It includes over 90 h of training data, and over 9 h each of development and test data. Both 2-person and 3-person dialogs, with and without overlap, are included. Timing information is provided in several formats, and includes not only speaker segmentations, but also phoneme segmentations. As such, it is a useful starting point for general, particularly early-stage, diarization system development.
Keywords: Speaker diarization · Speech activity detection · Open-source corpora
1 Introduction
1.1 Background and Motivation
Speaker diarization is the task of segmenting an audio file with multiple speakers into speaker turns, also known as "speaker indexing" or the "who spoke when" question. This task was first considered for air-traffic control recordings [13,30,34,38], and has since been applied to a variety of applications [1,2,25], most often to 2-person telephone conversations [8,24,36], broadcast radio and television [12,33], and many-person (e.g. 4–10+) meetings [4,43]. Our own application is doctor-patient dialogs [9], usually consisting of 2 speakers, but occasionally 3 speakers, and only very rarely 4+ speakers. We were not able to identify a suitable training corpus for diarization system development, which is understandable given that medical dialogs contain sensitive personal information. A recently-released diarization challenge set (for the "DIHARD" challenge) included some clinical interviews with doctors and autistic children, but it was required to delete the data following the challenge. Also, the speech of children may not be considered to be a typical case study for general system development. Other data sets are proprietary and seem particular to a given recording channel
and/or background noise condition (e.g. air-traffic control). These do not seem ideal for our application or for general system development, where one might prefer to obtain clean speech and then corrupt it with background noise suitable to the application [21,46]. We decided therefore to make our own synthetic corpus of dialogs, which we make freely available for general use, particularly for early-stage and general diarization system development. Of course, this is not intended to replace real-world data, and each applied worker must also obtain data from their own domain. The earliest approaches to diarization used a “bottom-up” approach of clustering feature vectors by similarity [13,30]. These are also called “unsupervised” in the sense that they require no labeled training data [34]. Although these approaches have remained heavily used in the literature [2,4], later systems began to introduce “top-down” or “supervised” approaches [12,38,43]. These require a fair amount of labeled training data in addition to test data. In fact, the first such top-down study [38] was also the first to introduce synthetic dialog data for training purposes. Recent diarization approaches utilize neural networks [18,20,23,41,45], and these can likewise require a large amount of training data. However, the manual segmentation of dialog data is remarkably difficult and time-consuming (as we have attempted ourselves), and therefore prohibitive for most groups undertaking to get started with system development. Moreover, to avoid over-tuning to the test set during system development and architecture search, it is strongly preferable to have separate development and test data sets. A final motivation for our synthetic corpus is that we desired to study the issue of “phoneme specificity” or “phone adaptive training” in speaker diarization [5,7,17,31,35,42,44,47]. This refers to the fact that phoneme acoustic differences confound the detection of speaker acoustic differences. That is, for example, the fricatives of two speakers may be more similar than the fricatives and vowels of the same speaker. In order to address this issue, one generally requires a corpus wherein the phone identities and segmentations are available. We introduce such a corpus here, by using methodology from automatic speech recognition (ASR) to obtain forced alignments of phoneme labels. 1.2
Brief Review of Diarization Data Sets
The first diarization data studied was air-traffic control recordings [13,30,34,38], and an early study of a 5-person meeting quickly followed [43]. The 1997 DARPA Speech Recognition Workshop introduced the ARPA Hub4 task, to transcribe radio and television broadcasts [12,33]. This was the first in a series of diarization and related tasks from ARPA (Advanced Research Project Agency) and NIST (National Institute of Standards and Technology), and over 100 publications have been dedicated to the diarization of such broadcasts. We have not been able to locate the past NIST data sets, and recent ones appear to be accessible only with an LDC (Linguistic Data Consortium) account. Also, they can contain music or other background noises, and they do not generally include a large training set or phonemic information. The second major domain of diarization
research (also over 100 publications) has been multi-person meetings, particularly following the introduction of widely-used corpora of meeting data, namely the ISL Meeting Corpus [6], the ICSI Meeting corpus [19], the AMI corpus [15], and various meeting data sets from NIST, e.g. [11]. Although these are excellent for their domain of application, they involve many speakers (at least 3 speakers, and 4+ speakers in the great majority) and again a particular audiochannel/background-noise scenario. This may not be suitable for early-stage or general diarization system development, or for research focused on 2–3 speakers. Of these, only the AMI corpus (involving 4+ speakers of British or European English) is freely available with a liberal usage license. The third major domain of diarization research has concerned 2-person telephone conversations, of which the stand-out data set has been the Switchboard corpus [14]. This is by far the closest data set to our intended application, but it also has a few drawbacks: It is only available via LDC account, it is sampled at 8 kHz, it seems particular to the given audio channel, and exact overlapped-speech information may not be obtainable. Therefore, it was deemed that, for general, open-source use, particularly outside of the three major application domains, a free synthetic diarization corpus would be necessary, and likely useful to others as well. We therefore focused on finding previous synthetic diarization corpora. As mentioned above, the first to introduce synthetic dialog data [38] was also the first top-down study, where availability of training data becomes critical. Another early top-down study [39] likewise used a simulated dialog corpus, for which they cited a CD-ROM. Neither of these early synthetic corpora are currently available to our knowledge. Almost no mention of synthetic data was made in the years following the 1997 NIST set. We find exactly 2 artificial conversations made from TIMIT data [8,22,40], a small synthetic test set from TIMIT data [10], and one large synthetic set made from TIMIT [26]. The later was only described in a few sentences, but appears quite similar in motivation to ours (e.g., conversations of 2–6 speakers). Unfortunately, none of these TIMIT-based sets are available to our knowledge. A set of synthetic Spanish conversations was found [3], but we do not consider non-English sets here. Therefore, we have developed our own synthetic corpus as a basic starting point for diarization research, derived from the freely available and open-source LibriSpeech corpus [28]. This synthetic diarization corpus is freely available for download at: https://github.com/EMRAI/emrai-synthetic-diarization-corpus.
2 Synthetic Diarization Corpus
The LibriSpeech corpus consists of sections of English audio books recorded at 16 kHz sample rate [28], usually with clear articulation and high-quality audio. It was expected therefore that forced alignment could produce highly accurate (albeit not perfect) phonemic segmentations. The open-source and widely-used Kaldi speech recognition toolkit [29] includes a recipe for ASR training and alignment of the LibriSpeech corpus. The use of this ASR set is also advantageous
because some analyses from the ASR pipeline can be used in diarization. For example, if a universal background model (UBM) or i-vector extractor is trained on the LibriSpeech ASR corpus, it could be used on the synthetic diarization data as well. In brief outline, we have constructed our synthetic corpus as follows (further details will be available from the download page of the corpus). For training data, we use the "train clean 100" subset of the LibriSpeech corpus with 100.6 h of audio. This consists of 585 chapters read by 251 unique speakers (126 male, 125 female), where each chapter has up to 129 utterances. We ranked chapters according to number of utterances, and discarded chapters with fewer than 4 utterances. Alternating chapters in this ranked list were combined into 2-speaker dialogs, with care not to combine the same speaker into a single dialog. The utterances from the 2 speakers were simply alternated until one of the 2 speakers had no further utterances. This resulted in dialogs with 13–259 utterances (median 84). Speakers were combined without respect to gender, resulting in 73 female-female, 65 male-male, and 154 female-male dialogs (292 dialogs total). Dialogs ranged in duration from 2.7–49.6 min (median 17.5 min), yielding 98.15 h in the total training corpus. The LibriSpeech "dev clean", "dev other", "test clean", and "test other" sets were likewise prepared for diarization development and test sets (Table 1).

Table 1. Synthetic 2-person corpus with no overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       292      28522         989715  98.15
Dev-clean   48       2673          53765   4.98
Dev-other   45       2822          50227   4.69
Test-clean  43       2605          52279   5.07
Test-other  45       2861          51305   4.85
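A minimal sketch of the pairing procedure just described (the released corpus was built with the project's own scripts; this is only an illustration, and the handling of same-speaker pairs is simplified):

```python
def make_2speaker_dialogs(chapters):
    """chapters: list of (speaker_id, [utterance_ids]); returns a list of turn sequences."""
    usable = [c for c in chapters if len(c[1]) >= 4]          # discard chapters with < 4 utterances
    usable.sort(key=lambda c: len(c[1]), reverse=True)        # rank by number of utterances
    dialogs = []
    for (spk_a, utts_a), (spk_b, utts_b) in zip(usable[0::2], usable[1::2]):
        if spk_a == spk_b:                                    # never pair a speaker with itself
            continue
        turns = []
        for u_a, u_b in zip(utts_a, utts_b):                  # alternate until one side runs out
            turns.extend([(spk_a, u_a), (spk_b, u_b)])
        dialogs.append(turns)
    return dialogs
```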
Inspired by published statistics of natural conversations [16,37], a small random gap was inserted between speaker turns, as sampled from a Rayleigh distribution with scale parameter (mode) of 200 ms. The longest random draws (i.e. from the long tail of the Rayleigh distribution) were discarded, given that gaps in natural conversations are bounded to some finite value. The actually-used samples ranged from 2 to 819 ms with a mean gap of 240 ms. In each original audio file, the leading/trailing silences were tapered linearly to 0 at start/end, such that no audible transient occurs between speaker turns (i.e. the silent portions transition smoothly into each other). Successive wav files were linearly added into the dialog waveform, with the appropriate offsets, and checked so that no sample accidentally exceeded a range of ±1. Timing information is provided in 3 formats: (1) the Kaldi .ctm format; (2) the NIST .rttm format [27], as required by the widely-used md-eval-v21.pl script
for computing the diarization error rate (DER) [1]; and (3) a simple frame-byframe list of integer labels. In the later, 0 indicates silence, 1 indicates speaker 1, and 2 indicates speaker 2, etc. Integers greater than 10 indicate overlap. In case the direction of overlap is important, these are coded such that “12” means overlap as speaker 1 transitions to speaker 2, and “21” means overlap as speaker 2 transitions to speaker 1. But if the user is only interested in “overlap”, then all integers greater than 10 can be collapsed into one category. For the NIST .rttm format, we provide two versions. In the first, only speaker turns are indicated (with labels 1, 2, etc.), and where all within-speaker gaps of less than 200 ms are ignored, i.e. labeled as speech. This appears to be the most widely used threshold currently, whereas a previous standard used a threshold of 300 ms [27]. In the second set of .rttm files provided, all silences, including gaps less than 200 ms, are explicitly included (with label 0). From these, users could make other thresholds of within-speaker gaps to ignore. The dialog .ctm files include the timing information for individual phonemes, as obtained by forced alignment (from the tri4b stage of the Kaldi recipe for the LibriSpeech ASR corpus, using the Kaldi “ali2phones” utility [29]). These .ctm files from the original forced alignments were simply mapped to the new timeline of the dialog. We followed the provided standard recipe for the ASR pipeline, except that we used our own lexicon, for reasons that will be presented in a separate contribution. In brief, we have been studying a syllabic approach to ASR, and have developed a lexicon with syllabic phonology for these purposes. This has resulted in ∼20% relative improvement in WER, and so this was preferred for forced alignments as well. Moreover, we sought to investigate the use of syllabic structure in diarization (see companion paper), which requires syllabic information from the alignments. Our expanded phone set can be mapped back to the usual ARPAbet phones [32], if desired. Since forced alignment does not work for out-of-vocabulary words, we manually added all such words to our lexicon. This is one of the reasons that we use only the 100-h “train clean” subset of the full LibriSpeech training data. A second version of the corpus incorporates speaker overlap. Because some users may want to compare diarization with and without overlap (but otherwise identical), we used the exact same utterances and alignments as above, with only one difference – in the overlap version we subtract 200 ms from each betweenspeaker interval. This shifts the mode of the ∼Rayleigh distribution to 0 ms, with a range of −198 to 619 ms (mean 40 ms). This is a fairly realistic range of overlap for natural English conversations [16,37], and therefore barely noticeable to the human ear. Note, however, that real-world conversations also include another type of overlap, where one speaker makes a brief utterance or non-speech sound in the middle of the other speaker’s turn (sometimes called “back-channel” speech). We have no statistics for such events, and it is not possible to imitate these easily with just the LibriSpeech data, so no such “back-channel” speech was included in the synthetic corpus (Table 2). Next, a 3-person synthetic dialog corpus was constructed by the same methods as above. However, we do not want all dialogs to have ∼33% representation
Table 2. Synthetic 2-person corpus with overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       292      28522         989715  96.58
Dev-clean   48       2673          53765   4.83
Dev-other   45       2822          50227   4.54
Test-clean  43       2605          52279   4.93
Test-other  45       2861          51305   4.69
of each of the 3 speakers. Although we do not know of any published statistics, it is certainly not the case that all real-world 3-person dialogs have equal time allocated to the 3 speakers. Also, the 3 speakers should not alternate in a simple sequence of 1, 2, 3, 1, 2, 3, etc. As a simple first solution, the sequence was assigned as follows: the first speaker is speaker 1 by definition, and then each subsequent speaker is chosen randomly from the other 2 speakers, until one speaker runs out of available utterances. In this manner, each dialog ends up with a unique sequence of speaker turns, and unique proportions of representation across the 3 speakers. Single speakers took a range of 17.7–44.4% of the dialog turns (mean 33.3%). This method does, however, lose some utterances in each dialog, so the total hours in the corpus is less than for the 2-speaker corpus (Table 3). Dialogs included between 17 and 366 utterances (median 118), and ranged in duration from 2.8–71.5 min (median 24.4 min). Across all 3-speaker dialogs, 22% were same-gender (m-m-m or f-f-f) and 78% were mixed-gender.

Table 3. Synthetic 3-person corpus without overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       195      26694         928346  92.11
Dev-clean   32       2430          48899   4.53
Dev-other   30       2560          45664   4.26
Test-clean  29       2406          47639   4.61
Test-other  30       2684          48025   4.53
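A sketch of the turn-order sampling described above (illustrative only; the seed handling and data layout are ours):

```python
import random

def three_speaker_turn_order(n_utts, seed=0):
    """n_utts: available utterances per speaker, e.g. {1: 40, 2: 35, 3: 50}."""
    rng = random.Random(seed)
    remaining = dict(n_utts)
    order, current = [], 1                     # the first speaker is speaker 1 by definition
    while remaining[current] > 0:
        order.append(current)
        remaining[current] -= 1
        current = rng.choice([s for s in remaining if s != current])  # one of the other two
    return order
```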
The inter-speaker intervals were again chosen randomly according to a Rayleigh distribution with mode of 200 ms (as above), and the actual samples ranged from 1 to 803 ms (mean 242 ms). To create the corresponding 3-person corpus with overlap (Table 4), the identical sequences and values were used, except with 200 ms subtracted from the inter-speaker intervals. This yielded intervals of −199 to 603 ms (mean 42 ms).
Table 4. Synthetic 3-person corpus with overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       195      26694         928346  90.64
Dev-clean   32       2430          48899   4.40
Dev-other   30       2560          45664   4.12
Test-clean  29       2406          47639   4.47
Test-other  30       2684          48025   4.38

3 Discussion and Conclusion
A synthetic corpus of dialogs was made from the open-source LibriSpeech corpus and released for download: https://github.com/EMRAI/emrai-synthetic-diarization-corpus. The corpus includes timing information in several formats, and includes phoneme as well as speaker segmentations. Both 2-speaker and 3-speaker corpora, with and without overlap, are provided. In the future, we will likely add a 4-speaker corpus. Note that dialogs with different numbers of speakers can be combined by a user to obtain a data set where the number of speakers is not fixed. As a synthetic corpus, there are several deviations from real-world data. First, there is very little background noise (but users could add their own for a better approximation to real conditions [21,46]). Second, conversational statistics were approximately mimicked, but cannot be considered perfectly realistic. Third, we included no intervals of truly multi-speaker speech, i.e., “back-channel” utterances by one speaker that occur fully within the turn of another speaker. Fourth, the LibriSpeech corpus itself consists of high-quality readings of audio books, which has certain advantages (such as high-quality phonetic alignments), but also makes the speech unrealistic to most real-world applications. Fifth, although our corpus is gender-balanced, we include no child or other special categories of speech. Finally, we only include 2-speaker and 3-speaker dialogs (and 4-speaker dialogs will be included in a future release). Thus, we explicitly do NOT suggest that the synthetic corpus replaces the need for real-world data; applied workers must also obtain data for each particular application. Nonetheless, we believe that our general-purpose corpus serves as a useful starting point for diarization research, particularly in the early stages of system development, where a very challenging corpus peculiar to one recording situation is often less desirable. We advise the beginning researcher to attempt first the 2-speaker corpus without overlap, and then move on to consider overlap and more speakers, along with real-world data. It is, however, possible that training on this corpus can produce models that generalize to real-world situations (as in our companion paper).
References 1. Anguera Mir´ o, X.: Robust speaker diarization for meetings. Ph.D. thesis, Univ. Polit`ecnica de Catalunya (2006) 2. Anguera Mir´ o, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process. 20(2), 356–370 (2012) 3. Anguera Mir´ o, X., Hernando Peric´ as, F.: Evolutive speaker segmentation using a repository system. In: Proceedings of ICSLP, pp. 605–608. ISCA (2004) 4. Anguera, X., Wooters, C., Peskin, B., Aguil´ o, M.: Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 402–414. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482 34 5. Bozonnet, S., Vipperla, R., Evans, N.: Phone adaptive training for speaker diarization. In: Proceedings of INTERSPEECH, pp. 494–497. ISCA (2012) 6. Burger, S., MacLaren, V., Yu, H.: The ISL meeting corpus: the impact of meeting type on speech style. In: Proceedings of ICSLP, pp. 301–304. ISCA (2002) 7. Chen, I.F., Cheng, S.S., Wang, H.M.: Phonetic subspace mixture model for speaker diarization. In: Proceedings of INTERSPEECH, pp. 2298–2301. ISCA (2010) 8. Delacourt, P., Kryze, D., Wellekens, C.: Speaker-based segmentation for audio data indexing. In: Proceedings of ESCA Tutorial and Research Workshop, pp. 78–83. ISCA (1999) 9. Finley, G., et al.: An automated medical scribe for documenting clinical encounters. In: Proceedings of NAACL. ACL (2018) 10. Gangadharaiah, R., Narayanaswamy, B.: A novel method for two-speaker segmentation. In: Proceedings of ICSLP, pp. 2337–2340. ISCA (2004) 11. Garofolo, J., Laprun, C., Michel, M., Stanford, V., Tabassi, E.: The NIST meeting room pilot corpus. In: Proceedings of LREC, p. 4. ELRA (2004) 12. Gauvain, J.L., Adda, G., Lamel, L., Adda-Decker, M.: Transcribing broadcast news: the LIMSI Nov96 Hub4 system. In: Proceedings of DARPA Speech Recognition Workshop, pp. 56–63. DARPA (1997) 13. Gish, H., Siu, M.H., Rohlicek, J.: Segregation of speakers for speech recognition and speaker identification. In: Proceedings of ICASSP, vol. 2, pp. 873–876. IEEE (1991) 14. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: telephone speech corpus for research and development. In: Proceedings of ICASSP, vol. 1, pp. 517–520. IEEE (1992) 15. Hain, T., et al.: The development of the AMI system for the transcription of speech in meetings. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 344–356. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482 30 16. Heldner, M., Edlund, J.: Pauses, gaps and overlaps in conversations. J. Phon. 38(4), 555–568 (2010) 17. Hsieh, C.H., Wu, C.H., Shen, H.P.: Adaptive decision tree-based phone cluster models for speaker clustering. In: Proceedings of INTERSPEECH, pp. 861–864. ISCA (2008) 18. Ikbal, S., Visweswariah, K.: Learning essential speaker sub-space using heteroassociative neural networks for speaker clustering. In: Proceedings of INTERSPEECH, pp. 28–31. ISCA (2008) 19. Janin, A., et al.: The ICSI meeting corpus. In: Proceedings of ICASSP, vol. 1, pp. 364–367. IEEE (2003)
20. Jothilakshmi, S., Ramalingam, V., Palanivel, S.: Speaker diarization using autoassociative neural networks. Eng. Appl. Artif. Intell. 22(4–5), 667–675 (2009) 21. Kim, K., Kim, M.: Robust speaker recognition against background noise in an enhanced multi-condition domain. IEEE Trans. Consum. Electron. 56(3), 1684– 1688 (2010) 22. Liu, C., Yan, Y.: Speaker change detection using minimum message length criterion. In: Proceedings of ICSLP, pp. 514–517. ISCA (2000) 23. Meinedo, H., Neto, J.: A stream-based audio segmentation, classification and clustering pre-processing system for broadcast news using ANN models. In: Proceedings of INTERSPEECH, pp. 237–240. ISCA (2005) 24. Metzger, Y.: Blind segmentation of a multi-speaker conversation using two different sets of features. In: Proceedings of Odyssey Workshop, pp. 157–162. ISCA (2001) 25. Moattar, M., Homayounpour, M.: A review on speaker diarization systems and approaches. Speech Commun. 54(10), 1065–1103 (2012) 26. Mohammadi, S., Sameti, H., Langarani, M., Tavanaei, A.: KNNDIST: a nonparametric distance measure for speaker segmentation. In: Proceedings of INTERSPEECH, pp. 2282–2285. ISCA (2012) 27. NIST: Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation plan. Report RT-06S, National Institute of Standards and Technology, Spring 2006 28. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: Proceedings of ICASSP, pp. 5206–5210. IEEE (2015) 29. Povey, D., et al.: The Kaldi speech recognition toolkit. In: Proceedings of Workshop ASRU, Waikoloa Village, HI, p. 4. IEEE (2011) 30. Rohlicek, J., et al.: Gisting conversational speech. In: Proceedings of ICASSP, vol. 2, pp. 113–116. IEEE (1992) 31. Schindler, C., Draxler, C.: Using spectral moments as a speaker specific feature in nasals and fricatives. In: Proceedings of INTERSPEECH, pp. 2793–2796. ISCA (2013) 32. Shoup, J.: Phonological aspects of speech recognition. In: Lea, W. (ed.) Trends in Speech Recognition, pp. 125–138. Prentice-Hall, Englewood Cliffs (1980) 33. Siegler, M., Jain, U., Raj, B., Stern, R.: Automatic segmentation, classification and clustering of broadcast news audio. In: Proceedings of DARPA Speech Recognition Workshop, pp. 97–99. DARPA (1997) 34. Siu, M.H., Yu, G., Gish, H.: An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers. In: Proceedings of ICASSP, vol. 2, pp. 189–192. IEEE (1992) 35. Soldi, G., Bozonnet, S., Alegre, F., Beaugeant, C., Evans, N.: Short-duration speaker modelling with phone adaptive training. In: Proceedings of Odyssey Workshop, pp. 208–215. ISCA (2014) 36. S¨ onmez, M., Heck, L., Weintraub, M.: Speaker tracking and detection with multiple speakers. In: Proceedings of EUROSPEECH, pp. 2219–2222. ISCA (1999) 37. Stivers, T., et al.: Universals and cultural variation in turn-taking in conversation. Proc. Natl. Acad. Sci U.S.A. 106(26), 10587–10592 (2009) 38. Sugiyama, M., Murakami, J., Watanabe, H.: Speech segmentation and clustering based on speaker features. In: Proceedings of ICASSP, vol. 2, pp. 395–398. IEEE (1993) 39. Takagi, K., Itahashi, S.: Segmentation of spoken dialogue by interjections, disfluent utterances and pauses. In: Proceedings of ICSLP, pp. 697–700. ISCA (1996) 40. Valente, F., Wellekens, C.: Scoring unknown speaker clustering: VB vs. BIC. In: Proceedings of ICSLP, pp. 593–596. ISCA (2004)
41. Vi˜ nals, I., Villalba, J., Ortega, A., Miguel, A., Lleida, E.: Bottleneck based frontend for diarization systems. In: Abad, A., et al. (eds.) IberSPEECH 2016. LNCS (LNAI), vol. 10077, pp. 276–286. Springer, Cham (2016). https://doi.org/10.1007/ 978-3-319-49169-1 27 42. Wang, G., Wu, X., Zheng, T.: Using phoneme recognition and text-dependent speaker verification to improve speaker segmentation for Chinese speech. In: Proceedings of INTERSPEECH, pp. 1457–1460. ISCA (2010) 43. Wilcox, L., Chen, F., Kimber, D., Balasubramanian, V.: Segmentation of speech using speaker identification. In: Proceedings of ICASSP, vol. 1, pp. 161–164. IEEE (1994) 44. Yella, S., Motl´ıcek, P., Bourlard, H.: Phoneme background model for information bottleneck based speaker diarization. In: Proceedings of INTERSPEECH, pp. 597– 601. ISCA (2014) 45. Yella, S., Stolcke, A., Slaney, M.: Artificial neural network features for speaker diarization. In: Proceedings of SLT Workshop, pp. 402–406. IEEE (2014) 46. Zˆ ao, L., Coelho, R.: Colored noise based multicondition training technique for robust speaker identification. IEEE Signal Process. Lett. 18(11), 675–678 (2011) 47. Zibert, J., Mihelic, F.: Prosodic and phonetic features for speaker clustering in speaker diarization systems. In: Proceedings of INTERSPEECH, pp. 1033–1036. ISCA (2011)
Speaker Diarization: A Top-Down Approach Using Syllabic Phonology
Erik Edwards1(B), Amanda Robinson1, Najmeh Sadoughi1, Greg P. Finley1, Maxim Korenevsky1, Michael Brenndoerfer2, Nico Axtmann3, Mark Miller1, and David Suendermann-Oeft1
1 EMR.AI Inc., San Francisco, CA, USA
[email protected]
2 University of California Berkeley, Berkeley, CA, USA
3 DHBW, Karlsruhe, Germany
Abstract. A top-down approach to speaker diarization is developed using a modified Baum-Welch algorithm. The HMM states combine phonemes according to structural positions under syllabic phonological theory. By nature of the structural phonology, there are at most 16 states, and the transition matrix is sparse, allowing efficient decoding to structural phones. This addresses the issue of phoneme specificity in speaker diarization – that speaker similarities/differences are confounded by phonetic similarities/differences. We address this here without the expensive use of a complete set of individual phonemes. The voice activity detection (VAD) issue is likewise addressed, giving a new approach to VAD.
Keywords: Speaker diarization · Speech activity detection · Syllable
1 Introduction
When attempting the "who spoke when" question, i.e. speaker diarization, one must use features that distinguish different speakers of the dialog. These distinctions are confounded by phonemic differences, which are ultimately irrelevant to the labeling of speaker turns. This is the opposite of the situation in automatic speech recognition (ASR), where phone identities must be labeled, and speaker differences ignored. The problem in ASR is that of "speaker adaptation", whereas the problem in speaker diarization is sometimes referred to as "phoneme specificity" or "phone adaptive training". We present here a novel speaker diarization system that addresses the problem of phoneme specificity, while remaining highly computationally efficient. The earliest approaches to diarization used a "bottom-up" approach of agglomerative clustering of feature vectors of different frames [14]. These are also called "unsupervised" in the sense that they require no labeled training data [35]. These approaches have remained heavily used in the literature [2,3]. Later systems began to introduce "top-down" approaches in combination with the bottom-up methods [12,37,40], but these require labeled training data. In
fact, the first such paper [37] was also the first to introduce synthetic dialog data for training purposes. Another early top-down approach [40] was the first to use HMM models with Baum-Welch training (although not as here, where we use it at diarization time). We first tried the bottom-up approach, where we found the issue of phoneme specificity to be strongly confounding. That is, for example, two fricatives from different speakers can be highly similar in their acoustic features, while a fricative and a vowel from the same speaker can be highly dissimilar. A number of papers have now addressed the problem of phoneme specificity/adaption in speaker diarization [4,5,18,31,36,39,42,43]. This issue is also well known in the larger literature on speaker recognition and verification [8,17]. We therefore abandoned the bottom-up approach in favor of the top-down approach presented here. This required a reliable set of training data, wherein both speaker labels and phone labels are available (since we desire to study phoneme specificity). Therefore, we also introduced our own synthetic corpus (Sect. 2, and described fully in the companion paper). Our motivating application is the segmentation of doctor-patient dialogs, where the diarization is followed by ASR and information extraction [9]. Therefore, several of our basic decisions were guided by this application. First, the ASR stage requires MFCC features [7], so we attempt speaker diarization with the same MFCC features, but supplemented with a small number of auxiliary features. Second, we focus on the case of 2-speaker dialogs, which covers the great majority of doctor-patient encounters (although our approach is easily generalizable to 3+ speakers). Third, the issue of overlapped speech is less problematic in doctor-patient dialogs, because it is a situation where both members of the dialog have a high motivation to listen and to respect speaker turn taking. Other than yes/no responses, most medically critical information is delivered in longer turns with little or no overlap. Therefore, for our first system presented here, the focus is entirely on correct labeling of speaker identity, but not necessarily on refining the exact edges of speaker turns. In our system, each speaker-turn segment is submitted to the ASR stage with some leading/trailing audio anyway, so we have adopted the most typical “collar” used in diarization publications, which is 250 ms. The “collar” is a region around the segment boundaries that is ignored for computing the diarization error rate (DER) [1]. Finally, our system must operate in real-time, so there is a strong focus here on remaining computationally efficient at the time of diarization.
2 Synthetic Diarization Corpus
Doctor-patient dialogs are not freely available for diarization research. Existing data sets for diarization contain many speakers (e.g. meetings with 4 to 10+ speakers); or seem particular to a given situation or audio channel; or have speaker turns labeled, but not phonemic segmentations; or lack a large quantity of training data in addition to test sets; or cannot be obtained freely for general use. Therefore, we have developed a synthetic corpus as a basic starting point for diarization research, utilizing the open-source LibriSpeech corpus [27]. This synthetic corpus (Table 1) is described fully in the companion paper.
Table 1. Corpus of synthetic LibriSpeech dialogs.

            Dialogs  Utts (Turns)  Tokens  Hours
Train           292         28522  989715  98.15
Dev-clean        48          2673   53765   4.98
Dev-other        45          2822   50227   4.69
Test-clean       43          2605   52279   5.07
Test-other       45          2861   51305   4.85

3
New Lexicon with Syllabic Phonology
The concept of the syllable has a long tradition in linguistics, dating at least to the ancient Greek συλλαβη and Latin syllaba [16,25,38]. Use of the syllable in ASR dates to one of the earliest systems [26], and has recurred many times since [11,20]. However, syllabic approaches have consistently remained outside of the mainstream of ASR, and have been used only very rarely in speaker recognition [23,24,34]. We know of no syllable-based work in the speaker diarization or VAD literatures. One contributing factor may be the absence of a lexicon from which syllabic segmentations can be obtained directly. There is no simple method for obtaining syllabifications from ARPAbet-based lexicons [33], such as the widely-used CMUdict [28]. We have therefore developed an English lexicon utilizing syllabic phonology. For present purposes, this essentially means that each phoneme is assigned a structural position (i.e. Affix, Onset, Peak, Coda, Suffix), according to the most widely-accepted phonological theory [10,15,19,32]. The immediate practical motivation for introducing syllabic positions into our diarization work is that we would like to address phoneme specificity without however introducing a full phoneme-based decoding (as in ASR), which would be computationally expensive. On the other hand, there are only a handful of syllabic structural positions (5-15, depending on how many sub-positions are used), and the transition matrix for the structural positions is sparse. Thus, in the above 5-position scheme, Affix can only precede Onset; Onset can only precede Peak; Coda can only follow Peak; and Suffix can only follow Coda. An English utterance is a rather predictable succession of structural positions, and a dialog simply allows these to transition between speakers. Since the vowel phones occur exclusively in the Peak position, and since vowel segments are the dominant source of speaker characteristics, the Peak segments can be primarily used to distinguish speakers. This is the original idea and motivation; the resulting system in practice is given next.
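To illustrate how constrained the resulting transition structure is, the snippet below encodes the 5-position scheme and the within-syllable successions stated above. The names and dictionary layout are our own shorthand; transitions across syllable and speaker boundaries would be added on top of this skeleton.

# Structural positions of the 5-position scheme.
POSITIONS = ["Affix", "Onset", "Peak", "Coda", "Suffix"]

# Allowed successor positions inside one syllable, following the text:
# Affix can only precede Onset; Onset can only precede Peak;
# Coda can only follow Peak; Suffix can only follow Coda.
WITHIN_SYLLABLE = {
    "Affix":  {"Onset"},
    "Onset":  {"Peak"},
    "Peak":   {"Coda"},      # a syllable may also end at the Peak (handled across syllables)
    "Coda":   {"Suffix"},    # or syllable end
    "Suffix": set(),         # always syllable-final
}

def allowed(prev_pos, next_pos):
    """True if next_pos may directly follow prev_pos within a syllable."""
    return next_pos in WITHIN_SYLLABLE[prev_pos]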
4
Diarization Method
Our speaker diarization system proceeds in two general stages: (1) Feature extraction and decorrelation/dimensionality reduction; (2) an expectation
maximization (EM) algorithm to obtain posterior probabilities of HMM states, from which the speech/silence and speaker labels are obtained. All coding was done in C. 4.1
Feature Extraction
Our full system cascades an ASR stage after diarization, so, for efficiency, we begin with the ASR acoustic features (40-D MFCCs [7]), supplemented with a small number of auxiliary features. Specifically, we append the 4-D Kaldi pitch features [13] and the 5-D VAD features of [29]. These are supplemented with Δ features, making a 98-D feature set in total. This is reduced by PCA (principal component analysis) to a 32-D output, followed by multi-class LDA (linear discriminant analysis) [41]. LDA was trained on labels defined by the 7 syllabic phone categories below, with vowels differentiated by the 251 unique speakers, giving 258 LDA labels in total (1 silence, 6 consonant, and 251 vowel labels). All results presented here use a reduced set of 12-D LDA components. Finally, we convert the 12-D LDA features to percentile units, where 128 bins were learned for each LDA feature from the training data. This allows the features to be held as char variables (the smallest data type in C) and used for direct table lookup, leading to greater computational efficiency at the time of diarization. Also, since the features are decorrelated by PCA/LDA, this allows the use of a direct (binned) probability representation, whereas GMM probability representations were found to perform worse and take > 2× longer computationally.
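The sketch below mirrors this front end with scikit-learn as a rough Python equivalent of our C implementation; it is not the implementation itself, the raw 98-D frame features are assumed to be computed elsewhere, and all function names are ours.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X_train: (n_frames, 98) stacked MFCC + pitch + VAD frame features with deltas.
# y_train: per-frame LDA labels (1 silence + 6 consonant + 251 speaker-specific vowel = 258 classes).
def fit_front_end(X_train, y_train, n_pca=32, n_lda=12, n_bins=128):
    pca = PCA(n_components=n_pca).fit(X_train)
    lda = LinearDiscriminantAnalysis(n_components=n_lda).fit(pca.transform(X_train), y_train)
    Z = lda.transform(pca.transform(X_train))
    # Learn percentile bin edges per LDA dimension so features can be stored as 1-byte indices.
    edges = [np.percentile(Z[:, d], np.linspace(0, 100, n_bins + 1)[1:-1]) for d in range(n_lda)]
    return pca, lda, edges

def to_bins(X, pca, lda, edges):
    Z = lda.transform(pca.transform(X))
    cols = [np.searchsorted(e, Z[:, d]) for d, e in enumerate(edges)]
    return np.stack(cols, axis=1).astype(np.uint8)   # one byte per LDA dimension and frame

At diarization time, to_bins yields one byte per LDA dimension and frame, which is what makes the direct table lookup of binned emission probabilities cheap.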
4.2
Modified Baum-Welch Algorithm and HMM States
The Baum-Welch algorithm is a method for iteratively estimating the parameters of an HMM [21]. As such, it is usually applied during training, and the resulting parameters are fixed at decoding time. However, here we adapt the Baum-Welch algorithm to perform diarization on test data. The training data is only used to initialize the HMM parameters, and then the modified Baum-Welch algorithm adapts to the audio file under consideration by EM iterations. The update equations of the Baum-Welch are well-known and not covered here. More importantly, we have arrived at a method of progressive untying of HMM states with successive stages of iterations, such that stage 1 essentially provides a soft VAD output, and the last stage achieves the full diarization. A recorded 2-person dialog consists of an initial segment of silence, alternating utterances of speakers 1 and 2 (with silent gaps within and between), and then a final segment of silence. The first person to speak is labeled "speaker 1" by definition, and "silence" includes any irrelevant background noise and often breath sounds. Note that initial silence is special in terms of the HMM A matrix, because the dialog must begin in this state, and this state must transition to speaker 1. However, we found no advantage in keeping the final silence as a separate state, nor in keeping within- vs. between-speaker silences separate. Thus, our HMM model has 4 overall states: (1) Speaker 1; (2) Speaker 2; (3) Initial silence;
(4) Other silence. For the B matrix (emission probabilities), all silences remain tied together in one "tie-group". Next, we split the Speaker 1 and 2 states according to syllabic phonology, in order to address phoneme specificity (see Introduction). The following split into 7 phoneme categories was found so far to perform best:
1. Prevocalic stops (B, D, G, K, P, T)
2. Prevocalic fricatives/affricates (CH, DH, F, HH, JH, S, SH, TH, V, Z, ZH)
3. Prevocalic liquids/nasals/semi-vowels (L, N, M, NG, R, W, Y)
4. Vowels (AA, AE, AH, ..., UW) (inclusive of all stress levels)
5. Postvocalic liquids/nasals/semi-vowels (L, N, M, NG, R, W, Y)
6. Postvocalic stops (B, D, G, K, P, T)
7. Postvocalic fricatives/affricates (CH, DH, F, HH, ..., Z, ZH).
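The category list translates directly into a small lookup table. The helper below is our own transcription of it; the category indices follow the numbering above, and the prevocalic/postvocalic flag is assumed to come from the syllabified lexicon of Sect. 3.

STOPS      = {"B", "D", "G", "K", "P", "T"}
FRICATIVES = {"CH", "DH", "F", "HH", "JH", "S", "SH", "TH", "V", "Z", "ZH"}
SONORANTS  = {"L", "M", "N", "NG", "R", "W", "Y"}   # liquids/nasals/semi-vowels

def phone_category(phone, prevocalic):
    """Map an ARPAbet phone to one of the 7 structural-phone categories (1-7)."""
    base = phone.rstrip("012")          # strip stress digits from vowels
    if base in STOPS:
        return 1 if prevocalic else 6
    if base in FRICATIVES:
        return 2 if prevocalic else 7
    if base in SONORANTS:
        return 3 if prevocalic else 5
    return 4                            # vowels of any stress level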
This breakdown uses the most important phonemic distinction according to syllabic positions, which is the pre- vs. postvocalic distinction. This refers to consonants which lie before vs. after the vowel within the syllable. This distinction was emphasized already by Saussure (his "explosive" vs. "implosive" consonants) [30], and by the early Haskins studies of speech (their "initial" vs. "final" consonants) [6,22]. In terms of syllabic phonology, prevocalic merges the Affix and Onset positions, postvocalic merges the Coda and Suffix positions, and vowel is the same as the Peak position. The pre- vs. postvocalic split was found to improve performance already at the VAD stage, whereas fewer distinctions (4 phone categories) and more refined distinctions (up to 15 phone categories) deteriorated performance. Thus, we proceed with the 7 structural-phone categories. These phone categories define 7 HMM states per speaker, now giving 16 HMM states total (2 silence states + 7 states per speaker). Finally, we use the traditional 3 left-to-right substates per basic state, giving a grand total of N = 48 HMM states. Note that the major purpose of the 3 substates is to provide more realistic durational modeling by the transition matrix (A). For concreteness, we list these HMM states explicitly:
– HMM States 0-2: Initial silence
– HMM States 3-5: Other silence
– HMM States 6-8: Speaker 1, prevocalic stops
– HMM States 9-11: Speaker 1, prevocalic fricatives/affricates
– HMM States 12-14: Speaker 1, prevocalic liquids/nasals/semivowels
– HMM States 15-17: Speaker 1, vowels
– HMM States 18-20: Speaker 1, postvocalic liquids/nasals/semivowels
– HMM States 21-23: Speaker 1, postvocalic stops
– HMM States 24-26: Speaker 1, postvocalic fricatives/affricates
– HMM States 27-29: Speaker 2, prevocalic stops
– HMM States 30-32: Speaker 2, prevocalic fricatives/affricates
– HMM States 33-35: Speaker 2, prevocalic liquids/nasals/semivowels
– HMM States 36-38: Speaker 2, vowels
– HMM States 39-41: Speaker 2, postvocalic liquids/nasals/semivowels
– HMM States 42-44: Speaker 2, postvocalic stops
– HMM States 45-47: Speaker 2, postvocalic fricatives/affricates
The HMM A matrix, representing transition probabilities between these states, is learned once from the training data. Importantly, we do not update the A matrix during the modified Baum-Welch iterations. This is the most time-consuming update computation, and it has negligible consequences for diarization. Moreover, it was found to be better to sparsify the A matrix by setting direct (0-ms lag) Speaker 1 to Speaker 2 transitions to 0. The HMM B matrices, representing emission probabilities for each state, are first learned from the training data, and then updated with each iteration of the Baum-Welch during diarization. However, it is common practice to tie HMM states so that their emission probabilities are estimated jointly. This is particularly important if there is too little data. Moreover, most diarization systems begin with a VAD stage (speech vs. silence) before making the more refined distinctions for diarization. An important result of our preliminary investigations was that the B matrices are best updated with strong ties across states initially, followed by progressive untying of the states towards the final diarization. We arrived at a 3-stage procedure, wherein the first stage uses only 8 tie groups (silence plus the 7 structural-phone categories), the last stage leaves most states untied, and the middle stage uses an intermediate degree of tying. Specifically, using the 48 HMM states enumerated above, the following 3 stages of state tie groups were found to work best:
STAGE 1 TYING OF B MATRIX:
– TIE-GROUP 0 == HMM States 0-5 (Silence)
– TIE-GROUP 1 == HMM States 6-8, 27-29 (Prevocalic stops)
– TIE-GROUP 2 == HMM States 9-11, 30-32 (Prevocalic fricatives)
– TIE-GROUP 3 == HMM States 12-14, 33-35 (Prevocalic liquids/nasals)
– TIE-GROUP 4 == HMM States 15-17, 36-38 (Vowels)
– TIE-GROUP 5 == HMM States 18-20, 39-41 (Postvocalic liquids/nasals)
– TIE-GROUP 6 == HMM States 21-23, 42-44 (Postvocalic stops)
– TIE-GROUP 7 == HMM States 24-26, 45-47 (Postvocalic fricatives)
It can be seen that no distinction is made in Stage 1 between speakers. This is therefore a speech vs. silence stage, except that speech has been expanded into the 7 structural-phone categories. This is, in fact, a new method of VAD, with soft (posterior probability) outputs. These are then used to initialize Stage 2 of the Baum-Welch iterations, where only the vowels are used to begin the separation of speakers. Thus, TIE-GROUP 4 of Stage 1 is split into 2 tie-groups in Stage 2.
STAGE 2 TYING OF B MATRIX:
– TIE-GROUP 0 == HMM States 0-5 (Silence)
– TIE-GROUP 1 == HMM States 6-8, 27-29 (Prevocalic stops)
– TIE-GROUP 2 == HMM States 9-11, 30-32 (Prevocalic fricatives)
– TIE-GROUP 3 == HMM States 12-14, 33-35 (Prevocalic liquids/nasals)
– TIE-GROUP 4 == HMM States 15-17 (Speaker 1 Vowels)
– TIE-GROUP 5 == HMM States 36-38 (Speaker 2 Vowels)
– TIE-GROUP 6 == HMM States 18-20, 39-41 (Postvocalic liquids/nasals)
– TIE-GROUP 7 == HMM States 21-23, 42-44 (Postvocalic stops)
– TIE-GROUP 8 == HMM States 24-26, 45-47 (Postvocalic fricatives)
It should be kept in mind that speaker distinctions are most usefully obtained from vowels. A major purpose of the consonant categories is just to separate them out from the vowels, so as not to contaminate the acoustic evidence provided during the vowel states. Consonants also provide some degree of power to distinguish speakers, but we leave these states tied across speakers until the final iterations, in order not to interfere. Experiments showed that all 3 of these stages (and in this order of coarse-to-refined) were necessary to achieve the best performance. 8 EM iterations per stage were used for all results here. Following the 24 EM iterations of the 3-stage Baum-Welch algorithm, the posterior probabilities are summed across all Speaker 1 states, all Speaker 2 states, and all Silence states. By this method, it is not important if the algorithm has perfectly separated the various consonant categories, because they are all summed together with the vowel states for each speaker. The final diarization label is taken as the maximum of these three probabilities for each time frame.
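The final decision step is compact enough to state in a few lines. The sketch below assumes gamma holds the per-frame posteriors of the 48 states in the order enumerated above; the grouping and argmax follow the description in the text, and the variable names are ours.

import numpy as np

def frame_labels(gamma):
    # gamma: (n_frames, 48) state posteriors from the last EM stage.
    # States 0-5 are silence, 6-26 belong to Speaker 1, 27-47 to Speaker 2.
    scores = np.stack([gamma[:, 0:6].sum(axis=1),     # silence
                       gamma[:, 6:27].sum(axis=1),    # Speaker 1 (7 categories x 3 substates)
                       gamma[:, 27:48].sum(axis=1)],  # Speaker 2
                      axis=1)
    return scores.argmax(axis=1)                      # 0 = silence, 1 = speaker 1, 2 = speaker 2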
5
Results and Discussion
We present results for the synthetic LibriSpeech dialog corpus (Sect. 2), and for 2 recordings of doctor-actor dialogs. In the latter, a real doctor interviewed an actor playing a patient (to avoid privacy issues). The doctors were male and the patients female. Audio was recorded by a cell phone. The 2 dialogs were 6.4 min and 5.7 min in duration, and were used for test data only. All training to initialize the HMM A and B matrices was done on the synthetic corpus. For the synthetic LibriSpeech corpus, we obtain the following DERs, using a collar of 250 ms, as assessed with the widely-used md-eval-v21.pl script (from NIST). The same collar and script were used to assess the VAD error rate (VER) (Table 2).

Table 2. Results for synthetic LibriSpeech dialogs.

            Mean DER  Max DER  Mean VER  Max VER
Dev-clean      0.66%    2.44%     0.62%    2.38%
Dev-other      0.94%    3.75%     0.90%    3.75%
Test-clean     0.95%    4.45%     0.78%    4.44%
Test-other     1.18%    5.58%     1.12%    5.42%
It can be seen that, using the liberal collar of 250 ms, the algorithm can successfully detect speech (VAD) and then diarize all of the development and test files. It must be emphasized that this is by no means a guaranteed result: previous versions of our diarization methods obtained mean DERs closer to 5–10%, or worse (i.e., the early bottom-up method). Also, the present algorithm under different settings would often fail on a small subset of files, e.g. obtaining max DERs worse than 20–30%. The influential settings are: inclusion of VAD and pitch features; number of LDA components; types of phonological distinctions; type of probability model for the B matrices (e.g. GMMs performed worse); and, critically, the tying and progressive untying of HMM states during successive stages of the EM iterations. Interestingly, the majority of the observed DER is due to VER (VAD error). Thus, the grand-mean DER was 0.93% and the grand-mean VER was 0.85%, and it was common (under the liberal collar of 250 ms) to observe files with the same DER as VER, meaning that the algorithm rarely struggles to separate speaker characteristics if the stage-1 (soft VAD) outputs are accurate. In fact, some of the VAD errors obtained may be considered spurious, as breath noise is not consistently treated in the forced alignments. The results imply that future improvements should first focus on the Stage 1 VAD phase. For the live doctor-actor dialog recordings, we obtain (Table 3):

Table 3. Results for recorded doctor-actor dialogs.

            DER     VER
Dialog 1    4.06%   3.26%
Dialog 2   10.00%   9.13%
Average     7.20%   6.37%
Thus, a reasonable diarization of the real-world recordings was still obtained, despite the fact that the HMM model was trained only on synthetic data with no overlap. The LibriSpeech corpus is primarily American speech, whereas the doctor-actor dialogs here were British speech; and the recording method (cell phone) was quite different than for the training corpus. Also, the real-world dialogs contain many segments of coughing and other non-speech sounds that are not present in the training data, as well as many hesitation sounds (“umm”, “ahh”). Finally, the manual diarization of these dialogs is likely not perfect. Therefore, the average DER of 7.2% is encouraging for the applicability of the general methods reported here, although we will clearly need to obtain matched training data for the methods to fully work.
6
Summary and Conclusion
We have presented our initial speaker diarization system, with the intended application of doctor-patient dialogs. Training on a synthetic corpus, to initialize
HMM parameters, allowed successful diarization of recorded doctor-patient dialogs. The HMM parameters are updated in 3 stages of EM iterations, at the time of diarization. Emphasis was on computational efficiency, leading to a reduced Baum-Welch algorithm that omits A-matrix updates, and uses discrete (binned) probability distributions. HMM states are based on only 7 structural phones, as motivated by syllabic phonological theory, with sparse transition matrix, allowing an efficient approach to the phoneme specificity problem. The first of the 3 EM stages replaces the usual VAD stage, also improving total efficiency.
References
1. Anguera Miró, X.: Robust speaker diarization for meetings. Ph.D. thesis, Univ. Politècnica de Catalunya (2006)
2. Anguera Miró, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process. 20(2), 356–370 (2012)
3. Anguera, X., Wooters, C., Peskin, B., Aguiló, M.: Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 402–414. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482_34
4. Bozonnet, S., Vipperla, R., Evans, N.: Phone adaptive training for speaker diarization. In: Proceedings of INTERSPEECH, pp. 494–497. ISCA (2012)
5. Chen, I.F., Cheng, S.S., Wang, H.M.: Phonetic subspace mixture model for speaker diarization. In: Proceedings of INTERSPEECH, pp. 2298–2301. ISCA (2010)
6. Cooper, F., Delattre, P., Liberman, A., Borst, J., Gerstman, L.: Some experiments on the perception of synthetic speech sounds. J. Acoust. Soc. Am. 24(6), 597–606 (1952)
7. Edwards, E., et al.: Medical speech recognition: reaching parity with humans. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 512–524. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_51
8. Fakotakis, N., Tsopanoglou, A., Kokkinakis, G.: A text-independent speaker recognition system based on vowel spotting. Speech Commun. 12(1), 57–68 (1993)
9. Finley, G., et al.: An automated medical scribe for documenting clinical encounters. In: Proceedings of NAACL. ACL (2018)
10. Fudge, E.: Branching structure within the syllable. J. Linguist. 23(2), 359–377 (1987)
11. Fujimura, O.: Syllable as a unit of speech recognition. IEEE Trans. Acoust. 23(1), 82–87 (1975)
12. Gauvain, J.L., Adda, G., Lamel, L., Adda-Decker, M.: Transcribing broadcast news: the LIMSI Nov96 Hub4 system. In: Proceedings of DARPA Speech Recognition Workshop, pp. 56–63. DARPA (1997)
13. Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., Khudanpur, S.: A pitch extraction algorithm tuned for automatic speech recognition. In: Proceedings of ICASSP, pp. 2494–2498. IEEE (2014)
14. Gish, H., Siu, M.H., Rohlicek, J.: Segregation of speakers for speech recognition and speaker identification. In: Proceedings of ICASSP, vol. 2, pp. 873–876. IEEE (1991)
15. Goldsmith, J.: The syllable. In: Goldsmith, J., Riggle, J., Yu, A. (eds.) The Handbook of Phonological Theory, 2nd edn., pp. 165–196. Wiley, Malden (2011)
16. Guest, E.: A History of English Rhythms. W. Pickering, London (1838)
17. Hansen, E., Slyh, R., Anderson, T.: Speaker recognition using phoneme-specific GMMs. In: Proceedings of Odyssey Workshop, pp. 179–184. ISCA (2004)
18. Hsieh, C.H., Wu, C.H., Shen, H.P.: Adaptive decision tree-based phone cluster models for speaker clustering. In: Proceedings of INTERSPEECH, pp. 861–864. ISCA (2008)
19. Kessler, B., Treiman, R.: Syllable structure and the distribution of phonemes in English syllables. J. Mem. Lang. 37(3), 295–311 (1997)
20. Kozhevnikov, V., Chistovich, L.: Speech: articulation and perception. Translation JPRS 30543, Joint Public Research Service, U.S. Department of Commerce (1965)
21. Levinson, S., Rabiner, L., Sondhi, M.: An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62(4), 1035–1074 (1983)
22. Liberman, A., Ingemann, F., Lisker, L., Delattre, P., Cooper, F.: Minimal rules for synthesizing speech. J. Acoust. Soc. Am. 31(11), 1490–1499 (1959)
23. Martin, T., Wong, E., Baker, B., Mason, M., Sridharan, S.: Pitch and energy trajectory modelling in a syllable length temporal framework for language identification. In: Proceedings of Odyssey Workshop, pp. 289–296. ISCA (2004)
24. Mary, L., Yegnanarayana, B.: Extraction and representation of prosodic features for language and speaker recognition. Speech Commun. 50(10), 782–796 (2008)
25. Mitford, W.: An Inquiry into the Principles of Harmony in Language, and of the Mechanism of Verse, Modern and Antient, 2nd edn. L. Hansard, London (1804)
26. Olson, H., Belar, H.: Phonetic typewriter. J. Acoust. Soc. Am. 28(6), 1072–1081 (1956)
27. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: Proceedings of ICASSP, pp. 5206–5210. IEEE (2015)
28. Rudnicky, A.: CMUdict 0.7b: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2015). https://github.com/Alexir/CMUdict
29. Sadjadi, S., Hansen, J.: Unsupervised speech activity detection using voicing measures and perceptual spectral flux. IEEE Signal Process. Lett. 20(3), 197–200 (2013)
30. Saussure, F.: Cours de linguistique générale. Payot, Lausanne, Paris (1916)
31. Schindler, C., Draxler, C.: Using spectral moments as a speaker specific feature in nasals and fricatives. In: Proceedings of INTERSPEECH, pp. 2793–2796. ISCA (2013)
32. Selkirk, E.: The syllable. In: van der Hulst, H., Smith, N. (eds.) The Structure of Phonological Representations, vol. 2, pp. 337–384. Foris, Dordrecht (1982)
33. Shoup, J.: Phonological aspects of speech recognition. In: Lea, W. (ed.) Trends in Speech Recognition, pp. 125–138. Prentice-Hall, Englewood Cliffs (1980)
34. Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., Stolcke, A.: Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3–4), 455–472 (2005)
35. Siu, M.H., Yu, G., Gish, H.: An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers. In: Proceedings of ICASSP, vol. 2, pp. 189–192. IEEE (1992)
36. Soldi, G., Bozonnet, S., Alegre, F., Beaugeant, C., Evans, N.: Short-duration speaker modelling with phone adaptive training. In: Proceedings of Odyssey Workshop, pp. 208–215. ISCA (2014)
37. Sugiyama, M., Murakami, J., Watanabe, H.: Speech segmentation and clustering based on speaker features. In: Proceedings of ICASSP, vol. 2, pp. 395–398. IEEE (1993)
38. Wallis, J.: Grammatica linguae Anglicanae. L. Lichfield, Oxford (1674)
39. Wang, G., Wu, X., Zheng, T.: Using phoneme recognition and text-dependent speaker verification to improve speaker segmentation for Chinese speech. In: Proceedings of INTERSPEECH, pp. 1457–1460. ISCA (2010)
40. Wilcox, L., Chen, F., Kimber, D., Balasubramanian, V.: Segmentation of speech using speaker identification. In: Proceedings of ICASSP, vol. 1, pp. 161–164. IEEE (1994)
41. Yamada, M., Pezeshki, A., Azimi-Sadjadi, M.: Relation between kernel CCA and kernel FDA. In: Proceedings of IJCNN, pp. 226–231. IEEE (2005)
42. Yella, S., Motlíček, P., Bourlard, H.: Phoneme background model for information bottleneck based speaker diarization. In: Proceedings of INTERSPEECH, pp. 597–601. ISCA (2014)
43. Zibert, J., Mihelic, F.: Prosodic and phonetic features for speaker clustering in speaker diarization systems. In: Proceedings of INTERSPEECH, pp. 1033–1036. ISCA (2011)
Improving Emotion Recognition Performance by Random-Forest-Based Feature Selection
Olga Egorow, Ingo Siegert, and Andreas Wendemuth
Cognitive Systems Group, Otto von Guericke University, 39016 Magdeburg, Germany
[email protected]
Abstract. As the technical systems around us aim at more natural interaction, the task of automatic emotion recognition from speech receives ever growing attention. One important question still remains unresolved: the definition of the most suitable features across different data types. In the present paper, we employed a random-forest-based feature selection known from other research fields in order to select the most important features for three benchmark datasets. Investigating feature selection on the same corpus as well as across corpora, we achieved an increase in performance using only 40 to 60% of the features of the well-known emobase feature set.
Keywords: Speech emotion recognition · Feature selection · Random forest
1
Introduction
Speech is a carrier of different kinds of information – besides the pure semantic content of an utterance, there are several layers underneath [14]. In humanhuman interaction (HHI), the interlocutors try to extract this additional information, often using multiple channels – simply speaking, by listening not only to what is said but also how it is said. One such layer of information is the emotional layer – the same sentence can have different meanings depending on its emotional toning. This can be transferred to the domain of human-computer interaction (HCI) to enable computer systems to understand the emotional level in order to make HCI more natural and pleasant for the user. Unfortunately, the recent performance boost in speech recognition provided by deep learning did not improve the performance of emotion recognition alike: Although there are first attempts to implement end-to-end approaches [24], they are still in their infancy and rely on multimodal data. As long as the required massive data amounts are not yet available for audio-based emotion recognition, it is necessary to explore the existing possibilities and to look for other ways to improve the performance of current systems. One such way is the extraction and selection of the most suitable features. c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 134–144, 2018. https://doi.org/10.1007/978-3-319-99579-3_15
Since the Interspeech 2009 Emotion challenge [21], the emobase feature set (as described in detail in [10]) is often used as a go-to feature set for various acoustic recognition systems: e.g. dialogue performance [19], user state detection [8], physical pain detection [17], etc. It contains 988 features based on 19 functionals of 26 Low-Level-Descriptor (LLDs) and their deltas: Mel-Frequency Cepstral Coefficient (MFCC), Line Spectral Pairs (LSPs), intensity, fundamental frequency, and other – there are also larger versions of this set such as the 2010 emobase version and the emo large version containing 1582 and 6552 features, respectively. Besides these large feature sets, there are also relatively small ones, such as the GeMaps set [9], containing 18 LLDs (based on frequency and spectrum) and their derivatives, resulting in a total of only 62 features for the minimalistic and 88 features for the extended set. Although widely used, these sets are not perfect. So, the 988 features of emobase are often used to classify relatively small amounts of samples. The GeMaps set on the other hand, while having not as many features, does not achieve the same performance as emobase [9]. In the present study, we want to examine two questions. Our first research question is whether the emotion recognition performance achieved using the emobase feature set is the best possible, or whether the same or even better performance can be achieved with less features using a data-driven feature selection process. Our second question is whether the same features are important for different data types. To investigate these questions, we employ a Random Forest (RF)-based feature ranking procedure on three different corpora and conduct classification experiments using same-corpus as well as cross-corpus features. 1.1
Literature Review
As early as 2003, Kwon et al. have deducted that the extraction of good features is more important to the emotion recognition task than the choice of the optimal classifier [13]. The most frequently used features comprise prosodic and spectral information. One problem concerning such features is that their values depend on the individual speaker’s voice characteristics. Possible solutions are the calculation of speaker-independent features, such as the changes instead of the absolute values [15], or different normalisation methods [3]. Some research questions have already been answered: For example, it was shown that suprasegmental features perform better than segmental ones [22] or that features are not language-independent [26]. The choice of the best suitable features was also addressed in different investigations. So, Bitouk et al. used spectral features to classify emotions on two corpora and investigate the influence of different feature selection techniques, but none of the employed methods lead to clear gains [2]. Gharavian et al. presented a sophisticated feature selection approach based on fast correlation-based filters and genetic-algorithm-based optimisation to achieve 5% absolute improvement in terms of accuracy [11]. Unfortunately, the authors opted for a training and test set evaluation procedure instead of a true LeaveOne-Speaker-Out (LOSO) setting and therefore did not report on differences
between the speakers. Besides the usually employed prosodic and spectral features, there are also approaches investigating novel feature sets – for instance based on the Fourier parameters [25] and wavelets [18]. In the present study, we investigate the performance of RF-based feature selection on three benchmark emotional datasets in a LOSO setting and compare the features selected for different data types.
2
Datasets
In order to answer our research questions in as generalisable a way as possible, we employed three well-known benchmark corpora with different languages, emotion types and recording conditions. The Audiovisual Interest Corpus (AVIC) [20] is a dataset built around a product presenter in an English commercial presentation. The recordings were made in an office environment and contain three levels of interest (loi1 - loi3) as classes. The Berlin Emotional Speech Database (emoDB) [5] is a studio-recorded German dataset containing recordings of ten emotionally neutral sentences with seven emotions: anger, boredom, disgust, fear, joy, neutral, and sadness. The Speech Under Simulated and Actual Stress (SUSAS) dataset [12] contains acted and spontaneous emotional utterances of English speakers in four different conditions: neutral, medium stress, high stress and screaming. Some of the utterances also contain field noise. An overview of the details of the corpora is given in Table 1.

Table 1. Characteristics of the selected corpora.

Property       AVIC      emoDB     SUSAS
Quality        Office    Studio    Noisy
Language       English   German    English
Emotion type   Spont     Acted     Mixed
# Speakers     21 (10f)  10 (5f)   7 (3f)
# Emotions     3         7         4
# Samples      3002      535       3593

3
Feature Selection with Random Forests
In order to find the optimal number of features, we first ranked the features according to their importance for the classification task using RF. We then analysed the obtained feature rankings and compared them for different speakers of the same corpus as well as between the different corpora. In the last step, we compared the classification performance using an increasing number of features to find an optimum.
3.1
Feature Extraction
For feature extraction, we used the emobase feature set of the openSMILE toolkit mentioned above, providing 988 spectral and prosodic features extracted on utterance level (cf. [10] for details). In order to establish comparability of the features among different speakers, we standardised the data to zero mean and unit variance.
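As a concrete illustration, one way to obtain these utterance-level features and standardise them is sketched below. The paper used the openSMILE toolkit itself; the Python wrapper (opensmile), the wav_paths list and the scikit-learn scaler are our assumptions about equivalent tooling.

import opensmile
from sklearn.preprocessing import StandardScaler

# Extract the 988 emobase functionals, one row per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_files(wav_paths)   # wav_paths: list of utterance WAV files (assumed)

# Standardise to zero mean and unit variance, as described above.
X = StandardScaler().fit_transform(features.values)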
3.2
Feature Ranking
In order to select the most important features, it is necessary to rank the features according to their importance. One possibility for this is a feature ranking routine based on RF – an ensemble learning method combining a typically high number of binary decision trees [4]. In each decision tree, each node samples a random subset of features and chooses the feature that is best suited to split the data into classes based on an impurity measure (e.g. the Gini index or information gain). By iterating this process, the features can be ranked according to their ability to decrease the impurity. A detailed explanation can be found in [7, 23]. The method has been tested for several applications, for example in the field of spectroscopy analysis [16]. To realise this feature ranking procedure, we used the random forest implementation provided by KNIME [1]. The procedure consists of three steps as illustrated in Fig. 1. In the first step, a random forest containing a high number of trees with k levels each (k can be a low number since the most relevant features are close to the root) is built on the training portion of the data in order to obtain two statistical values for each feature f: the number of models M_i which use f as a split on tree level i, and the number of times T_i that f was in the feature sample for level i. Their quotient, summed over all levels, is the score S_f for each f:

S_f = \sum_{i=0}^{k} \frac{M_i}{T_i}

In a second step, a random score S_rand,f is generated by calculating the score in the same way, but now with randomly shuffled labels – this is done in order to eliminate a bias that might be contained in the data. In order to balance the influence of randomness, both S_f and S_rand,f are calculated ten times and then averaged. The new score S_new,f is then obtained in a final third step by subtracting S_rand,f from S_f: S_new,f = S_f − S_rand,f. The features are then sorted according to their final scores, the ranking indicating their importance. In order to avoid overfitting to the data, this procedure is executed in a LOSO manner: for each speaker, the feature ranking is performed only on the data of all the other speakers, excluding the data of the current speaker, which is reserved for later testing.
Fig. 1. An overview over the RF-based feature ranking procedure.
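A simplified re-implementation of this ranking is sketched below. It substitutes scikit-learn's Gini-based feature importances for the level-wise KNIME statistic S_f, but keeps the target-shuffling correction and the ten-fold averaging; as in the paper, it would be run once per held-out speaker on the remaining speakers' data. All names are ours and the result is only an approximation of the procedure actually used.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_ranking(X, y, n_trees=500, max_depth=5, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    true_score = np.zeros(X.shape[1])
    rand_score = np.zeros(X.shape[1])
    for r in range(n_repeats):
        rf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                    random_state=r).fit(X, y)
        true_score += rf.feature_importances_
        y_shuffled = rng.permutation(y)      # target shuffling to estimate a chance baseline
        rf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                    random_state=r).fit(X, y_shuffled)
        rand_score += rf.feature_importances_
    final = (true_score - rand_score) / n_repeats
    return np.argsort(final)[::-1]           # feature indices, most important first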
3.3
Comparison of Feature Rankings
One of our research questions was to investigate whether there are generally important features carrying emotional information or whether the most important features differ depending on the data. In order to answer this question we compared the feature rankings obtained on the three employed datasets and conduct several Pearson’s correlation tests – between the feature selection rankings of different speakers of the same corpus for intra-corpus comparison as well as between the feature selection rankings of different corpora for inter-corpus comparison. Intra-corpus Comparison. In order to test whether the feature rankings are consistent for all speakers within a corpus, we compared the LOSO rankings by conducting Pearson’s correlation tests. For AVIC, the Pearson’s correlation coefficient r between the feature rankings of the individual speakers lies between 0.95 and 0.98 (r = 0.97 ± 0.008), leading to the conclusion that the feature rankings of the speakers are very similar. Our idea was now to construct an average feature ranking for the whole corpus by averaging the feature rankings over all speakers, FAV IC . Naturally, the Pearson’s correlation between FAV IC and the feature rankings of the individual speakers is just as high as between the speakers, with values between 0.96 and 0.99 (r = 0.98 ± 0.008). The LLDs occurring most frequently in the top 100 are illustrated in Fig. 2a. For EmoDB, we implemented the same procedure. Here the correlations between the speakers are about as high as for AVIC, with r values between 0.95 and 0.98 (r = 0.98 ± 0.01) indicating that the feature rankings are consistent. Also, in the same way as for AVIC, we constructed a new average feature ranking FEmoDB . Again, r between FEmoDB and the feature rankings of the individual speakers is between 0.97 and 0.99 (r = 0.99 ± 0.006). The LLDs occurring most frequently in the top 100 are illustrated in Fig. 2b.
Fig. 2. Word clouds of the LLDs most frequently occurring in the top 100 of the feature rankings for (a) AVIC, (b) EmoDB and (c) SUSAS. The LLDs occurring for all three corpora are written in red. (Color figure online)
Finally, we repeated this procedure for SUSAS. The correlations between the feature rankings of the individual speakers are slightly lower than for EmoDB, with r values between 0.87 and 0.96 (r = 0.92 ± 0.03) but still sufficiently high to conclude that the feature rankings are consistent. The correlations between the average feature ranking FSU SAS and the individual rankings are between 0.92 and 0.98 (r = 0.96 ± 0.02). The LLDs occurring most frequently in the top 100 are illustrated in Fig. 2c. Inter-corpus Comparison. In the second step of our analysis, we compared the inter-corpus results in order to find whether the feature rankings are similar between the different types of data used. For this, we calculated the Pearson’s correlation coefficients between the previously constructed average feature rankings FemoDB , FSU SAS and FAV IC . In contrast to the intra-corpus comparison presented above, the results lead to the conclusion that there are no correlations between the feature rankings of the different corpora. For the correlation between FEmoDB and FAV IC , the r value is 0.18. For the correlation between FEmoDB and FSU SAS , r is even lower, 0.14. For FSU SAS and FAV IC , r is negative, −0.07. These results are shown in Fig. 2: There are only two LLDs shared by all three datasets (MFCC[5]and its derivative as well as the derivative of MFCC[10]). This means that, unfortunately, the feature rankings are not universally transferable for different types of data. However, there are similarities – different MFCCs seem to be the most important features, since they occur relatively often in the top 100 features for all three datasets. 3.4
Selecting the Optimal Number of Features
In the next part, we searched for an optimal number of features for each of the corpora. For this, we classified the data using an increasing number of features, starting with 50 features with the highest RF-scores and then consecutively adding 50 more features with decreasing scores in each step, until we reached the full 988 emobase feature set. In order to avoid overfitting, we again used a LOSO validation setting. For each feature subset, we calculated the Unweighted Average Recall (UAR) over all classes and speakers. The UARs achieved during this optimisation procedure are shown in Fig. 3. Here, AVIC and EmoDB show
Fig. 3. The UAR of the classification performance on the three datasets (AVIC, EmoDB, SUSAS) depending on the number of selected features (50 to 950). The results achieved using the full number of features (82.67%, 57.08% and 52.01%) are indicated by the dashed lines.
similar results: after starting with a rather low UAR value for low numbers of features, the UAR rises rapidly and stays at a stable value. However, for SUSAS the number of features seems to have less influence, since the UAR does not change as much as for the other two corpora.
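The sweep itself can be written compactly. The sketch below assumes per-speaker data splits, a precomputed LOSO ranking for each held-out speaker, and scikit-learn's SVC with default parameters in place of LibSVM; UAR is computed as macro-averaged recall, and the function and variable names are ours.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def sweep_feature_counts(data, rankings, steps=range(50, 1000, 50)):
    """data: dict speaker -> (X, y); rankings: dict speaker -> ranked feature indices
    learned without that speaker (LOSO). Returns mean UAR per feature count."""
    results = {}
    for k in steps:                                       # 50, 100, ..., 950 (append 988 for the full set)
        uars = []
        for spk, (X_test, y_test) in data.items():
            idx = rankings[spk][:k]                       # top-k features for this fold
            X_train = np.vstack([X for s, (X, _) in data.items() if s != spk])[:, idx]
            y_train = np.concatenate([y for s, (_, y) in data.items() if s != spk])
            clf = SVC().fit(X_train, y_train)             # default parameters, as in the paper
            y_pred = clf.predict(X_test[:, idx])
            uars.append(recall_score(y_test, y_pred, average="macro"))  # UAR
        results[k] = float(np.mean(uars))
    return results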
4
Classification Using Previously Selected Features
After selecting the optimal number of features, we conducted classification experiments in order to evaluate and compare the performance of the selected features to the full emobase feature set. 4.1
Classification Setup
For the classification, we again implemented the LOSO procedure as described above. Since we obtained between 7 and 21 models for each corpus, we decided against parameter fine-tuning and employed default Support Vector Machine (SVM) parameters as provided by the LibSVM library [6]. For evaluation, we computed the unweighted average f-measure (UAF) as the harmonic mean of the unweighted average recall and precision over all classes of one speaker, and then the unweighted average over all speakers. In order to include variations over speakers, we report the average values as well as the standard deviation as performance measures.
4.2
Classification Performance
The classification results are shown in Fig. 4 – we report the classification performance for each dataset, the baseline performance using all 988 emobase features and the performance using the previously selected features. Furthermore, we also report the results using cross-corpus feature selection. For this, we performed the
Fig. 4. The UAF of the classification performance for the emobase feature set FE, the best feature selection set FS, the cross-corpus feature set with the lower correlation FCC1 and with the higher correlation FCC2, for AVIC, emoDB and SUSAS.
classification on one dataset using the feature set obtained on another one. Since we used three corpora, this procedure results in two additional values per corpus: FCC1 denotes the results using the feature set with the lower correlation coefficient (as obtained in Sect. 3), FCC2 the results with the higher correlation coefficient. The classification with feature selection outperforms the classification using the full emobase feature set for all three corpora by several percent absolute – but the improvements lie within the standard deviation of the average values of the speakers. However, the results show that for all three corpora, a performance improvement can be achieved using between 40 and 60% less features than the original feature set. This is an interesting finding since feature extraction as well as classification are resource-intensive tasks, where a reduction of the processing overhead can be a real benefit – for example in the domain of mobile applications. Regarding the performance of the different feature sets across corpora, we can observe that the results are almost as expected: except for SUSAS, the “alien” feature sets obtained by feature selection on another corpus do not perform as good as the one obtained on the same corpus. Furthermore, FCC2 outperforms FCC1 in all cases (albeit marginally as for emoDB), which corresponds to the higher correlation between FCC2 and FS compared to FCC1 and FS . The only exception is SUSAS, where the FCC2 works about 0.7% better than FS . Based on these results, we can conclude that RF-based feature selection is a viable method to improve emotion recognition performance for different types of data.
5
Conclusion
The first question we aimed to investigate in this study was whether the number of features used for emotion recognition can be reduced achieving the same or even better performance. We have shown that by applying RF-based feature selection, we can reduce the number of features roughly by half and obtain an even better performance than using the full emobase set – furthermore, by using
three different corpora we have shown that this result is independent of the type of emotions, language and recording conditions. The second research question was whether there are inter-corpus similarities in the selected features. Here our finding is that the most important features are not consistent over different corpora, and therefore the feature selection needs to be done for each emotion recognition task separately. However, different MFCCs are among the most important features of all three corpora indicating that there is a common ground of acoustic information. There are two main directions for further research. The first interesting question here is to investigate further feature sets – besides larger versions of the emobase feature set including up to 6552 features also novel and less frequently used features such as the Fourier parameters and wavelet-based features are of interest. The second open question is to consolidate feature classes according to the type of material used – in this investigation, we have seen that features important for EmoDB differ from those for AVIC. The question is whether these differences are based on the type of emotions, on the emotional classes, on the recording conditions, or on some still unknown factors. This needs to be further investigated in order to understand the relations between the features and the information on the emotional status of the speaker contained in them. Acknowledgements. This work has been sponsored by the German Federal Ministry of Education and Research in the program Zwanzig20 – Partnership for Innovation as part of the research alliance 3Dsensation (grant number 03ZZ0414). It was also supported by the project Intention-based Anticipatory Interactive Systems (IAIS) funded by the European Funds for Regional Development (EFRE) and by the Federal State of Sachsen-Anhalt, Germany (grant number ZS/2017/10/88785).
References
1. Berthold, M.R., et al.: KNIME: The Konstanz information miner. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds.) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78246-9_38
2. Bitouk, D., Verma, R., Nenkova, A.: Class-level spectral features for emotion recognition. Speech Commun. 52(7–8), 613–625 (2010)
3. Böck, R., Egorow, O., Siegert, I., Wendemuth, A.: Comparative study on normalisation in emotion recognition from speech. In: Horain, P., Achard, C., Mallem, M. (eds.) IHCI 2017. LNCS, vol. 10688, pp. 189–201. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72038-8_15
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of the INTERSPEECH-2005, pp. 1517–1520 (2005)
6. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. Trans. Intell. Syst. Technol. 2, 1–27 (2011)
7. Chen, Y.W., Lin, C.J.: Combining SVMs with various feature selection strategies. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications, pp. 315–324. Springer, Berlin Heidelberg (2006). https://doi.org/10.1007/978-3-540-35488-8_13
8. Egorow, O., Wendemuth, A.: Detection of challenging dialogue stages using acoustic signals and biosignals. In: Proceedings of the 24th International Conference on Computer Graphics, Visualization and Computer Vision, pp. 137–143 (2016)
9. Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. Trans. Affect. Comput. 7(2), 190–202 (2016)
10. Eyben, F., Wöllmer, M., Schuller, B.: OpenEAR - introducing the Munich open-source emotion and affect recognition toolkit. In: Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 1–6. IEEE (2009)
11. Gharavian, D., Sheikhan, M., Nazerieh, A., Garoucy, S.: Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Comput. Appl. 21(8), 2115–2126 (2012)
12. Hansen, J., Bou-Ghazale, S.: Getting started with SUSAS: A speech under simulated and actual stress database. In: Proceedings of the EUROSPEECH-1997, pp. 1743–1746 (1997)
13. Kwon, O.W., Chan, K., Hao, J., Lee, T.W.: Emotion recognition by speech signals. In: Proceedings of the 8th European Conference on Speech Communication and Technology (2003)
14. Levinson, S.C., Holler, J.: The origin of human multi-modal communication. Phil. Trans. R. Soc. B 369(1651), 20130302 (2014)
15. Mao, Q., Zhao, X., Zhan, Y.: Extraction and analysis for non-personalized emotion features of speech. Adv. Inf. Sci. Serv. Sci. 3(10), 255–263 (2011)
16. Menze, B.H., et al.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)
17. Oshrat, Y., Bloch, A., Lerner, A., Cohen, A., Avigal, M., Zeilig, G.: Speech prosody as a biosignal for physical pain detection. In: Proceedings of Speech Prosody, pp. 420–424 (2016)
18. Palo, H.K., Mohanty, M.N.: Wavelet based feature combination for recognition of emotions. Ain Shams Eng. J. (2017, in Press)
19. Ramanarayanan, V., et al.: Using vision and speech features for automated prediction of performance metrics in multimodal dialogs. ETS Research Report Series 1 (2017)
20. Schuller, B., Müller, R., Hörnler, B., Höthker, A., Konosu, H., Rigoll, G.: Audiovisual recognition of spontaneous interest within conversations. In: Proceedings of the 9th International Conference on Multimodal Interfaces, pp. 30–37. ACM (2007)
21. Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun. 53(9–10), 1062–1087 (2011)
22. Schuller, B., Wöllmer, M., Eyben, F., Rigoll, G.: The role of prosody in affective speech, linguistic insights, studies in language and communication. Lang. Commun. 97, 285–307 (2009)
23. Silipo, R., Adae, I., Hart, A., Berthold, M.: Seven techniques for dimensionality reduction. Technical report, KNIME (2014)
24. Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B.W., Zafeiriou, S.: End-to-end multimodal emotion recognition using deep neural networks. J. Sel. Top. Signal Process. 11(8), 1301–1309 (2017)
25. Wang, K., An, N., Li, B.N., Zhang, Y., Li, L.: Speech emotion recognition using Fourier parameters. Trans. Affect. Comput. 6(1), 69–75 (2015)
26. Yang, C., Ji, L., Liu, G.: Study to speech emotion recognition based on TWINsSVM. In: Proceedings of the 5th International Conference on Natural Computation, vol. 2, pp. 312–316. IEEE (2009)
Coherence Understanding Through Cohesion Markers: The Case of Child Spoken Language
Polina Eismont (1), Vladislav Metelyagin (2), and Elena Riekhakaynen (2)
(1) Saint Petersburg State University of Aerospace Instrumentation, Bolshaya Morskaya Street 67, 190000 St. Petersburg, Russia, [email protected]
(2) Saint-Petersburg State University, Universitetskaya Emb. 7/9, 199034 St. Petersburg, Russia, [email protected], [email protected]
Abstract. Coherence and cohesion are crucial for organizing text semantics and syntax. They both may be described in terms of topic-focus structure, but the surface syntactic topic-focus structure does not coincide with that of deep semantics, and the automatic analysis of coherence which refers to the meaning of the whole text is complicated. The paper presents a Topic-Focus Annotating Parser (TFAP) that was trained on the corpus of Russian unprepared child oral narratives (213 narratives elicited by native Russian children aged from two years seven months to seven years six months). According to the results, children develop their narrative skills both in coherence and cohesion, but at the earlier stages of language acquisition, parsing errors reflect the speaker’s low level of narrative skills, while at the later stages (from five years seven months to seven years six months), when the basic rules of narrative organization are already acquired, parsing errors may be caused by the deficiencies of the parser. The topic-focus schemes we obtained support Leonid Sakharny’s theoretical approach to cognitive representation of coherence. Keywords: Child language Topic-Focus Annotating Parser Coherence Spoken narrative
Cohesion
1 Introduction The study of spoken language processing has always been a challenge for linguists primarily due to the methodological difficulties with obtaining the data (see, for example, [1] for an overview). Researchers have been describing and discussing phonetic, grammatical, lexical, and pragmatic aspects of spoken Russian since the end of the 1960s. However, we still do not have an accurate aggregate picture of how a speaker and a listener process spoken Russian. One of the ways to accumulate different aspects of spoken discourse is to study the multi-layered structure of oral narratives. The semantic meaning of a narrative (coherence) may be understood only through its surface syntactic structures (cohesion). Cohesion and coherence have been described as the basic principles of textuality [2, 3]. We will discuss current theoretical approaches to © Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 145–154, 2018. https://doi.org/10.1007/978-3-319-99579-3_16
these two sides of text organization and some background studies of coherence and cohesion resolution in Natural Language Processing in Sect. 2 of the paper. Children acquire narrative skills gradually, and both coherence and cohesion are difficult for them at the early stages of first language acquisition [4, 5]. The development of coherence and cohesion starts at the age of 4–5 years and continues up to secondary school, but narratives become coherent by the age of 7–8 years. In our study, we will analyse oral narratives elicited by native Russian children aged from four years seven months to seven years six months (for space considerations, we will hereinafter indicate the age of the participants as follows: Years;Months: e.g., 4;7–7;6); the data will be described in Sect. 3. Computer modelling is one of the methods to verify a theoretical approach [6]. In Sect. 4 of the paper, we will provide the results of an automatic analysis of children’s narratives that corresponds to the wide and narrow subject complexes proposed by Sakharny. Section 5 is for conclusions and prospects of the study.
2 Coherence and Cohesion Coherence and cohesion represent the two sides of a text as a linguistic sign. Coherence is defined as: “the way in which the content of connected speech or text hangs together, or is interpreted as hanging together, as distinct from that of random assemblages of sentences” [7]. It reflects the subject of a story, the logical structure of a narration and organizes the links between characters and events. The other side – cohesion – sets up the surface structure of a text bounding the sentences together by referencing the characters, some situational objects, time and space of the events. The topic-focus analysis of a text structure was proposed by Daneš in 1974 [8] and van Dijk in 1977 [9]. They both suggested that topics and focuses of different utterances link to each other and create a coherent text – a sequence of sentences. Van Dijk argued that topics function as the glue for the whole text and we can calculate the text topic from the topics of the sentences it consists of. According to Daneš, the so called functional perspective of a text is dynamic and develops within the text depending on speakers’ intentions and their interpretation of the denoted situation. He suggested three types of thematic progressions: a simple linear progression (the rheme of the previous utterance becomes the topic of the following one); a continuous theme progression (the theme remains the same for several consequent utterances), and a progression with derived themes that emerged from a ‘hypertheme’. All these progressions may combine and run into one another. Sakharny developed Daneš’s ideas in [10] trying to regard text coherence from the cognitive point of view. He suggested that the thematic progressions not only are a specific feature of a text, but also represent the way we think about the subject of a story. He described the so called wide and narrow subject complexes that he understood as the topic-focus structures of coherence. The surface topic-focus structure reflects the deep topic-focus structure of coherence, that is the structure of the speakers’
vision of the meaning of the narrative. The basic structure of each complex is a simple existential structure ‘there is X’, which is later included into a more complicated structure of X and its attributes. These complicated structures of different “X-s” and their attributes may interact with each other and form such subject complexes as a bush structure (one subject has many attributes), a chain structure (the attribute of a subject becomes the subject of the following structure), and combined structures. Unlike in the theory proposed by Daneš, Sakharny’s wide subject complexes reflect the cognitive structure of coherence and are not explicit in the text itself. The development of computer linguistics has risen the questions of an automatic analysis of coherence. Different researchers have analysed written corpora and proposed possible solutions [11–13]. However, transferring cohesive markers into coherent structures still remains problematic. Several attempts have been made in automatic analysis of child oral narratives, but the developers have chosen inductive methods and did not apply cohesive markers for narrative analysis [14]. There are few different ways to provide cohesion: referencing and coreferencing (by means of anaphors, pronouns or various lexical nominations and paraphrasing), syntactic structure changes (e.g., ellipsis or inversion), grammatical iterations. Kibrik attempted to study the text reference in discourse in [15]. He suggested a complex method of analysis that not only involves the verbal component of communication, but also considers gestures, mimics and cognitive mechanisms. The author argues that referencing is one of the universal cognitive mechanisms. It depends on such parameters as antecedent’s actualization and language typological features. For example, Kibrik considers Russian pronominal system to be difficult for unambiguous interpretation, as pronouns are usually the only source of reference information. Reference conflicts that may occur often enough and require a big set of resolution tools are nonetheless peripheral and normally can be easily solved by any communicant in natural situation. But this question becomes highly important if we try to create an automatic tool for parsing oral speech. The multi-layered structure of an oral narrative includes among others the phonetic level. Numerous studies have shown that the boundaries between semantic-syntactic units and phonetic ones do not necessarily coincide in casual speech [16, 17]. However, recent experimental data from casual Russian provides evidence that a listener tends to ignore “irregular” pausation giving preference to the semantic-syntactic relations within an utterance, intonation serving as a complementary source of information [18]. Thus, we assume that at least preliminary testing of the automatic topic-focus analysis can be performed excluding the phonetic level.
3 Data

The corpus "KONDUIT" (KOrpus Nepodgotovlennyh Detskih Ustnyh Izvlechennyh Tekstov – Corpus of Child Unprepared Elicited Oral Narratives; [19]) comprises 213 unprepared narratives (or quasi-narratives) elicited during a series of experiments with native Russian-speaking children aged 2;7–7;6. The children were divided into 5 age groups, and 3 different experimental designs were suggested depending on the cognitive development of the children [20].
The experiment with the youngest children (aged 2;7–3;6) was conducted in a game format. Two experiment assistants manipulated 4 glove puppets performing different actions that can be described using verbs of 14 different semantic classes, e.g. verbs of motion, verbs of communication, emotional verbs, verbs of perception, etc. The second group (children aged 3;6–4;6) had to retell a picture book, "Three Kittens" (by Vasily Suteev), consisting of 15 pictures and telling the story of three small kittens who try to catch a mouse, hunt a frog, and catch a fish, but each escapes, and the three upset kittens return home wet, tired and hungry. The three oldest age groups retold a cartoon about a kitten who makes a mess at home and goes for a walk. He meets rabbits, beavers and a bear-cub, but no one wants to play with him. All the characters (both in the picture book and in the cartoon) perform different, clearly identifiable actions that can be described using verbs of the same semantic classes that were expected in the experiment with the youngest age group. Thus, despite some differences in the experimental designs, the narratives elicited by the children of all 5 age groups may be compared for the use of the same verbs describing the same semantics and similar situations.

The experiment was carried out in accordance with the Declaration of Helsinki and the existing Russian and international regulations concerning ethics in research. The parents signed informed consents for their children to take part in the experiment.

All narratives were audio- and video-recorded. The orthographic annotation of all the recordings was performed by the experimenter who conducted the experiment with all the children. The corpus includes 25 689 tokens, or 5 763 utterances. Only the stories elicited by the children of the three oldest groups (4;7–5;6, 5;7–6;6, and 6;7–7;6) may be considered standard narratives with some coherence and cohesion features. All clauses and their topic-focus structures have been manually annotated. The principles of topic identification have been discussed in [21, 22]. We understand 'topic' and 'focus' as suggested by the Prague School linguists: the topic is the main subject of the utterance, while the focus is the information said about this subject. The results of the automatic topic-focus parsing have been compared to this manually annotated corpus.
4 Decision and Discussion

4.1 Parser
We developed a Topic-Focus Annotating Parser (TFAP) that operates on morphologically and syntactically annotated texts. Morphological analysis, which differentiates parts of speech and identifies the gender, case and number of nouns and the tense, person and number of verbs, etc., is performed by pymorphy2 [23], while syntactic analysis is performed by the Google SyntaxNet parser trained on the SynTagRus syntactic corpus [24]. TFAP works in three steps. The first step provides the annotation of semantic and syntactic roles and marks grammatical features that are important for the subsequent topic-focus analysis. Most nouns in the Nominative case are attributed as Agents, but if a noun is inanimate, it is annotated differently, with the most probable semantic role depending on
its most probable referent. At this stage, TFAP also collects all animate nouns in any case and with any syntactic role, as they may function as a topic later.

Referencing is done in the second step. If an argument is tagged as a pronoun in the Nominative case and its syntactic role is 'Subject', this argument requires an antecedent. If there is an argument in any case other than the Nominative but it is animate, it is marked as a possible referent for a future topic-focus structure. All other pronouns and inanimate nouns are marked as less probable referents and topics.

After these two preliminary steps, TFAP starts structuring the topic-focus schemes and linking them through the text. Each anaphoric word tagged 'ref' is linked to the nearest antecedent with the same set of grammatical features. Verbs are linked to animate nouns in the Nominative or to pronouns within the clause. If there is no animate subject in the Nominative within the clause or within the three previous clauses, the verb is linked to an inanimate noun in the Nominative within the clause. At this stage, TFAP also looks for clause-initial adverbs or adverbial expressions of time and place and, if there is a subject in the same clause, tags them as narrative topics. Every antecedent found is marked as a topic, while the rest of the clause is marked as the focus. Clauses united by the same topic or narrative topic are referred to as a single topic-focus scheme (cf. Fig. 1):
Fig. 1. Topic-focus scheme of a narrative elicited by a boy, 4;7 (this is kitten // playing / fooling // I have this cartoon // kitten is fooling throwing the balls all around the robs is rolling // rabbits are jumping with the rope // rabbits are playing with the rope / and this is beavers are building a house //// and this is kitten is now crying // a bear is riding a bicy… a scooter //// kitten is climbing down the tree).
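To make the three steps above more concrete, the following is a minimal, hypothetical sketch of the linking step in Python (the language of pymorphy2). It is not the authors' implementation: the token attributes (part of speech, case, animacy, agreement features) are assumed to be supplied by the morphological and syntactic front-end described earlier, and only the antecedent search within a three-clause window and the topic/focus marking are shown.

```python
# Hypothetical, simplified sketch of TFAP's linking step (step three).
# Tokens are assumed to be pre-annotated dictionaries produced by the
# morphological/syntactic front-end (pymorphy2 + SyntaxNet in the paper).

def agrees(pronoun, noun):
    """Crude agreement check on gender and number (an assumption)."""
    return (pronoun.get("gender") == noun.get("gender")
            and pronoun.get("number") == noun.get("number"))

def link_topics(clauses, window=3):
    """Assign a topic to each clause and link anaphors to antecedents.

    `clauses` is a list of clauses; each clause is a list of token dicts with
    keys such as 'pos', 'case', 'animate', 'gender', 'number', 'tag'.
    """
    schemes = []
    for i, clause in enumerate(clauses):
        topic = None
        # 1. Prefer an animate noun or a pronoun in the Nominative inside the clause.
        for tok in clause:
            if tok.get("case") == "nom" and (tok.get("animate") or tok.get("pos") == "PRON"):
                topic = tok
                break
        # 2. Anaphors tagged 'ref' are linked to the nearest agreeing antecedent
        #    in up to `window` previous clauses.
        if topic is not None and topic.get("tag") == "ref":
            for prev in reversed(clauses[max(0, i - window):i]):
                cands = [t for t in prev if t.get("pos") == "NOUN" and agrees(topic, t)]
                if cands:
                    topic["antecedent"] = cands[-1]
                    break
        # 3. Fall back to an inanimate Nominative noun if nothing animate was found.
        if topic is None:
            nom_nouns = [t for t in clause if t.get("pos") == "NOUN" and t.get("case") == "nom"]
            topic = nom_nouns[0] if nom_nouns else None
        focus = [t for t in clause if t is not topic]
        schemes.append({"topic": topic, "focus": focus})
    return schemes
```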
4.2 Results and Discussion
Table 1 presents the results of the topic-focus parsing of the narratives elicited by the children aged 4;7–7;6.
Table 1. Parsing results for the narratives elicited by the children aged 4;7–7;6 (* – this narrative was chosen as a sample narrative for TFAP training). Parsing-correctness columns give the % of all narratives in every age group.

Age | Narratives (total) | Clauses (total) | 100% | 90–99% | 80–89% | 50–79% | 0–49% | Mean length of utterance (Me) | Average correctness of clause parsing (%)
4;7–5;6 | 32 | 983 | 15.6 | 25 | 34.4 | 25 | 0 | 4.6 | 86
5;7–6;6 | 21 | 961 | 4.8 | 14.3 | 57.1 | 23.8 | 0 | 5.1 | 84
6;7–7;6 | 17 | 986 | 5.9* | 17.6 | 64.7 | 11.8 | 0 | 5.4 | 85
Parsing correctness for every narrative was estimated as the percentage of correctly parsed utterances. The results show the development of narrative skills in language acquisition. 40.6% of all narratives elicited by the children aged 4;7–5;6 have been parsed with at least 90% correctness, as children of this age produce quite primitive narratives, use mostly lexical nominations, and do not use pronouns or anaphors (cf. the narrative in Fig. 1 above). Their utterances are short, they do not use narrative topics, and they almost never switch between the situation of the cartoon and the situation in which the experiment takes place. All these features are typical of the narratives of children of this age [8, 10, 25, 26], and they simplify TFAP's work. At the same time, children between 4 and 5 years may confuse the gender of characters:

(1) kotik igraet v kubiki // ona vsyo razbrosala
    kitten-M.NOM.SG play-PRS.3SG in block-M.ACC.PL // she-F.3SG everything-N.ACC.SG throw-PST.F.SG
    'the kitten is playing with the blocks // it (she) has thrown everything away'
Or they may omit the lexical nominations of the characters and label only the actions, listing them like pearls on a string, without mentioning any characters:

(2) kotik i myachik / on dvizhetsya i rybku lovit /
    kitten-M.NOM.SG and ball-M.NOM.SG / he/it-M/N.NOM.SG move-PRS.3SG and fish-F.ACC.SG catch-PRS.3SG /
    eshche morgaet glazami / eshche razbrosaet /
    also wink-PRS.3SG eye-M.INS.PL / also throw-PRS/FUT.3SG away /
    a teper' v myachik igraet /
    and now with ball-M.ACC.SG play-PRS.3SG /
    i razrushil kukushku i klubki razbrosal
    and break-PST.M.SG cuckoo-F.ACC.SG and cob-M.ACC.SG throw-PST.M.SG away
    'the kitten and the ball / it moves and is catching a fish / is also winking with its eyes / is also throwing away / and now is playing with the ball / and has broken a cuckoo and has thrown away the cobs'
Another feature of Russian child syntax is the combination of subject omission and the preposing of an object (cf. (2) – rybku lovit 'is catching a fish', v myachik igraet 'is playing with a ball', klubki razbrosal 'has thrown the cobs away'). This inversion is possible but much less frequent in adult speech, and it is impossible to differentiate between the homonymous forms of the Nominative and Accusative Singular for inanimate masculine nouns. TFAP reveals all these deficiencies in child narratives, but they are specific only to the narratives elicited by the children of the youngest age group. Older children produce more sophisticated narratives, and the percentage of correctly annotated utterances decreases. This is caused by the following parsing deficiencies:

– the clauses are too complex, so the distance between the verb and its argument or between the anaphor and its antecedent may be too long;
– in Russian, the main character (the kitten) may be labelled with at least three different lexemes: kotyonok 'kitten' (masculine), kiska 'pussy' and koshka 'cat' (the latter two feminine), and the children often switched the gender of a lexeme and its anaphoric pronoun;
– the number of atypical topic-focus structures (narrative topics, thetic rhemes, cf. [27, 28]) increases as children acquire narrative skills: narratives require a specific word order and some rules of ellipsis that influence the syntactic structure of separate clauses;
– children may switch between the situation of the cartoon and the situation of the experiment; as a result, deictic pronouns appear and the distance between antecedents and their references within the situation of the cartoon increases.

Thus, the parsing errors in the narratives elicited by the children aged 4;7–5;6 are caused by the speakers' imperfection, while the texts are simpler and TFAP can easily parse them. By contrast, the parsing errors in the narratives elicited by the children aged 5;7–7;6
are caused by TFAP's imperfection, while the texts are more sophisticated and reflect all the specific features of spoken language. At the same time, the text topic-focus schemes suggested by TFAP represent the wide subject complexes proposed by Sakharny [10]. The structures of the narratives elicited by the younger children are much simpler than those of the narratives elicited by the older children (cf. Fig. 1, representing the structure of the narrative elicited by a boy aged 4;7, and Fig. 2, representing the structure of the narrative elicited by a girl aged 6;10).
Fig. 2. Topic-focus scheme of a narrative elicited by a girl aged 6;10, fragment (there / there is some house spinning // now there is a kitten who is picking the cubes // now the kitten / he / now the kitten has decided to play with a ball / and accidentally knocked the clocks down // yes the mommy-cat returned home / and saw that there is a mess at home / and and the kitten confessed that it / was him who made the mess / then the mommy-cat got a telephone call / and the mommy-cat took a jar and went to do her stuff / and the kitten stayed at home // and the kitten jumped out to the street and saw the rabbits who / were jumping with a rope / the kitten then the kitten went out to the street to play with the rabbits / the kitten ran up and wanted to jump but / a rabbit said that the kittens are not allowed / then the kitten saw the beavers who were building a house // the kitten saw and wanted / and also wanted / to help the beavers / but a beaver / but a beaver told to the kitten that kittens are not allowed to be builders / the kitten / felt // the kitten felt sad / and that’s why he left / and that’s why he left the rabbits and the beavers).
5 Conclusion and Future Plans

In this paper, we reported the results of an automatic topic-focus analysis of Russian children's narratives performed by the parser TFAP. The data allowed us to discuss both the cognitive aspects of cohesion and coherence acquisition and the deficiencies of the parser. The automatic annotation proved to be most successful for the texts of the youngest group of participants (4;7–5;6), as younger children normally describe separate events or even episodes using a single utterance of a simple structure and do not try to connect these events either in their minds or in their narratives. On the contrary, older children construct a coherent image of the whole episode in their minds, divide it into several connected events and represent this complex structure in their narratives. Thus, our results are consistent with the schemes proposed by Sakharny.

As we mentioned in Sect. 2, we decided to omit the analysis of phonetic information in the current study. However, further development of the automatic topic-focus analysis, as well as a psycholinguistic description of narrative processing in children, will definitely benefit from the inclusion of the phonetic level. Thus, we are now performing a manual acoustic-phonetic transcription of the Corpus based on the principles used in the Corpus of Transcribed Russian Oral Texts [17].

Acknowledgements. The work is supported by research grant number 16-04-50114 (dir. P. Eismont) from the Russian Foundation for Humanities and research grant number MК-6776.2018.6 (dir. E. Riekhakaynen) from the President of the Russian Federation.
References 1. Warner, N.: Methods for studying spontaneous speech. In: Cohn, A., Fougeron, C., Huffman, M. (eds.) Handbook of Laboratory Phonology, pp. 612–633. Oxford University Press, Oxford (2012) 2. De Beaugrande, R., Dressler, W.U.: Introduction to Text Linguistics. Longman, London, New York (1981) 3. Murzin, L.N., Stern, A.S.: Text and Its Perception. UGU Press, Sverdlovsk (1991). (in Russian) 4. Berman, R.A., Slobin, D.I.: Relating Events in Narrative: A Crosslinguistic Developmental Study. Lawrence Erlbaum, Hillsdale (1994) 5. Manhardt, J., Rescorla, L.: Oral narrative skills of late talkers at ages 8 and 9. Appl. Psycholinguist. 23, 1–21 (2002) 6. Frauenfelder, U.H., Peeters, G.: Lexical segmentation in TRACE: an exercise in simulation. In: Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives, pp. 51–86. MIT Press, Cambridge Mass (1990) 7. Matthews, P.H.: Oxford Concise Dictionary of Linguistics. Oxford University Press, Oxford (2003) 8. Daneš, F.: Functional sentence perspective and the organization of the text. In: Papers on Functional Sentence Perspective, pp. 106–128. Academia, Prague (1974) 9. Van Dijk, T.A.: Sentence Topic and Discourse Topic. http://www.discourses.org/ OldArticles/Sentence%20topic%20and%20discourse%20topic.pdf. Accessed 02 May 2018
10. Sakharny, L.V.: Topic-focus structure in text: basic notions (in Russian). Lang. Lang. Behav. 1, 7–16 (1998) 11. Hahn, U.: On Text Coherence Parsing. In: COLING, pp. 25–31 (1992) 12. Nikolenko, S.I., Koltcov, S., Koltsova, O.: Topic modelling for qualitative studies. J. Inf. Sci. 43(1), 88–102 (2017) 13. Ionov, M.I.: Automatic detection of the discourse status of a referent of a noun phrase. Rhema 4, 24–42 (2016). (in Russian) 14. Hassanali, Kh., Liu, Y., Solorio, Th. Coherence in child language narratives: a case study of annotation and automatic prediction of coherence. In: Third Workshop on Child, Computer and Interaction (WOCCI 2012) ISCA. http://www.isca-speech.org/archive/wocci_2012. Accessed 24 June 2018 15. Kibrik, A.: Reference in Discourse. Oxford University Press, Oxford (2011) 16. Kibrik, A.A., Podlesskaya, V.I. (eds.): Stories about Dreams: Corpus-based Study of Russian Spoken Discourse. Yazyki slavyanskikh kultur, Moscow (2009). (in Russian) 17. Nigmatulina, J., Raeva, O., Riechakajnen, E., Slepokurova, N., Vencov, A.: How to study spoken word recognition: evidence from Russian. In: Anstatt, T., Gattnar, A., Clasmeier, Ch. (eds.) Slavic Languages in Psycholinguistics: Chances and Challenges for Empirical and Experimental Research, pp. 175–190. Narr Verlag, Tuebingen (2016) 18. Raeva, O.V., Riekhakaynen, E.I.: Spontaneous Russian texts from a listener’s perspective. Soc. Psycholinguist. Res. 3, 67–70 (2015). (in Russian) 19. Eismont, P.M.: “KONDUIT”: Corpus of child oral narratives. In: Proceedings of the International Conference “Corpus Linguistics – 2017”, pp. 373–377. Saint-Petersburg State University, St. Petersburg (2017). (in Russian) 20. Ambridge, B., Rowland, C.F.: Experimental methods in studying child language acquisition. Wiley Interdisc. Rev. Cogn. Sci. 4(2), 149–168 (2013) 21. Kehler, A.: Discourse topics, sentence topics, and coherence. Theor. Linguist. 30, 227–240 (2004) 22. Götze, M., Weskott, Th., Endriss, C., Fiedler, I., Hinterwimmer, St., Petrova, Sv., Schwarz, A., Skopeteas, St., Stoel, R.: Information structure. In: Dipper, St., Götze, M., Skopeteas, St. (eds.) Interdisciplinary Studies on Information Structure, vol. 7 of Working papers of the SFB 632, pp. 147–187. Universitätsverlag, Potsdam (2007) 23. Korobov, M.: Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, Valeri G. (eds.) AIST 2015. CCIS, vol. 542, pp. 320–332. Springer, Cham (2015). https://doi.org/10. 1007/978-3-319-26123-2_31 24. Dyatchenko, P.V. et al.: Current state of the deeply annotated corpus of Russian texts (SynTagRus) (in Russian). In: Russian National Corpus: 10 Years of the Project. Proceedings of the V.V. Vinogradov Russian Language Institute, pp. 272–299. Russian Language Institute, Moscow (2015) 25. Bamberg, M.: The Acquisition of Narratives: Learning to use Language. Mouton de Gruyter, Berlin (1987) 26. Van Dam, F.J.: Development of Cohesion in Normal Children’s Narratives. Project report. https://dspace.library.uu.nl/bitstream/handle/1874/180044/Development%20of%20cohesion %20in%20normal%20children%27s%20narratives%20research%20report.pdf?sequence= 1&isAllowed=y. Accessed 02 May 2018 27. Paducheva, E.V.: Communicative perspective interpretation: basic structures and linearaccentual transformations. Comput. Linguist. Intellect. Technol. 11(18), 522–535 (2012) 28. Zimmerling, A.V.: Thetic sentences: semantics and derivation (in Russian). In: Lyutikova, E. 
A., Zimmerling, A.V., Konoshenko, M.B. (eds.) Typology of Morphosyntactic Parameters, vol. 1, pp. 223–252, M.A. Sholokhov MGGU, Moscow (2014)
Context Modeling for Cross-Corpus Dimensional Acoustic Emotion Recognition: Challenges and Mixup

Dmitrii Fedotov1(B), Heysem Kaya2, and Alexey Karpov3

1 Institute of Communications Engineering, Ulm University, Ulm, Germany
[email protected]
2 Department of Computer Engineering, Tekirdağ Namık Kemal University, Çorlu, Turkey
[email protected]
3 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
[email protected]
Abstract. Recently, the focus of research in the field of affective computing has shifted to spontaneous interactions and time-continuous annotations. Such data broaden the possibilities for real-world emotion recognition in the wild, but also introduce new challenges. Affective computing is a research area where data collection is neither trivial nor cheap; therefore, it is rational to use all the data available. However, due to the subjective nature of emotions and to differences in cultural and linguistic features as well as environmental conditions, combining affective speech data is not a straightforward process. In this paper, we analyze the difficulties of automatic emotion recognition in a time-continuous, dimensional scenario using data from the RECOLA, SEMAINE and CreativeIT databases. We propose to employ a simple but effective strategy called "mixup" to overcome the gap in feature-target and target-target covariance structures across corpora. We showcase the performance of our system in three different cross-corpus experimental setups: single-corpus training, two-corpora training and training on augmented (mixed-up) data. Findings show that the prediction behavior of trained models heavily depends on the covariance structure of the training corpus, and that mixup is very effective in improving the cross-corpus acoustic emotion recognition performance of context-dependent LSTM models.
Keywords: Cross-corpus emotion recognition · Time-continuous emotion recognition · Data augmentation
1 Introduction
Automatic affect recognition is a popular research topic, which brings researchers from psychological and technical areas together [19,24]. It can be beneficial in
a variety of applications in the areas of human-computer interaction (HCI) and human-human interaction (HHI). An emotional component in an HCI system allows it to perceive the emotional state of the speaker and adjust its response to increase the quality of interaction.

Although emotion recognition has been a hot topic for a long period and a large amount of research has been conducted, the problem is far from being solved. Less than two decades ago, emotion recognition left laboratory conditions and faced real-world data and problems, such as cultural, linguistic and environmental differences [10,22]. The combination of different corpora, which could solve the problem of data shortage, cannot be applied in a straightforward manner in the context of acoustic emotion recognition. The main difficulty lies in the subjective nature of emotions, resulting in diverse and controversial annotations. Despite these issues, data combination and augmentation may lead to a dramatic increase in the performance of affect recognition systems.

In this paper, we deal with the problems of cross-corpus time-continuous dimensional emotion recognition and propose ways to overcome them. We observe that pure cross-corpus emotion recognition may not work properly if the data have different label distributions. We also show that this problem can be partially solved by combining and augmenting data.

This paper is structured as follows: we introduce the related work in Sect. 2; provide information on the corpora used, data preprocessing techniques and methodology in Sect. 3; present the results of different cross-corpus emotion recognition settings in Sect. 4; and conclude the paper in Sect. 5.
2 Related Work
Most of the previous research on emotion recognition dealt with acted, categorically labeled corpora, providing information at the utterance level [1,7,11]. Continuously annotated databases of spontaneous interactions provide more naturalistic data, but also introduce several challenges, such as diversity in annotations [16,17], reaction lags between the actual appearance of an emotion and its annotation [12], and the amount of contextual information the system needs [5,6].

The problem of cross-corpus emotion recognition has been investigated by several research groups. Schuller et al. studied this problem with acted, categorically annotated databases [22]. The performance of the proposed methodology was poor when differences in environmental conditions were present. For some of the emotions, the classification accuracy of the Support Vector Machine (SVM) based model used was below chance level. The authors also showed that the normalization strategy plays a crucial role in the cross-corpus scenario and concluded that speaker-level normalization leads to the best performance compared to other approaches. The study of the normalization effect on cross-corpus emotion recognition performance was extended, and cascaded normalization techniques, which comprise speaker, value and instance level normalization, were recently introduced and tested in [9]. The proposed approach achieved increased performance, reducing cross-corpus differences with respect to suprasegmental acoustic features.
A recent study focused on cross-corpus recognition of self-assessed affect. Cross-corpus predictions of affective primitives were used as data for extracting functionals and then combined with the predictions of other sub-systems to improve performance [8].

These studies provided a starting point for the paper in hand, and a speaker-level normalization technique was used. Cross-corpus emotion recognition with time-continuous data is poorly studied, which served as the motivation to conduct our research.
3 Data and Methodology
Three corpora of spontaneous, emotionally-rich interactions are used in this study: RECOLA [20], SEMAINE [13] and CreativeIT [14]. All corpora are annotated at frame level using two affective scales: arousal (activation) and valence (positivity). A brief overview of the used corpora is presented in Table 1.

Table 1. Overview of used corpora.

Corpus | Duration (min) | Recordings | Participants | Gender (m/f) | Age μ (σ) | Annotation rate (Hz)
RECOLA | 115 | 23 | 23 | 10/13 | 21.4 (2.0) | 25
SEMAINE | 435 | 24 | 20 | 8/12 | 30.4 (10.4) | 50
CreativeIT | 132 | 31 | 15 | 7/8 | N/A | 60
3.1 RECOLA
The RECOLA (Remote COLlaborative and Affective interactions) database was collected during spontaneous dyadic interactions between people solving a cooperative problem. Of the 46 people participating in the database collection, 34 gave their consent to make the data publicly available, and recordings from 23 users are included in the current version of the database shared with the research community. Each recording has a duration of five minutes, yielding 115 min of speech in total. Participants are aged between 18 and 25 years and have different mother tongues, although all spoke French during the database collection process: 17 of them have French as a mother tongue, 3 Italian and 3 German. The corpus was recorded in four modalities: audio, video, electrocardiogram and electro-dermal activity. Recordings were continuously annotated by 6 equally gender-distributed annotators via the ANNEMO (ANNotating EMOtions) annotation tool [20] on two affective scales (arousal and valence) and five social behavior scales (agreement, dominance, engagement, performance, rapport).
3.2 SEMAINE
The SEMAINE (Sustained Emotionally coloured Machine-human Interaction using Nonverbal Expression) database was collected within a project whose aim was to build a system that could engage a person in a sustained conversation with a Sensitive Artificial Listener (SAL) agent. Three scenarios were used in this project: Solid SAL, where the agent's role was played by a real human operator; Semi-Automatic SAL, where the system spoke phrases chosen by a human operator from a pre-defined list; and Automatic SAL, where the system chose phrases and non-verbal signals by itself. Only data collected from users (not operators) in the Solid SAL scenario were used in this study.

The corpus consists of 24 recordings in English from 20 speakers, whose ages range from 20 to 58 years. Recordings have durations from 11 to 30 min, resulting in a total corpus length of 435 min. The corpus was recorded in two modalities, audio and video, and annotated via the FeelTrace annotation tool [3] in different dimensions and emotional labels: valence, arousal, power, anticipation, intensity, fear, anger, happiness, sadness, disgust, contempt and amusement.
3.3 CreativeIT
The CreativeIT database was collected to serve as a multidisciplinary resource for theatrical performance improvement and emotion recognition. It was recorded by actors coordinated by a director with an expert qualification in Active Analysis, introduced by Stanislavsky. Two scenarios were used during the database collection: a two-sentence exercise, where actors were permitted to use only one predefined phrase each; and paraphrase of a script, where actors followed a general script without any constraints on words and expressions. Only the paraphrase part of the corpus was used in this study, as it meets the conditions of spontaneous interaction most closely.

The selected part of the corpus consists of 31 recordings in English from 15 participants. The duration of the recordings ranges from 2 to 7 min, with a total of 132 min. In addition to audio data from close-up microphones, motion capture data are available for each recording, representing the body language of the actors during the interactions. Recordings were annotated via the FeelTrace annotation tool [3] by three groups of evaluators (theater experts, actors and a naive audience) in different dimensional groups, such as emotional descriptors (arousal, valence) and theatrical performance ratings (naturalness, creativity).
3.4 Features and Labels
For cross-corpus emotion recognition, the audio modality was used in this study, as it is present in each corpus described above. Audio features were extracted with the openSMILE tool [4]. They consist of 65 low-level descriptors (LLDs) and their first order derivatives [21]. The feature step size was set to 0.01 s, resulting in a feature extraction rate of 100 Hz. As the corpora have different annotation rates (see Table 1), they were brought to the same data frequency to be able to share
the same prediction models. The lowest annotation frequency of 25 Hz, present in RECOLA, was used to subsample the other two corpora. The extracted features were speaker-level z-normalized, as this was previously shown to yield better performance in cross-corpus experiments [9]. Annotations of the two main affective dimensions – arousal and valence – were used as labels in this study. The distributions of the labels for the corpora described above are presented in Fig. 1.
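A minimal sketch of this speaker-level z-normalization is given below; it is a hypothetical illustration, not the authors' code, and the closing comment notes how the subsampling to 25 Hz could be approximated, since the exact resampling scheme is not detailed in the text.

```python
import numpy as np

def speaker_znorm(features, speaker_ids):
    """Speaker-level z-normalization: each speaker's features are standardized
    with that speaker's own mean and standard deviation.

    features: (n_frames, n_feats) array; speaker_ids: (n_frames,) array of IDs.
    """
    out = np.empty_like(features, dtype=float)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = features[idx].mean(axis=0)
        sd = features[idx].std(axis=0)
        sd[sd == 0] = 1.0                      # guard against constant features
        out[idx] = (features[idx] - mu) / sd
    return out

# Bringing the 50/60 Hz annotations down to RECOLA's 25 Hz rate could be done,
# e.g., by nearest-neighbour resampling of the annotation timestamps
# (an assumption; the exact resampling scheme is not specified in the paper).
```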
Fig. 1. Label distributions in three emotional corpora: (a) arousal, (b) valence.
The label distribution of RECOLA is narrower in both affective dimensions than those of the remaining corpora. This may be a result of its purely spontaneous nature. Although all corpora used in this study are designed to be naturalistic, SEMAINE can simulate four personality prototypes, which affect the operators' behavior and hence the user. Even though the actors participating in the collection of the CreativeIT database were not restricted lexically in choosing the words for the interaction, they had to follow the general scenario and their role. These conditions could have led to a more idiosyncratic nature of emotions in both SEMAINE and CreativeIT.
3.5 Modeling
In this study, a recurrent neural network with long short-term memory (LSTM-RNN) was used for context modeling. The model comprises two layers with 80 and 60 neurons, respectively, with the ReLU activation function [15], each followed by a dropout layer with p = 0.3 [23]. The models were optimized by root mean square propagation (RMSprop) using the concordance correlation coefficient as a metric function. We use the LSTM implementation provided by the Keras toolkit [2].

Our recent study has revealed that the performance of time-continuous emotion recognition strongly depends on the amount of acoustic context used in recurrent neural network (RNN) models, regardless of the number of time steps [5]. The required amount of context can be set by a combination of two parameters: the number of time steps fed into the RNN model and a sparsing coefficient, which is responsible for decreasing the amount of data in each sample by skipping
frames. Regardless of the sparsing coefficient, the step size between samples is one frame; hence there is no loss in the total amount of information. The amount of context in seconds is then represented as:

C = (SC × TW) / FR,    (1)
where SC is the sparsing coefficient that determines the number of frames to skip, TW is the time window size and FR is the frame rate in Hz. Based on our previous research [5], a context size of 7.68 s, obtained from the combination of SC = 12 and TW = 16, was selected for this study. The same sparsing procedure applies to the respective labels. Sequence-to-sequence modeling is used in this study; thus, the features of the TW previous frames were used to predict the corresponding labels for these frames. After the prediction phase, the label values obtained for the same frame at different time steps were averaged to smooth the final prediction.
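The following sketch illustrates this windowing and smoothing scheme for one recording. It is an assumption-based reconstruction (with SC = 12, TW = 16 and FR = 25 Hz, i.e. C = 7.68 s as in Eq. (1)), not the authors' code.

```python
import numpy as np

SC, TW, FR = 12, 16, 25          # sparsing coefficient, time window, frame rate (Hz)
context_sec = SC * TW / FR       # Eq. (1): 12 * 16 / 25 = 7.68 s

def make_sequences(features, labels, sc=SC, tw=TW):
    """Build sparsed (tw, n_feats) samples with a step of one frame.

    features: (n_frames, n_feats) array, labels: (n_frames,) array.
    Each sample takes every sc-th frame from the preceding sc*tw frames.
    """
    span = sc * tw
    X, Y, idx = [], [], []
    for end in range(span, len(features) + 1):
        frames = np.arange(end - span, end, sc)   # tw sparsed frame indices
        X.append(features[frames])
        Y.append(labels[frames])
        idx.append(frames)                        # remember which frames are predicted
    return np.stack(X), np.stack(Y), idx

def smooth_predictions(pred_seqs, idx, n_frames):
    """Average the predictions made for the same frame at different time steps."""
    acc = np.zeros(n_frames)
    cnt = np.zeros(n_frames)
    for pred, frames in zip(pred_seqs, idx):
        acc[frames] += pred
        cnt[frames] += 1
    cnt[cnt == 0] = 1
    return acc / cnt
```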
3.6 Mixup for Data Augmentation and Corpus Adaptation
To combine data from different corpora, a recently introduced methodology called mixup was used in this study [25]. mixup is a data augmentation technique that constructs virtual training examples from existing ones, using weights drawn from a Beta distribution to regulate their contribution to the synthetic instance:

x_new = λ x_i + (1 − λ) x_j,    (2)
y_new = λ y_i + (1 − λ) y_j,    (3)
where λ ∼ Beta(α, α), α is a hyper-parameter of the Beta distribution, x_i, x_j are feature vectors, and y_i, y_j are label values/vectors. This kind of data augmentation encourages the model to behave more linearly in between training examples, which can be useful for cross-corpus learning. In this study, the feature vectors x_i, x_j and the corresponding labels y_i, y_j were taken from two different corpora. To create different sets of augmented data, the hyper-parameter α of the mixup routine was varied (see Fig. 2). Three values were tested: α = 0.1, which provides slight changes to the original data and a minor contribution of the second corpus; α = 1, which provides a uniformly distributed level of contribution of both corpora to the augmented data; and α = 10, which creates most examples in the middle of the feature-label space between the two samples. To preserve the sequential nature of the data, streams were mixed up at the recording level with consecutive frames.
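Below is a minimal sketch of this cross-corpus mixup following Eqs. (2)–(3). The single λ per recording pair and the truncation to the shorter stream are assumptions about how "recording level" mixing with consecutive frames could be realized; they are not the authors' exact procedure.

```python
import numpy as np

def mixup_recordings(x_a, y_a, x_b, y_b, alpha=1.0, seed=None):
    """Mix two frame-synchronous feature/label streams from different corpora
    into one synthetic stream, following Eqs. (2)-(3).

    x_a, x_b: (n_frames, n_feats) arrays; y_a, y_b: (n_frames, n_targets) arrays.
    A single lambda per recording pair keeps consecutive frames consistent
    (an assumption matching the recording-level mixing described above).
    """
    rng = np.random.RandomState(seed)
    n = min(len(x_a), len(x_b))                 # truncate to the shorter stream
    lam = rng.beta(alpha, alpha)
    x_new = lam * x_a[:n] + (1.0 - lam) * x_b[:n]
    y_new = lam * y_a[:n] + (1.0 - lam) * y_b[:n]
    return x_new, y_new

# alpha controls the Beta(alpha, alpha) distribution of lambda:
#   alpha = 0.1 -> lambdas near 0 or 1 (one corpus dominates),
#   alpha = 1   -> uniformly distributed mixing levels,
#   alpha = 10  -> lambdas concentrated around 0.5.
```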
Fig. 2. Beta-distribution with three different values of parameter α: (a) α = 0.1, (b) α = 1, (c) α = 10.

4 Experimental Results

In this paper, the problem of cross-corpus multi-dimensional emotion recognition is considered. To study the issues and particularities of time-continuous
and multidimensional emotion recognition, three experimental setups were used: single-corpus training, two-corpora training and training on augmented data. The performance of cross-corpus prediction was estimated using Pearson's correlation coefficient (ρ).
4.1 Single-Corpus Training
The first problem definition was to predict values on an unseen corpus using a model trained on a single corpus. Two models (for arousal and valence) were trained on all data available for one corpus for up to 5 epochs; they were then used to generate predictions for the different corpora, including the training corpus itself (to show the ground-truth label distributions). Scatter plots of the predictions in the single-corpus training setting are presented in Fig. 3.
Fig. 3. Single-corpus training (x-axis – valence, y-axis – arousal). Scatter plots of predictions for each training corpus (rows: RECOLA, SEMAINE, CreativeIT) on each test corpus (columns: RECOLA, SEMAINE, CreativeIT); both axes range from −1.0 to 1.0.
Table 2. Pearson correlation scores (arousal/valence) for single-corpus training.

Train on \ Test on | RECOLA | SEMAINE | CreativeIT
RECOLA | 0.923/0.890 | 0.375/0.223 | 0.337/−0.024
SEMAINE | 0.533/0.170 | 0.821/0.750 | 0.322/−0.065
CreativeIT | −0.027/0.009 | 0.306/−0.013 | 0.953/0.952
The label distributions of each corpus can be seen in the self-prediction panels (main diagonal). Figure 3 shows that the models predict only within the limits of their own annotation distributions and exhibit the same tendencies regardless of the test data. This results in low cross-corpus prediction performance, in some cases even leading to a negative correlation (see Table 2). Negative correlations may also be attributed to the use of different annotation tools. The ANNEMO tool has two separate bars for arousal and valence that are manipulated by the user independently. The FeelTrace toolkit, however, provides a two-dimensional emotion representation with basic emotions displayed on the graph, which in some cases (e.g. for "afraid") are placed conversely to other research [18].
4.2 Multi-corpus Training
The second research problem was to predict affect primitives on an unseen corpus using a model trained on the two remaining corpora. The other experimental parameters were kept the same as in the single-corpus training setting. We refer to this multi-corpus training scheme as "combining". The third research problem was to predict arousal and valence on one of the corpora using a model trained on fully synthetic data generated from the remaining corpora with the mixup routine. Comparative multi-corpus training results with the combining and mixup strategies are presented in Table 3, where improved performance of multi-corpus training over the best single-corpus training performance on a target corpus is shown in bold.

Table 3. Pearson correlation scores (arousal/valence) for leave-one-corpus-out training results.
Train on | Test on | Combined | Mixed up (best α)
RECOLA + CreativeIT | SEMAINE | 0.359/0.050 | 0.368 (1)/−0.012 (10)
RECOLA + SEMAINE | CreativeIT | 0.435/−0.016 | 0.431 (1)/−0.041 (0.1)
CreativeIT + SEMAINE | RECOLA | 0.222/0.149 | 0.695 (1)/0.294 (10)
Compared to single-corpus training, the combination of data results in an approximate averaging of the performances of the two corpora used for training. Only a
combination of SEMAINE and RECOLA provides better results for CreativeIT as the test corpus, in the arousal dimension. Mixup-based data augmentation allows the model to benefit more from the differences between the databases, creating synthetic samples that yield a model with higher generalization ability. Thus, mixup dramatically improves over single-corpus training on two corpora, and causes only a relatively slight performance decrease (from 0.375 to 0.368) in the SEMAINE arousal dimension. The advantage of using mixup over simple combination is seen clearly on the RECOLA corpus: while the combining approach markedly underperforms the single-corpus performance, mixup improves it in both the arousal and valence dimensions.
5 Conclusions and Future Work
In this paper, we studied the problems of time-continuous, multidimensional, cross-corpus emotion recognition. In addition to the feature distribution problem, which is present in other cross-corpus settings and can be partially solved by speaker-level normalization, the dimensional approach introduces the challenge of different label distributions. This can be caused by the initial database collection scenario, different annotation software or people's perception of emotions. Nevertheless, it may serve as a limiting factor for the system: it may not let the system predict outside the originally trained distribution and may even result in converse behavior.

In future work, a cross-task approach will be introduced to the current research to increase the coverage of the arousal-valence space by using corpora with categorical annotation. The question of mapping emotion labels between corpora is still poorly studied, but an effective approach may increase the amount of data available for different experimental settings, which will have a positive effect on the performance of the emotion recognition system.

Acknowledgments. This research is supported by the Russian Science Foundation (project No. 18-11-00145).
References 1. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Ninth European Conference on Speech Communication and Technology (2005) 2. Chollet, F., et al.: Keras (2015). https://keras.io 3. Cowie, R., Douglas-Cowie, E., Savvidou*, S., McMahon, E., Sawey, M., Schr¨ oder, M.: ‘FEELTRACE’: An instrument for recording perceived emotion in real time. In: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion (2000) 4. Eyben, F., W¨ ollmer, M., Schuller, B.: Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. ACM (2010) 5. Fedotov, D., Ivanko, D., Sidorov, M., Minker, W.: Contextual dependencies in timecontinuous multidimensional affect recognition. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC) (2018)
6. Gunes, H., Pantic, M.: Automatic, dimensional and continuous emotion recognition. Int. J. Synth. Emotions 1(1), 68–99 (2010) 7. Haq, S., Jackson, P.J.: Multimodal emotion recognition. Machine audition: principles, algorithms and systems, pp. 398–423 (2010) 8. Kaya, H., Fedotov, D., Ye¸silkanat, A., Verkholyak, O., Zhang, Y., Karpov, A.: LSTM based cross-corpus and cross-task acoustic emotion recognition. In: INTERSPEECH 2018. ISCA (2018, in press) 9. Kaya, H., Karpov, A.A.: Efficient and effective strategies for cross-corpus acoustic emotion recognition. Neurocomputing 275, 1028–1034 (2018) 10. Lim, N.: Cultural differences in emotion: differences in emotional arousal level between the east and the west. Integr. Med. Res. 5(2), 105–109 (2016) 11. Makarova, V., Petrushin, V.A.: RUSLANA: A database of Russian emotional utterances. In: Seventh International Conference on Spoken Language Processing (2002) 12. Mariooryad, S., Busso, C.: Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In: Affective Computing and Intelligent Interaction (ACII), pp. 85–90. IEEE (2013) 13. McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schroder, M.: The SEMAINE database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012) 14. Metallinou, A., Lee, C.C., Busso, C., Carnicke, S., Narayanan, S.: The USC CreativeIT database: a multimodal database of theatrical improvisation. In: Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, p. 55 (2010) 15. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010) 16. Nicolaou, M.A., Gunes, H., Pantic, M.: Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In: Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality. German Research Center for AI (DFKI) (2010) 17. Nicolle, J., Rapp, V., Bailly, K., Prevost, L., Chetouani, M.: Robust continuous prediction of human emotions using multiscale dynamic cues. In: Proceedings of the ACM International Conference on Multimodal Interaction, pp. 501–508 (2012) 18. Paltoglou, G., Thelwall, M.: Seeing stars of valence and arousal in blog posts. IEEE Trans. Affect. Comput. 4(1), 116–123 (2013) 19. Petta, P., Pelachaud, C., Cowie, R.: Emotion-Oriented Systems: The HUMAINE Handbook. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-64215184-2 20. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013) 21. Schuller, B., Steidl, S., Batliner, A., Epps, J., Eyben, F., Ringeval, F., Marchi, E., Zhang, Y.: The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load. In: Fifteenth Annual Conference of the International Speech Communication Association (2014) 22. Schuller, B., Vlasenko, B., Eyben, F., Wollmer, M., Stuhlsatz, A., Wendemuth, A., Rigoll, G.: Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput. 1(2), 119–131 (2010)
23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) 24. Valstar, M., Gratch, J., Schuller, B., Ringeval, F., Lalanne, D., Torres Torres, M., Scherer, S., Stratou, G., Cowie, R., Pantic, M.: Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2016) 25. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Functional Mapping of Inner Speech Areas: A Preliminary Study with Portuguese Speakers

Carlos Ferreira1,2,10, Bruno Direito1, Alexandre Sayal1,2, Marco Simões2,3,4, Inês Cadório5,6, Paula Martins5,7,8, Marisa Lousada5,6, Daniela Figueiredo5,6, Miguel Castelo-Branco2,3, and António Teixeira8,9(B)

1 Institute of Nuclear Sciences Applied to Health, University of Coimbra, Coimbra, Portugal
[email protected], [email protected]
2 CIBIT Coimbra Institute for Biomedical Imaging and Translational Research, ICNAS, University of Coimbra, Coimbra, Portugal
3 Faculty of Medicine, University of Coimbra, Coimbra, Portugal
4 Center for Informatics and Systems, University of Coimbra, Coimbra, Portugal
5 School of Health Sciences, University of Aveiro, Aveiro, Portugal
6 Center for Health Technology and Services Research, University of Aveiro, Aveiro, Portugal
7 Institute of Biomedicine, University of Aveiro, Aveiro, Portugal
8 Institute of Electronics and Telematics Engineering of Aveiro (IEETA), Aveiro, Portugal
9 Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
[email protected]
10 Perspectum Diagnostics, Oxford, UK
Abstract. Inner speech can be defined as the act of talking silently with ourselves. Several studies have aimed to understand how this process is related to speech organization and language. Despite the advances, some results are still contradictory. Importantly, language dependency is scarcely studied. For this first fMRI study of inner speech with Portuguese native speakers, we selected a confrontation naming task consisting of 40 black and white line drawings. Five healthy participants were instructed to name the visually presented image in inner and in overt speech. fMRI data analysis considering the proposed inner speech paradigm identified several brain areas, such as the left inferior frontal gyrus, including Broca's area, the supplementary motor area, the precentral gyrus and the left middle temporal gyrus, including Wernicke's area. Our results also show more pronounced bilateral activations during the overt speech task when compared to inner speech, suggesting that inner and overt speech activate similar areas but that stronger activation is found in the latter. However, this difference stems in particular from significant activation differences in the right precentral gyrus and middle temporal gyrus.
Keywords: Inner speech · Overt speech · fMRI · Portuguese

1 Introduction
Inner speech is defined as the act of talking to ourselves silently [6,13,15]. Several studies implicate inner speech in memory tasks, reading, comprehension, consciousness, inner thought (self-reflection tasks) [6,14] and prospective thought [16]. According to the literature, two levels of inner speech can be defined: one (more abstract) designated the "language of mind", where the syntax is not fully structured and the semantics is more personal and subjective; the other level is more concrete, and phonological and phonetic components can be present [6]. Aside from these two features intrinsic to inner speech, there is still a lack of understanding of how inner speech is related to speech organization and language. To that end, recent work has been developed to better understand the relation between inner and overt speech and their correspondence to language pathways. Despite all previous efforts, there is still a lack of consensus regarding the relation between inner and overt speech [6,18]. Some of the factors that contribute to this are the variability of the paradigms used to explore inner and overt speech, the fact that some studies did not compare inner and overt speech, and the fact that others did not monitor participants' performance [6,18].

To assess their neural underpinnings, different methods such as Positron Emission Tomography (PET), electroencephalography (EEG), Transcranial Magnetic Stimulation (TMS) and Functional Magnetic Resonance Imaging (fMRI) can be used. Recent advances in the field of Magnetic Resonance Imaging (MRI), combining optimized spatial and improved temporal resolution with multivariate supervised learning methods (allowing assessments in real time), have established this technique as one of the most important for understanding brain mechanisms. The fact that it does not use ionizing radiation (as PET imaging does) also represents a significant advantage of fMRI for assessing brain function. fMRI uses the contrast between oxygenated and deoxygenated blood, the blood-oxygenation-level-dependent (BOLD) effect, which is based on the coupling between the hemodynamic response and neuronal activity. Currently, fMRI using the BOLD effect is one of the preferred methods to map neuronal activity [26]. High-field MRI scanners are being used to increase the signal-to-noise ratio, ultimately improving the ability to map brain function based on the BOLD signal [10,26].

Recent studies have used fMRI to understand and identify the brain areas involved in inner speech. Areas such as the left inferior frontal gyrus (IFG) (including Broca's area), Wernicke's area, the right temporal cortex, the supplementary motor area (SMA), the insula, the right superior parietal lobule (SPL) and the right superior cerebellar cortex were found to be involved in inner speech [6,9,12,15]. Geva [6] mentions that structural connectivity patterns near the supramarginal gyrus (SMG) (implicated in the dorsal pathway of language) are predictive of internal speech.
In a critical review, it is mentioned that planning without speech production and articulation is supported by connections between the prefrontal cortex and the left IFG (Broca's area) [9]. The existence of projections between areas related to speech production and the auditory cortex is also stated to be relevant for the verbal self-monitoring of internal speech [9]. It is also mentioned that the nature of inner speech is supported by connections between frontal and temporal regions, which inform the areas related to language perception of the self-generated nature of the verbal output.

To map the areas related to inner speech, several paradigms are being used. One example [19] analyzes the relation between frontal and temporal activity, instructing the participants to say the same word (word repetition task) at different time points – every second or every 4 s (conditions fast vs. slow), and every second, every 2 s or every 4 s (conditions fast vs. medium vs. slow). The moment when the participants had to perform the task was indicated by a visual cue [19]. Another example mapped inner speech during a working memory task, where the authors exploited a storage condition and a manipulation condition with sub-vocal reproduction of letters [12]. This paradigm allowed the identification of active brain areas related to working memory during an inner speech task. Paradigms that include letter or object naming, animal name generation, verb generation, reading, rhyme judgement, counting or semantic fluency tasks are also used to assess inner-speech-related brain areas [6,18].

In the present study, we focus on the optimization of a paradigm that can easily be used to study inner and overt speech and the possible relation between the areas recruited by both processes. We use a confrontation naming task to evaluate the variability/differences between both speech mechanisms and try to map areas that could be related only to pure inner speech. We also want to assess the feasibility of mapping inner-speech-related areas when performing a language task in the context of the European Portuguese language.

Paper Structure: The paper is structured as follows: this brief introductory section presents related work; Sect. 2 details the methods of fMRI data acquisition (including the stimulation protocol and MR parameters) and the tools used for image processing and analysis; Sect. 3 provides the most relevant results obtained so far; in Sect. 4 we discuss the results, comparing our findings with the published literature; finally, the conclusions that can be drawn are presented.
2 Methods
The study consisted of the recording and analysis of fMRI data while native speakers of Portuguese performed inner and overt speech tasks in response to visual stimuli.

Participants: Five healthy volunteers, all native Portuguese speakers (mean age: 22.2 years; 3 males), were enrolled in this study. All participants had normal or corrected-to-normal vision and no history of neurological disorders.
The Edinburgh handedness test was applied to the participants to ensure they were all right-handed (mean 92% right), and they all declared Portuguese as their native language. The study was approved by the Ethics Commission of the Faculty of Medicine of the University of Coimbra and was conducted in accordance with the Declaration of Helsinki. All subjects provided written informed consent to participate in the study.

Data Collection: The data were collected using a Siemens Magnetom Trio 3 T scanner (Erlangen, Germany) with a 12-channel head coil. Anatomical images were acquired using a sagittal T1 3D MPRAGE sequence with the following parameters: TR = 2530 ms; TE = 3.42 ms; TI = 1100 ms; flip angle = 7°; 176 slices; matrix size 256 × 256; voxel size 1 × 1 × 1 mm. After the anatomical scan, functional maps were obtained using axial gradient echo-planar imaging BOLD sequences parallel to the bi-commissural plane with the following parameters: TR = 3000 ms; TE = 30 ms; 40 slices; matrix size 70 × 70; voxel size 3 × 3 × 3 mm. Visual stimuli were presented on a NordicNeuroLab (Bergen, Norway) LCD monitor with a resolution of 1920 × 1080 pixels and a refresh rate of 60 Hz.

Stimulation Protocol: The experimental protocol consisted of a picture naming task – inner and overt speech – with 40 black and white line drawings selected from the Snodgrass & Vanderwart corpus [20] (Fig. 1). Black and white line drawings were preferred over colored pictures because of their simplicity. Additionally, ambiguous pictures that could elicit more than one target word (e.g. a bottle with water) were excluded from the task. The inner and overt speech runs consisted of a block-design experiment with nine rest blocks of 15 s and 8 task blocks of 30 s, where each image was presented for 3 s, with 10 images per block and two repetitions per image in the run. Each run had a total duration of 125 volumes (Fig. 1). In the baseline condition, the participants were instructed to focus on the fixation cross presented. During the task condition, each participant was instructed to name the object silently in the inner speech run and overtly in the overt speech run.

Data Analysis: Preprocessing and analysis were conducted using BrainVoyager QX 2.8 (Brain Innovation, Maastricht, Netherlands). First, individual functional data were analyzed in order to assess data quality (e.g. head motion) and the participants' engagement and ability to perform the proposed task. All participants successfully performed the task and were included in the analysis. Preprocessing of single-subject fMRI data included slice-time correction, realignment to the first image to compensate for head motion, and temporal high-pass filtering to remove low-frequency drifts. The anatomical images were co-registered to the functional volumes, and all images were normalized to Talairach coordinate space [24]. After preprocessing, in the first-level analysis of the functional data, a general linear model (GLM) analysis was used for each run. Predictors were modeled as a boxcar function with the length of each condition, convolved with the canonical hemodynamic response function (HRF). Six motion parameters (three
Fig. 1. Stimulation paradigm. Baseline: a fixation cross on which the participants were instructed to focus. Task (picture naming): a sequence of images presented visually for the participants to name silently (in the inner speech run) or overtly (in the overt speech run). Each image was presented on the screen for 3 s, with a total of 10 images per block.
translational and three rotational) and predictors based on spikes (outliers in the BOLD time course) were also included in the GLM as covariates.

At the group level, to map the most important brain regions involved in inner and overt speech, we used the contrast "task" > "baseline". First, we applied 3D spatial smoothing with a Gaussian filter of 6 mm. Taking into account the feasibility nature of our study, we performed a fixed-effects (FFX) analysis. To address the multiple comparisons problem, we applied False Discovery Rate (FDR) correction (considering a false discovery rate of 0.01).

We also aimed at comparing inner and overt speech mechanisms. To this end, we selected a set of regions of interest (ROIs) involved in the speech/word formation network (based on a literature review [5,6,12,15,19,23]). Each individual ROI was selected based on the corresponding anatomical landmarks and on the highest t-statistic voxel of the inner speech run statistical map (contrast "confrontation naming task" > "baseline"). Each ROI was defined as a volume with a maximum of 1000 voxels around the peak value (using the BrainVoyager QX interface tool to define ROIs). We then computed and compared the ROI-GLM t-statistics per ROI between inner and overt speech. We performed a two-sided Wilcoxon rank sum test (Matlab 2017a) to test the statistical significance of the difference between the results obtained for inner and overt speech in the naming task.
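For illustration, the sketch below reconstructs the core of such a first-level analysis in plain NumPy/SciPy: a boxcar task predictor built from the block design described above (nine 15 s rest blocks alternating with eight 30 s task blocks, TR = 3 s, 125 volumes), convolution with a canonical double-gamma HRF, and an ordinary least-squares fit per voxel. It is an illustrative approximation, not the BrainVoyager pipeline used in the study; the HRF parameters and the omission of motion and spike covariates are simplifying assumptions.

```python
import numpy as np
from scipy.stats import gamma

TR, N_VOL = 3.0, 125                       # repetition time (s), volumes per run

# Block design: rest(15 s) + 8 x [task(30 s) + rest(15 s)] = 375 s = 125 volumes
boxcar = np.zeros(N_VOL)
t = 15.0
for _ in range(8):
    on = int(t / TR)
    boxcar[on:on + int(30 / TR)] = 1.0
    t += 30.0 + 15.0

def double_gamma_hrf(tr, duration=32.0):
    """Canonical double-gamma HRF sampled at the TR (standard parameters assumed)."""
    x = np.arange(0, duration, tr)
    h = gamma.pdf(x, 6) - gamma.pdf(x, 16) / 6.0
    return h / h.sum()

task_reg = np.convolve(boxcar, double_gamma_hrf(TR))[:N_VOL]

def glm_tstat(voxel_ts, regressor):
    """OLS fit of one voxel time course on [task, intercept]; returns beta and t."""
    X = np.column_stack([regressor, np.ones_like(regressor)])
    beta, _, _, _ = np.linalg.lstsq(X, voxel_ts, rcond=None)
    resid = voxel_ts - X @ beta
    dof = len(voxel_ts) - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
    return beta[0], beta[0] / se
```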
3
Results
3.1
Whole Brain Analysis - Brain Map of the Naming Task
The FFX-GLM statistical map regarding the inner speech naming task (FFX, q(FDR) < 0.01), considering the contrast of interest "picture naming task" > "baseline" (Fig. 2a), revealed significant activations in the IFG and Middle Frontal Gyrus (MFG) (including Broca's area), Precentral Gyrus (pCG), SMA,
Middle Temporal Gyrus (MTG) (including Wernicke’s area), Intraparietal Sulcus (IPS), Occipital areas and Fusiform Gyrus (FG). Figure 2b presents the FFX-GLM statistical map from the overt speech naming task (FFX, q(FDR) < 0.01), considering the contrast of interest “picture naming task” > “baseline” in which it is possible to identify several brain regions such as the IFG (including Broca’s area), pCG, SMA, MTG (including Wernicke’s area), Occipital areas and FG.
Fig. 2. (a) FFX-GLM group activation map for the inner speech runs (q(FDR) < 0.01), showing areas with higher activation during the task relative to the baseline. (b) FFX-GLM group activation map for the overt speech runs (q(FDR) < 0.01), showing areas with higher activation during the task relative to the baseline. The regions in blue, in both maps, show the expected deactivation, particularly in the default mode network. (Color figure online)
3.2
Comparing Inner and Overt Speech - ROI-Based Analysis and the Speech Brain Network
One of the aims of the study was to compare inner and overt speech activation patterns. To this end, considering a literature review on speech-related brain networks, we identified a total of 16 ROIs (summarized in Table 1). In order to functionally define each ROI, we identified the relevant anatomical landmarks and selected a ROI around the highest t-statistic voxel considering the whole brain inner speech statistical map. Table 1 presents the coordinates of the center of gravity of each ROI (in Talairach coordinates) and the total number of voxels. The beta weights of the contrast “picture naming task” > “baseline” for each region and condition (ROI-GLM) were extracted per participant and run (these weights reflect the BOLD signal variation during the task condition relative to the baseline). To evaluate the statistical significance of the difference between inner and overt speech naming tasks, we performed a two-sided Wilcoxon rank sum test on the beta values for each ROI. The results are presented in Table 1.
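The per-ROI comparison can be reproduced with a few lines of SciPy, noting that Matlab's ranksum corresponds to the two-sided Mann-Whitney U test, which also yields the U statistics reported in Table 1. The beta values below are made-up placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-participant beta weights (contrast "task" > "baseline")
# for one ROI, one value per participant and run.
betas_inner = np.array([-0.02, 0.05, 0.11, -0.04, 0.08, 0.01])
betas_overt = np.array([0.30, 0.22, 0.41, 0.18, 0.35, 0.27])

# Two-sided rank-based comparison; returns the U statistic and the p-value.
u_stat, p_value = mannwhitneyu(betas_inner, betas_overt, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```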
Table 1. Inner vs Overt speech: Talairach coordinates and number of voxels per ROI. Wilcoxon statistical test results to evaluate main differences between inner and overt speech.

Region                            Num. voxels   x    y    z    Inner (median)  Overt (median)   U   p-value
Left middle temporal gyrus        738          -47  -46    9   -0.0244          0.2949           3   0.0556
Left frontal lobe (Broca's area)  1000         -48  146   11    0.0293          0.0908          12   >0.9999
Left putamen                      826          -21   -5   14    0.1152          0.2666           4   0.0952
Right putamen                     412           22    0   10    0.1875          0.3125          10   0.6905
Supplementary motor area          962           -3   -5   57    0.2725          0.3740           6   0.2222
Right middle temporal gyrus       414           47  -31    1    0.1699          0.6025           1   0.0159
Left M1 - Precentral gyrus        970          -43   -5   41    0.1699          0.5049           5   0.1508
Left intraparietal sulcus         979          -46  -41   41   -0.0137          0.0615           7   0.3095
Left inferior parietal lobule     989          -32  -59   45    0.2920          0.1875          15   0.6905
Right inferior parietal lobule    951           32  -58   44    0.2822         -0.1025          17   0.4206
Right M1 - Precentral gyrus       973           48   -4   45    0.3379          0.9150           2   0.0317
Left FG                           1000         -42  -50  -14    0.3125          0.4873           7   0.3095
Right FG                          1000          41  -60  -20    0.2783          1.5811           3   0.0556
Left inferior occipital gyrus     940          -26  -88   -7    1.4863          1.4590          10   0.6905
Right inferior occipital gyrus    941           27  -88   -8    1.6309          1.6074           9   0.5476
Right inferior frontal gyrus      525           48   36    5    0.0156          0.3828           6   0.2222
Our results show that overt speech elicits a stronger activation pattern. Statistically significant differences were found in the right MTG and the right pCG. Additionally, we computed the subtraction between the overt and inner speech activation maps (Fig. 3). The results suggest that the activation in most brain structures is higher for the overt speech task than for the inner speech task, consistent with the ROI-GLM results.
Fig. 3. Subtraction FFX-GLM group activation map between the overt and inner speech runs (q(FDR) < 0.01), showing areas with higher activation during the overt speech task relative to the inner speech task.
4
Discussion
In this study we sought to assess brain activity patterns during two speech tasks, one related to inner speech and the other to overt speech. One new finding that has not been reported in other studies was IPS activity. This finding can be explained by the involvement of the IPS in tasks related to working memory, attention, and attentional control by the left fronto-parietal network, which can be flexibly allocated to language processing as a function of task demands [2,7,11]. Another interesting finding is the activation of the FG, especially the visual word form area (VWFA), during both tasks. Although usually related to the processing of visually presented letter strings, words, pseudowords, and even nonword stimuli [3–5,23,25], the VWFA was active during the performance of speech tasks with image presentation (non-verbal material) in both conditions. This is supported by Cohen [3], who mentions the relation between the visual system and left-lateralized regions engaged in language processing, and by Stevens [22], who reports functional connectivity between the visual word form area and core regions of language processing. Bouhali [1] recently showed functional and anatomical connections between the visual word form area and most perisylvian language-related areas, including Broca's area.
The major task-related difference in the statistical analysis indicates more activation in the right precentral gyrus during the overt speech task. This is in concordance with results published in the literature, which assume that producing overt speech requires a strong motor response controlling all the elements involved in speech production, whereas inner speech, being less dependent on activating articulatory elements, should show lower pCG activation [8,17,18,21]. Another source of difference is the right middle temporal gyrus (rMTG), which shows stronger activation in overt speech when compared to inner speech. This finding remains controversial in the literature, where some authors reported that, in inner speech conditions, they found high activations in other MTG subregions [18]. These intriguing results can be explained by some limitations of our study. First, the small sample size of this exploratory study makes it more prone to be influenced by single individual results, in particular in an FFX analysis. There is also the possibility that distinct subregions in the MTG are modulated differently. Nevertheless, this first approach in Portuguese-speaking participants allows us to map the mechanisms involved in inner speech even without the use of verbal material (e.g. words and sentences). This proof of concept/pilot study may pave the way to further explore the mechanisms involved in inner speech when using verbal stimuli.
5
Conclusions
In this work we were able to map the inner speech related areas, which are in accordance with the literature, and to show a wider bilateral brain activation during overt speech when compared with inner speech, although these differences are dominated mainly by two regions (a part of the MTG and M1). Future research should focus on expanding the understanding of the neural correlates of inner and overt speech. In this sense, we believe that using a parametric difficulty level paradigm design (e.g. from vowel to sentence) may represent an important tool to evaluate major differences between the several areas engaged in inner speech as the difficulty of the task increases.
References
1. Bouhali, F., de Schotten, M.T., Pinel, P., Poupon, C., Mangin, J.F., Dehaene, S., Cohen, L.: Anatomical connections of the visual word form area. J. Neurosci. 34(46), 15402–15414 (2014)
2. Bray, S., Almas, R., Arnold, A.E., Iaria, G., MacQueen, G.: Intraparietal sulcus activity and functional connectivity supporting spatial working memory manipulation. Cereb. Cortex 25(5), 1252–1264 (2013)
3. Cohen, L., Dehaene, S., Naccache, L., Lehéricy, S., Dehaene-Lambertz, G., Hénaff, M.A., Michel, F.: The visual word form area: spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients. Brain 123(2), 291–307 (2000)
4. Cohen, L., Lehéricy, S., Chochon, F., Lemer, C., Rivaud, S., Dehaene, S.: Language-specific tuning of visual cortex? Functional properties of the visual word form area. Brain 125(5), 1054–1069 (2002)
5. Dehaene, S., Le Clec'H, G., Poline, J.B., Le Bihan, D., Cohen, L.: The visual word form area: a prelexical representation of visual words in the fusiform gyrus. Neuroreport 13(3), 321–325 (2002)
6. Geva, S., Jones, P.S., Crinion, J.T., Price, C.J., Baron, J.C., Warburton, E.A.: The neural correlates of inner speech defined by voxel-based lesion-symptom mapping. Brain 134(10), 3071–3082 (2011)
7. Grefkes, C., Fink, G.R.: The functional organization of the intraparietal sulcus in humans and monkeys. J. Anat. 207(1), 3–17 (2005)
8. Huang, J., Carr, T.H., Cao, Y.: Comparing cortical activations for silent and overt speech using event-related fMRI. Hum. Brain Mapp. 15(1), 39–53 (2002)
9. Jones, S.R., Fernyhough, C.: Neural correlates of inner speech and auditory verbal hallucinations: a critical review and theoretical integration. Clin. Psychol. Rev. 27(2), 140–154 (2007)
10. Logothetis, N.K.: What we can do and what we cannot do with fMRI. Nature 453(7197), 869 (2008)
11. Majerus, S.: Language repetition and short-term memory: an integrative framework. Front. Hum. Neurosci. 7, 357 (2013)
12. Marvel, C.L., Desmond, J.E.: From storage to manipulation: how the neural correlates of verbal working memory reflect varying demands on inner speech. Brain Lang. 120(1), 42–51 (2012)
13. Morin, A.: Inner speech. In: Hirstein, W. (ed.) Encyclopedia of Human Behavior, pp. 436–443. Elsevier, London (2012)
14. Morin, A., Hamper, B.: Self-reflection and the inner voice: activation of the left inferior frontal gyrus during perceptual and conceptual self-referential thinking. Open Neuroimaging J. 6, 78–89 (2012)
15. Morin, A., Michaud, J.: Self-awareness and the left inferior frontal gyrus: inner speech use during self-related processing. Brain Res. Bull. 74(6), 387–396 (2007)
16. Morin, A., Uttl, B., Hamper, B.: Self-reported frequency, content, and functions of inner speech. Procedia - Soc. Behav. Sci. 30, 1714–1718 (2011)
17. Palmer, E.D., Rosen, H.J., Ojemann, J.G., Buckner, R.L., Kelley, W.M., Petersen, S.E.: An event-related fMRI study of overt and covert word stem completion. Neuroimage 14(1), 182–193 (2001)
18. Perrone-Bertolotti, M., Rapin, L., Lachaux, J.P., Baciu, M., Loevenbruck, H.: What is that little voice inside my head? Inner speech phenomenology, its role in cognitive performance, and its relation to self-monitoring. Behav. Brain Res. 261, 220–239 (2014)
19. Shergill, S.S., Brammer, M.J., Fukuda, R., Bullmore, E., Amaro, E., Murray, R.M., McGuire, P.K.: Modulation of activity in temporal cortex during generation of inner speech. Hum. Brain Mapp. 16(4), 219–227 (2002)
20. Snodgrass, J.G., Vanderwart, M.: A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J. Exper. Psychol. Hum. Learn. Mem. 6(2), 174 (1980)
21. Stephan, F., Saalbach, H., Rossi, S.: How the brain plans inner and overt speech production: a combined EEG and fNIRS study. In: 23rd Annual Meeting of the Organization for Human Brain Mapping (OHBM), Vancouver, Canada (2017)
22. Stevens, W.D., Kravitz, D.J., Peng, C.S., Tessler, M.H., Martin, A.: Privileged functional connectivity between the visual word form area and the language system. J. Neurosci. 37(21), 5288–5297 (2017)
23. Tagamets, M.A., Novick, J.M., Chalmers, M.L., Friedman, R.B.: A parametric approach to orthographic processing in the brain: an fMRI study. J. Cogn. Neurosci. 12(2), 281–297 (2000)
24. Talairach, J., Tournoux, P.: Co-planar Stereotaxic Atlas of the Human Brain. Thieme, New York (1988)
25. Vigneau, M., Jobard, G., Mazoyer, B., Tzourio-Mazoyer, N.: Word and non-word reading: what role for the visual word form area? Neuroimage 27(3), 694–705 (2005)
26. Willinek, W.A., Schild, H.H.: Clinical advantages of 3.0 T MRI over 1.5 T. Eur. J. Radiol. 65(1), 2–14 (2008)
Semi-Supervised Acoustic Model Retraining for Medical ASR

Greg P. Finley1(B), Erik Edwards1, Wael Salloum1,2,3, Amanda Robinson1, Najmeh Sadoughi1, Nico Axtmann3, Maxim Korenevsky1, Michael Brenndoerfer2, Mark Miller1, and David Suendermann-Oeft1

1 EMR.AI Inc., San Francisco, CA, USA
[email protected]
2 University of California, Berkeley, Berkeley, CA, USA
3 DHBW, Karlsruhe, Germany
Abstract. Training models for speech recognition usually requires accurate word-level transcription of available speech data. For the domain of medical dictations, it is common to have “semi-literal” transcripts available: large numbers of speech files along with their associated formatted episode report, whose content only partially overlaps with the spoken content of the dictation. We present a semi-supervised method for generating acoustic training data by decoding dictations with an existing recognizer, confirming which sections are correct by using the associated report, and repurposing these audio sections for training a new acoustic model. The effectiveness of this method is demonstrated in two applications: first, to adapt a model to new speakers, resulting in a 19.7% reduction in relative word errors for these speakers; and second, to supplement an already diverse and robust acoustic model with a large quantity of additional data (from already known voices), leading to a 5.0% relative error reduction on a large test set of over one thousand speakers.
Keywords: Medical speech recognition · ASR · Medical dictation · Acoustic modeling

1
Introduction
Training automatic speech recognition (ASR) systems requires transcribed speech corpora to build acoustic models (AMs) and language models (LMs). Traditionally, such transcriptions are created by human labor, which imposes limitations on how large such corpora can be, how many speakers they can cover, how quickly they can be created, and how consistently transcriptions follow required guidelines. To overcome these limitations, techniques have been proposed to create transcriptions automatically, substantially increasing the size of the training corpus with relatively little effort. For example, Suendermann et al. perform speech recognition on millions of utterances collected in industrial spoken dialog systems and determine, based upon the recognizer's
confidence score, which of the hypotheses can be accepted without further review and which ones should undergo human quality assurance [8]. Such fully automatic techniques suffer from the disadvantage that they rely on pre-existing speech recognition models and settings and have no way to acquire new vocabulary or adapt to new domains. Thus, they suffer if there is a significant mismatch between the training and adaptation language. The medical transcription domain is a special case, however, in that speech recordings of clinical dictations are almost always subject to transcription into a formatted outpatient report, which contains a well-formatted and corrected version of the dictated matter. Note that the process of correcting and modifying a literal transcript into a report is an extensive one and often involves changes that make it impossible to use reports directly as ASR training data: intuiting punctuation, list numbering, etc. when formatting is not explicitly spoken; executing requests by the speaker (e.g., "scratch that"); or even inserting material from elsewhere in the patient's medical history. Strategies for using this very rich set of data for the purpose of model enhancement, and for overcoming its lack of word-level correspondence between spoken and written content, have been discussed in the literature for about two decades. Early research showed that this type of data can indeed be used to adapt a speaker-independent model to new speakers [5,9]; the basic approach is to use an ASR engine to decode new audio with matching reports, then use the results that can be verified as correct as new training data. However, these studies used very small test sets and speech recognition technology which is now widely considered dated. Consequently, the baseline performance is very poor by modern standards, and reported improvements often do not meet statistical significance thresholds. To increase the amount of usable data beyond only the correct outputs of the recognizer, researchers have also explored decoding with LMs built specifically for the report [5], or have explored the use of phonetic [7] and semantic [2,6] features to correct ASR errors using the report as reference. However, the latter studies either did not test how the accuracy of a speech recognizer is affected when adding the new data, or limited the study to LM adaptation. Outside of the medical domain, this type of semi-supervised approach has more recently been applied to parliamentary transcription, which is a similar case in that large amounts of semi-verbatim transcription data are available [3,4]. To our knowledge, however, no validation of these methods exists for building AMs for a modern, production-scale medical ASR system. In this paper we present such a validation for two applications: adapting a model to previously unseen speakers, and enhancing an already large model with additional data from known speakers.
2
Method
We applied semi-supervised methods to enhance the training corpus for AMs in two different experiments. Experiment 1 represents a case of speaker adaptation, using semi-supervised data for speakers unknown to the original acoustic model.
In Experiment 2, on the other hand, we test whether a model can be augmented by adding a large quantity of additional data from many known speakers. The general procedure for both experiments is the same; they differ only in the data sources used. Except where otherwise specified, the methodological details given below are identical for both experiments. For each experiment, we built two AMs and compared their performance in word error rate (WER) on a test set. AM1 was a "traditional" model, trained from fully manually transcribed dictations; AM2 contained all the data of AM1 plus a large set of "virtual transcriptions," generated by (1) ASR decoding of a large set of untranscribed data, then (2) identifying correct hypotheses by comparing with matching reports. The entire training and testing process, including all data and models, is described in detail in Sect. 2.2 and visualized in Fig. 1.
2.1
Data
The primary source of training data consists of manually transcribed dictations, as do all test sets for results reported in this paper. For Experiment 1, no speakers from Test are represented in Train; for Experiment 2, all speakers in Test have exactly one or two dictations in Train. (Recall that Experiment 1 tests the adaptation of a model to new speakers, and Experiment 2 tests the bolstering of an already comprehensive model with more data.) In addition, we have access to a large number of audio dictations with corresponding reports but no transcripts. This corpus constitutes the "Untranscribed" set for each experiment. See Table 1 for size statistics of all corpora.

Table 1. Summary of all dictations. Manual transcriptions are available for Train and Test, and reports for Untranscribed. AM1 was trained on Train and AM2 on Train+Aug.

Data set         # Speakers  # Utterances  # Hours
Experiment 1
  Train           245         6,857         305.0
  Test            26          32            3.5
  Untranscribed   458         12,207        652.8
  Augmentation    457         211,909       259.5
  Train+Aug.      702         218,766       564.5
Experiment 2
  Train           2,384       9,214         396.1
  Test            1,033       1,033         28.9
  Untranscribed   1,241       93,581        6646.5
  Augmentation    1,228       2,269,801     2617.1
  Train+Aug.      2,384       2,279,015     3013.2
In general, corpora used for Experiment 2 are much larger than for Experiment 1. The data also come from different providers, with different speakers, recording conditions, and report styles. Despite the methodological similarity between the two experiments, they should be considered entirely separate cases. Although manual transcriptions are generally considered to be the most accurate source of data for ASR training, medical speech is notoriously difficult due to a number of factors including specialized vocabulary, high rate of speech, etc. [1]. The medical transcriptionists who created Train and Test did so with the aid of matching reports, which themselves were generated through multiple rounds of transcription and quality assurance by other trained transcriptionists. Additionally, to estimate human WER when unaided by reports, we separately obtained three rounds of transcription on a set of 334 dictations: two rounds using reports as a reference, as is our normal procedure, and one "blind" round. These dictations did not overlap with any other data set. Note also that these reports were taken from the same provider as the data from Experiment 2, so any human WER results should only be considered relevant to Experiment 2.
2.2
Generating Additional AM Training Data
The entire Untranscribed set was decoded using our best prior acoustic model and a specially designed language model (AM1 and LM1, described below). Sequences of correctly recognized words in the hypotheses were identified by aligning hypotheses with reports using a dynamic programming algorithm. Any sequence consisting of five or more consecutive words matching perfectly between hypothesis and report was excised, alongside its matching audio range, and considered a training utterance in a large set of supplementary, semi-supervised training data, which we call the Augmentation set (see Table 1). We decided upon a five-word window based on an informal assessment of the excised clips; shorter windows exhibited more slight errors in word boundary detection, which we suspected would propagate in re-training. Our approach for generating training data is conservative in that we only allow perfect matches of substantial length between hypothesis and report. This ensures that the virtual transcriptions are as accurate as possible. Although we piloted some strategies for correcting hypotheses using reports, we have found that, for the quantities of data that we are considering, the perfect matches already provide very large training corpora by themselves.
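The sketch below illustrates this matching step. It uses difflib.SequenceMatcher as a simple stand-in for the dynamic-programming alignment mentioned above and assumes the decoder provides word-level time stamps; the function and variable names, the five-word wrapper, and the example strings are illustrative rather than the system's actual implementation.

```python
from difflib import SequenceMatcher

def verified_segments(hyp_words, hyp_times, report_words, min_len=5):
    """Return (words, start, end) for every run of at least `min_len`
    consecutive words shared by the hypothesis and the report.
    `hyp_times` holds one (start, end) pair per hypothesis word, as produced
    by the decoder's word-level alignment."""
    matcher = SequenceMatcher(a=hyp_words, b=report_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:
            words = hyp_words[block.a:block.a + block.size]
            start = hyp_times[block.a][0]
            end = hyp_times[block.a + block.size - 1][1]
            segments.append((words, start, end))
    return segments

# Example with made-up decoder output and report text.
hyp = "patient denies chest pain or shortness of breath at this time".split()
times = [(0.1 * i, 0.1 * i + 0.08) for i in range(len(hyp))]
rep = "the patient denies chest pain or shortness of breath today".split()
print(verified_segments(hyp, times, rep))
```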
2.3
Acoustic and Language Modeling
Our speech recognizer is based on a state-of-the-art stack with 40-dimensional MFCCs, deltas and delta-deltas, fMLLR, i-vectors, SAT, GMM-HMM pre-training, and a DNN acoustic model. Two n-gram LMs were used: a trigram model (LM1) for decoding the large Untranscribed set, and a 4-gram model (LM2) for the experimental results comparing AM1 and AM2. (LM1 is faster to decode with, whereas LM2 is more accurate, so LM1 was chosen for the massive Untranscribed set and LM2 to achieve the best possible results on Test.)
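For illustration only, a librosa-based sketch of the MFCC front end mentioned above is shown below (40 MFCCs with deltas and delta-deltas at a 10 ms frame shift). The real system presumably computes these inside its ASR toolkit together with the fMLLR and i-vector transforms, so the file name, sampling rate, and window settings here are assumptions.

```python
import librosa
import numpy as np

# Illustrative front end only: 40 MFCCs plus deltas and delta-deltas,
# giving a 120-dimensional frame-level feature before any transforms.
audio, sr = librosa.load("dictation.wav", sr=16000)      # hypothetical file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40,
                            n_fft=400, hop_length=160)   # 25 ms window, 10 ms hop
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2])                     # shape: (120, n_frames)
```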
To generate LM1, language models are first built for (1) the Train set and (2) the Train + Untranscribed sets; these two are then interpolated, with coefficients tuned to minimize perplexity on a held-out set, to yield the final model. The procedure for LM2 was the same, except that all n-gram counts of Untranscribed were decremented by one, effectively removing singletons and significantly accelerating decoding for an otherwise slow 4-gram model with minimal effect on WER. At no point did we use Augmentation data to train LMs. We suspected that doing so would bias the recognizer towards easy speech and very short utterances. (Note also that some version of the linguistic information from the Augmentation set is already present in the LM, which contains Untranscribed.) This bias is not a concern for AM training, where the currency of recognition is at the phonetic level, and transitional probabilities between words are less important.
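A minimal sketch of this interpolation step is given below, assuming the two component models are available as probability functions: a grid search picks the mixture weight that minimizes perplexity on held-out text. In practice this would be done with a standard LM toolkit; the function names and the grid are illustrative.

```python
import math

def interpolated_logprob(word, history, p_train, p_untr, lam):
    """Log-probability under a linear interpolation of two component models.
    p_train and p_untr are callables returning P(word | history) for the
    Train-only and Train+Untranscribed models (stand-ins for real n-gram LMs)."""
    p = lam * p_train(word, history) + (1.0 - lam) * p_untr(word, history)
    return math.log(max(p, 1e-12))

def perplexity(heldout, p_train, p_untr, lam):
    """Perplexity of the interpolated model on a list of token lists."""
    logprob, n = 0.0, 0
    for sentence in heldout:
        for i, word in enumerate(sentence):
            logprob += interpolated_logprob(word, tuple(sentence[:i]),
                                            p_train, p_untr, lam)
            n += 1
    return math.exp(-logprob / n)

def tune_lambda(heldout, p_train, p_untr, grid=None):
    """Pick the interpolation coefficient minimizing held-out perplexity."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return min(grid, key=lambda lam: perplexity(heldout, p_train, p_untr, lam))
```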
Fig. 1. Experimental training and decoding procedure. Rectangles represent audio, possibly transcribed; cylinders, reports; diamonds, models; ellipses, decoding results.
3
Results: Experiment 1
For Experiment 1, we compared WER on our test set between the baseline acoustic model (AM1) and the large expanded acoustic model (AM2). AM2 decreases the WER from AM1 by 19.7% relative, from 23.1% WER (5,377 edits out of 23,257 words) to 18.6% (4,317 edits), a statistically significant difference as determined by a test of equal proportions (χ2 = 146.2, p < .001). Out of 26 speakers, 22 exhibit a decline in WER—up to a 52.6% relative reduction in the most extreme case (from 105 errors down to 46 errors, out of 512 words). Of the 4 that see an increase, the highest is an 11.1% relative increase (72 errors up to 80, out of 436 words). The distribution across speakers of relative WER change is visualized in Fig. 2.
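The reported significance test can be reproduced, under the assumption that a chi-squared test on the 2x2 error/correct table is an acceptable form of the test of equal proportions, with a few lines of SciPy; the counts below are the Experiment 1 figures quoted above.

```python
from scipy.stats import chi2_contingency

total_words = 23257
errors_am1, errors_am2 = 5377, 4317

# 2x2 table: rows are models, columns are error vs. correctly recognized words.
table = [[errors_am1, total_words - errors_am1],
         [errors_am2, total_words - errors_am2]]
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")   # close to the reported chi2 = 146.2
```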
Fig. 2. Relative WER change by speaker, AM1 to AM2 (Experiment 1).
4
Results: Experiment 2
For Experiment 2, we also measured differences in WER between two acoustic models. Additionally, as the test set contains a much larger number of speakers compared to Experiment 1, we dive deeper into the by-speaker results. Note again that all models and corpora in this Experiment are different than those used in Experiment 1; mentions of ‘AM1’/‘AM2’ in this section refer now to the Experiment 2 versions of these.
Fig. 3. Relative WER change by speaker, AM1 to AM2 (Experiment 2).
4.1
Decoding Accuracy
Decoding with AM2 decreases the WER from AM1 by 5.0% relative, from 22.0% WER (52,961 edits out of 240,382 words) to 20.9% (50,332 edits). Though this effect is smaller than that demonstrated in Experiment 1, the difference is still statistically significant (χ2 = 85.2, p < .001). The decrease in error rate is far from uniform across all speakers, however: relative WER over each speaker decreases by as much as 56% and increases by as much as 75%. WER increases for 303 out of 1033 speakers. See Fig. 3 for the distribution of relative change for individual speakers.
4.2
Effect of Amount of Data Added
The extreme range of variation between speakers, and the fact that many speakers actually see a deterioration in performance, is a surprising finding that invites an explanation. Towards this end, a natural question is whether there is any relationship between the observed changes in WER and the amount of audio data added from the Augmentation set. Across all speakers, there is a correlation between relative change in WER and minutes of audio added, albeit a weak one (Kendall's τ = −.046, p = .026; the correlation is measured over ranks because the distribution of time added is strongly skewed, with a long right tail). This correlation measurement is only possible given the huge number of speakers in the Experiment 2 test set; no similar significant effect could be observed for Experiment 1. The relationship between time added and WER is visualized in Fig. 4.
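A hedged sketch of this correlation analysis is shown below: Kendall's tau between per-speaker minutes of added audio and relative WER change, plus the 10-minute binning used for Fig. 4. The arrays are small made-up examples, not the study's data.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-speaker values: minutes of Augmentation audio added and
# relative change in WER from AM1 to AM2 (negative = improvement).
minutes_added = np.array([0.0, 4.2, 12.5, 33.0, 71.8, 140.3, 260.0, 410.5])
rel_wer_change = np.array([0.09, 0.02, -0.03, -0.06, -0.01, -0.08, -0.05, -0.11])

tau, p = kendalltau(minutes_added, rel_wer_change)
print(f"tau = {tau:.3f}, p = {p:.3f}")

# Group speakers into 10-minute bins (inclusive on the low end), as in Fig. 4.
bin_ids = (minutes_added // 10).astype(int)
```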
Fig. 4. Binned speaker WERs by amount of audio for each speaker in Augmentation data (Experiment 2). AM1 WER is marked by the narrow end of the bar, AM2 WER by the wide end with a circle. Asterisks underneath bars denote statistical significance of WER change from AM1 to AM2 (α = .05, Bonferroni correction).
For this plot, speakers are grouped into bins according to the amount of audio data added, with each bin accounting for a 10-min range (inclusive on the low end only). The plot shows WER for AM1 (narrow end of the trapezoid) and AM2 (wide end) for each bin (thus, the trapezoid "points" in the direction of the change), calculated over all utterances in that bin. We performed a test of equal proportions for each bin, applying Bonferroni correction for multiple comparisons; those five bins with p < .05 are starred in the plot. (Note that the degree of change in a bin is not necessarily tied to statistical significance, as bins do not all contain the same number of speakers or spoken words.) These individual bins are rather small, so most do not show statistically significant changes; all those that do are for speakers with fewer than 220 min of speech added. Most interesting, however, is that the only bin to show a significant increase in WER using AM2 is the 0- to 10-minute bin. This increase is driven mostly by the 30 speakers (out of the bin's 44 total) who had no additional data added and saw an increase in WER of over 2% absolute, 8.5% relative (χ2 = 16.2, p < .001). These 30 speakers stand in stark contrast to the dataset as a whole, which shows a 1.1% decrease in absolute WER.
4.3
Human Word Error Rate
Given the nature of the data used in this work (recordings with at times extreme noise, non-native accent, audio compression artifacts, hesitations, etc.), and inspired by an earlier publication along these lines [1], we decided to study inter-rater consistency of the dataset by measuring the human error rate. Since our standard transcription procedure (Assisted condition) provides transcriptionists with the existing outpatient report of the dictation (which itself had undergone at least two tiers of transcription), we decided to conduct two types of human error rate experiments: (a) compare two transcriptions of the same audio
files created in the Assisted condition and (b) compare transcriptions created in the Assisted condition with those in the Unassisted condition. We expected (a) to exhibit a lower WER than (b) due to the existence of shared material. The inter-transcriber WERs are given in Table 2. In the Unassisted condition, transcribers differ from the Assisted conditions by 18.0% to 20.1%. From these results, it appears that WER on our data by a single transcriber without pre-generated reference material would approach 20%. Even when such material is available, however, there are notable disagreements or errors in transcription (9.3%), further emphasizing the difficulty of the speech in these dictations. Recall again that we commissioned these transcriptions only for the data used in Experiment 2; human WER for Experiment 1 may not be this high.

Table 2. Human WER between different sets of transcriptions. The "Assisted" conditions were done by professional transcriptionists using matched final reports as a reference, and "Unassisted" by transcriptionists without access to the reports.

Comparison             WER
Assisted1–Assisted2    9.3%
Assisted1–Unassisted   18.0%
Assisted2–Unassisted   20.1%
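For reference, WER between any two transcriptions (system vs. reference, or transcriber vs. transcriber as in Table 2) can be computed with a standard word-level Levenshtein alignment; the sketch below is a straightforward implementation, not the exact scoring tool used in this work.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("patient denies chest pain", "the patient denies pain"))
```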
5
Discussion
Our proposed method of providing guaranteed accurate data for AM retraining leads to models with lower average decoding error rates. For the purposes of adapting a model to previously unseen speakers, there was a major reduction in WER, eliminating nearly a fifth of all errors. When bolstering an already large model, the gains are somewhat more modest—especially so when considering that AM2 in Experiment 2 was trained on 7.6 times the amount of audio data as AM1. Our human WER measurements do suggest, however, that these dictations are especially difficult, and that we are already approaching human accuracy, so it may simply be the case that performance of the acoustic models has been "saturated" by this point. The more mixed results in Experiment 2, as well as the large and diverse test set used, invite some speculation as to how speakers may be affected differently vis-à-vis their WER by the data augmentation step. Despite the average drop in WER with AM2, performance did deteriorate in some instances. This was most evident for speakers for whom no data was added to the model. We suspect the cause is that the representation of these speakers in AM2 was diluted compared to their representation in the much smaller AM1. As a concrete recommendation, we would not suggest using an augmented acoustic model for speakers who had no data added, assuming they were already represented in the base model. Other than in this specific case, however, it was difficult to demonstrate any strong relationship between the amount of data added for a speaker and the
degree of recognition improvement. One explanation may be the presence of a confounding effect: speakers with higher AM1 WERs will naturally have less data in the Augmentation set. Because accurate recognition on Untranscribed is a prerequisite for finding utterances to add to Augmentation, speakers for whom the model already does well tend to have the most added data. Indeed, there is a moderately strong correlation (Kendall’s τ = −.20, p < .001) between AM1 WER and amount of data added per speaker; note that this correlation is visually unmistakable in the general downward trend on the left side of Fig. 4. Thus, speakers with the most added data tend to be those who already showed low WER before augmentation. These same speakers would have had less “room for improvement” from changes to the AM: indeed, those speakers with higher AM1 WER tended to have larger relative improvements than those with lower AM1 WER (τ = −.065, p = .002). Taken together, the effects of prior AM1 WER on WER change and on amount of data added may be obscuring some of the positive effects of having more added data. Further gains in performance may be possible via strategies described in the literature for using reports to correct ASR errors on the Untranscribed set, allowing speech previously missed by the recognizer to be used for training. While our methods are sufficient to produce a very large training set, it is likely that adding more difficult speech to training would improve recognition further. This would effectively be an automated active learning approach, using alignment with reports as a semi-supervised step. We also did not attempt to bolster LMs in the same way we did for AMs; however, fully corrected machine transcripts would make this possible to test also.
6
Conclusion
We presented and evaluated a semi-supervised method for augmenting a speaker-independent AM using large numbers of dictations with matching final reports. Our bolstered AMs achieve a significant reduction in error rates, inching closer to human error rates. The methods detailed here are especially effective as a means of adapting an AM to new speakers. By measuring performance on a large test set of over 1,000 speakers, we were able to note patterns in the procedure's effects. The amount of data added seems not to matter much, except that those speakers without any added acoustic data saw on average an increase in WER. This leads naturally to the conclusion that, whenever practical, different AMs should be used for different speakers depending on whether or not data from the target speaker was added in the augmentation stage. Future work will include expanding the approach to language modeling and applying more sophisticated techniques to select optimal models, e.g. using speaker clustering. We will also look deeper into the influence of the human error rate on ASR performance in both training and testing cycles and possible techniques to enhance inter-rater reliability for this difficult domain.
References
1. Edwards, E., et al.: Medical speech recognition: reaching parity with humans. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 512–524. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_51
2. Jancsary, J., Klein, A., Matiasek, J., Trost, H.: Semantics-based automatic literal reconstruction of dictations. In: Semantic Representation of Spoken Language, pp. 67–74 (2007)
3. Kawahara, T.: Transcription system using automatic speech recognition for the Japanese Parliament (Diet). In: IAAI (2012)
4. Kleynhans, N., De Wet, F.: Aligning audio samples from the South African parliament with Hansard transcriptions (2014)
5. Pakhomov, S., Schonwetter, M., Bachenko, J.: Generating training data for medical dictations. In: Proceedings of NAACL-HLT, pp. 1–8 (2001)
6. Petrik, S., et al.: Semantic and phonetic automatic reconstruction of medical dictations. Comput. Speech Lang. 25(2), 363–385 (2011)
7. Petrik, S., Kubin, G.: Reconstructing medical dictations from automatically recognized and non-literal transcripts with phonetic similarity matching. In: ICASSP, vol. 4, pp. IV-1125. IEEE (2007)
8. Suendermann, D., Liscombe, J., Pieraccini, R.: How to drink from a fire hose: one person can annoscribe 693 thousand utterances in one month. In: Proceedings of SIGdial, Tokyo, Japan (2010)
9. Wightman, C.W., Harder, T.A.: Semi-supervised adaptation of acoustic models for large-volume dictation. In: Proceedings of Eurospeech, pp. 1371–1374 (1999)
You Sound Like Your Counterpart: Interpersonal Speech Analysis

Jing Han1(B), Maximilian Schmitt1, and Björn Schuller1,2

1 ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
[email protected]
2 GLAM – Group on Language, Audio & Music, Imperial College London, London, UK
Abstract. In social interaction, people tend to mimic their conversational partners both when they agree and disagree. Research on this phenomenon is complex but not recent in theory, and related studies show that mimicry can enhance social relationships, increase affiliation and rapport. However, automatically recognising such a phenomenon is still in its early development. In this paper, we analyse mimicry in the speech domain and propose a novel method by using hand-crafted low-level acoustic descriptors and autoencoders (AEs). Specifically, for each conversation, two AEs are built, one for each speaker. After training, the acoustic features of one speaker are tested with the AE that is trained on the features of her counterpart. The proposed approach is evaluated on a database consisting of almost 400 subjects from 6 different cultures, recorded in-the-wild. By calculating the AE's reconstruction errors of all speakers and analysing the errors at different times in their interactions, we show that, albeit to different degrees from culture to culture, mimicry arises in most interactions.

Keywords: Affective computing · Conversation analysis · Computational paralinguistics

1
Introduction
Research in psychology has shown that people unconsciously mimic their counterpart in social interaction, which can be operationalised in varying ways, including mimicking posture, facial expressions, mannerisms, and other verbal and nonverbal expressions [5]. Moreover, the automatic detection of temporal mimicry behaviour can serve as a powerful indicator of social interaction, e.g., cooperativeness, courtship, empathy, rapport, and social judgement [12]. Previous works focus on automatically detecting mimicry behaviours particularly from head nods and smiles, i.e., visual cues [3,23]. In this work, we focus on the acoustic side, given that in social interaction, people mimic others not only through body language, but also in their speech. To the best of our knowledge, this is the first time that such behaviour is analysed from speech over
different cultures in empirical research, though previous works exist where similar topics have been studied in theory [4]. As related work on this specific topic is limited, we first utilised low-level descriptors (LLDs) such as log-energy and pitch and measured the similarities over each conversation turn, but hardly found any obvious trend in these descriptors. Thus, we propose an autoencoder-based framework to leverage the power of machine learning. Specifically, for each interaction, two autoencoders (AEs) are trained on speech from the two subjects A and B, respectively. Then, once the training procedure is done, the instances are exchanged and fed into the two autoencoders again, i.e., A is evaluated on the AE trained with data from B, while B is evaluated on the AE trained with data from A. This follows the hypothesis that, when a subject tends to behave similarly to her counterpart, the features reconstructed by the AE trained with her counterpart's data should have a decreasing error over time. In the following section, the related work is summarised both from a sociological and a technical perspective. In Sect. 3, we describe the data and acoustic features used in our research. In Sect. 4, we explain the experiments and present the results, before concluding in Sect. 5.
2
Related Work
Mimicry behaviour can be categorised into two different groups: emotional mimicry and motor mimicry [13]. While the first describes mimicry of the underlying affective state, such as happiness or sadness, the latter considers only the imitation of physical expressions, such as raising an eyebrow or nodding the head. As can be expected, motor mimicry is much easier to detect than emotional mimicry, given that physical expressions can be classified quite objectively by a human observer and also by automated tools. In the late 1970s, Friesen and Ekman proposed the 'Facial Action Coding System' [11] based on so-called facial action units (FAUs). FAUs describe 44 different activations of facial muscles, resulting in a certain facial expression, e.g., 'raising eye brow', 'wrinkling nose', or 'opening mouth'. However, several FAUs can be combined and be active at the same time. Ekman and Friesen have also shown that there is a strong relationship between FAUs and affective states [8] and that those relationships are largely universal, even though there are some differences between cultures [7]. FAUs and head movements can be robustly recognised with state-of-the-art tools, such as OpenFace [2]. Motor mimicry is a means of persuasion in human-to-human interaction, by conforming to the other's opinions and behaviour [13]. Humans are susceptible to mimicking behaviours through both the audio and the visual domain [16]. Although mimicry is found in interactions both when subjects disagree with each other and when they do not, there are more mimicry interactions where people agree [23]. Moreover, it has been shown that there is usually a tendency to adopt the gestures, postures, and behaviour of the chat partner over time during the conversation [5,6].
From the methodological point of view, for the automatic detection of behavioural mimicry, a temporal regression model has been proposed by Bilakhia et al. predicting audio-visual features of the chat partner using a deep recurrent neural network [3]. An ensemble of models has been trained for each class (mimicry/non-mimicry) and the ensemble providing the lowest reconstruction error determined the class. Mel-frequency cepstral coefficients have been employed as acoustic features and facial landmarks as visual features. Compared to motor mimicry, emotional mimicry has been studied much less. However, it has been found that the tendency to mimic others’ behaviour is much less valid from the emotional perspective [14]. The extent of emotional mimicry highly depends on the social context and emotional mimicry is not present if people do not like each other or each other’s opinion. Scissors et al. found the same analysing the linguistic behaviour [21]. They observed that in a text-based chat system, within-chat mimicry (i.e., repetition of words or phrases) is much higher in chats where subjects trusted each other than in chats with a low level of trust. Furthermore, it was found that linguistic mimicry has a positive effect on the outcome of negotiations [24].
3
Dataset and Features
Our experiments are based on the SEWA corpus of audio-visual interaction in-the-wild1. Hand-crafted acoustic features have been extracted on the frame-level from the audio of all chats.
3.1
SEWA Video Chat Dataset
In the SEWA database, 197 conversations have been recorded from subjects of six different cultures (Chinese, Hungarian, German, British, Serbian, and Greek). Table 1 summarises the number and total duration of conversations for each culture. The number of subjects is always twice the number of conversations. In these conversations, each lasting up to 3 min, a pair of subjects from the same culture discussed about an advertisement they just watched beforehand on a web platform. Figure 1 illustrates a screenshot of one dyadic conversation. The commercial seen beforehand was a 90 s long video clip advertising a (water) tap. All subjects were recorded in an ‘in-the-wild setting’, i.e., using the subjects’ personal desktop computers or notebooks and recording them either at their homes or in their offices. The chat partners always knew each other beforehand (either family, friends, or colleagues) and were balanced w.r.t. gender constellations (female-male, female-female, male-male). Subjects with an age ranging from 18 to older than 60 are included in the database. The dialogues had to be held in the native language of the chat partners, but there were no restrictions concerning the exact aspects to be discussed during their chat about the commercial. Conversations showed a large variety of emotions and levels of agreement/disagreement or rapport. The SEWA corpus has been used as the official benchmark database in the 2017 and 2018 Audio-Visual Emotion Challenges (AVEC) [17,18]. 1
https://sewaproject.eu/.
3.2
Acoustic Features
We use the established ComParE feature set of acoustic features [9]. For each audio recording, we capture the acoustic low-level descriptors (LLDs) with the openSMILE toolkit [10] at a step size of 10 ms. The ComParE LLDs extracted on frame-level have been introduced at the Interspeech 2013 Computational Paralinguistics Challenge (ComParE) [20]. However, the functionals defined in the feature set, i.e., the statistics summarising the LLDs on utterance level, are not applied in this work, as we are interested in the time-dependent information on frame-level. ComParE comprises 65 LLDs summarised in Table 2, covering spectral, cepstral, prosodic, and voice quality information, extracted from a frame with a size of 20 ms to 60 ms length. In addition, the first order derivatives (deltas) are computed, resulting in a frame-level feature vector of size 130 for each step of 10 ms.

Table 1. SEWA corpus: Number of conversations and subjects and total duration for each culture.

Index  Culture    # Conversations  # Subjects  Total duration [min]
C1     Chinese    35               70          101
C2     Hungarian  33               66          67
C3     German     32               64          89
C4     British    33               66          94
C5     Serbian    36               72          98
C6     Greek      28               56          81
Sum               197              394         530
Fig. 1. SEWA corpus: Screenshot taken from a sample video chat with one female and one male German subject.
Table 2. Interspeech 2013 Computational Paralinguistics Challenge (ComParE) feature set. Overview of 65 acoustic low-level descriptors (LLDs). RMS: Root-Mean-Square, RASTA: RelAtive SpecTral Amplitude, MFCC: Mel-Frequency-Cepstral Coefficients, SHS: Sub-Harmonic Summation.

4 energy related LLD                      Group
Loudness                                  Prosodic
Modulation loudness                       Prosodic
RMS energy, zero-crossing rate            Prosodic

55 spectral related LLD                   Group
RASTA auditory bands 1–26                 Spectral
MFCC 1–14                                 Cepstral
Spectral energy 250–650 Hz, 1–4 kHz       Spectral
Spectral roll-off pt. .25, .50, .75, .90  Spectral
Spectral flux, entropy, variance          Spectral
Spectral skewness and kurtosis            Spectral
Spectral slope                            Spectral
Spectral harmonicity                      Spectral
Spectral sharpness (auditory)             Spectral
Spectral centroid (linear)                Spectral

6 voicing related LLD                     Group
F0 via SHS                                Prosodic
Probability of voicing                    Voice quality
Jitter (local and delta)                  Voice quality
Shimmer                                   Voice quality
Log harmonics-to-noise ratio              Voice quality

4
Behaviour Similarity Tendency Analysis with Autoencoder
To analyse the interpersonal sentiment and investigate the temporal behaviour patterns from speech, we first standardised (zero mean and unit standard deviation) the 130 frame-level features within the same recordings to minimise the differences between different recording conditions. This procedure turned these LLDs into suitable ranges for use as the inputs and target outputs of an autoencoder (AE). Before training the AE, we first segmented the LLD sequences based on the transcriptions provided in the SEWA database, where information on the start and end of each speech segment and the subject ID of the corresponding segment is given. After that, the whole LLD sequences of each recording were divided into two sub-sequences, each including features from only one subject. Following the above-mentioned separation process, features from one subject were utilised to train an AE, and features from the other subject in the same
recording were fed into the trained AE for testing. Furthermore, once all features for testing have been reconstructed with the AE, we calculate the root-mean-squared errors (RMSEs) of the reconstructed features over time, and examine how and to what extent the RMSE varies over time. Consequently, for each recording, two AEs are learnt based on the two subjects involved in the recording, resulting in two one-dimensional RMSE sequences calculated between the input and the output feature sequences during the testing step.
4.1
Experimental Settings
The AE we applied is a 3-layer encoder with a 3-layer decoder. In preliminary experiments, the number of nodes in each layer was chosen as follows: 130-64-32-12-32-64-130, where the output dimension is exactly the same as the input dimension. During network training, the network weights were updated using a mean squared error loss and the Adagrad optimizer, and the training process was stopped after 512 epochs. Furthermore, to accelerate the training process, the network weights were updated after every batch of 256 LLD frames, allowing the computation to run in parallel. The training procedure was performed with Keras, a deep learning library for Python. After generating the reconstruction errors of the tested subject over time, the resulting sequence is used to fit a linear regression, under the assumption that the slope of the learnt line indicates changes in the behaviour patterns over time. More specifically, when the slope is negative, it may indicate that during the chat session the tested subject becomes more similar to the subject (s)he is talking to; if the slope is positive, it may imply the opposite. Additionally, the magnitude of the slope can denote the level of the similarity or dissimilarity.
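A minimal Keras sketch consistent with the configuration described above is given below. The layer sizes, loss, optimizer, epoch count, and batch size follow the text; the choice of activations, the random placeholder feature matrices, and the final slope computation are assumptions added for illustration.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_autoencoder(dim=130):
    # 130-64-32-12-32-64-130 architecture; activations are an assumption.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(dim,)),
        Dense(32, activation="relu"),
        Dense(12, activation="relu"),
        Dense(32, activation="relu"),
        Dense(64, activation="relu"),
        Dense(dim, activation="linear"),
    ])
    model.compile(optimizer="adagrad", loss="mse")
    return model

# Placeholder LLD matrices (frames x 130), one per speaker of a conversation.
lld_a = np.random.randn(6000, 130)
lld_b = np.random.randn(5500, 130)

ae_a = build_autoencoder()
ae_a.fit(lld_a, lld_a, epochs=512, batch_size=256, verbose=0)

# Test speaker B on the AE trained on speaker A; track frame-wise RMSE over time.
recon_b = ae_a.predict(lld_b)
rmse_b = np.sqrt(np.mean((recon_b - lld_b) ** 2, axis=1))

# Slope of a linear fit to the RMSE sequence; a negative slope is read as
# speaker B sounding increasingly like speaker A.
slope = np.polyfit(np.arange(len(rmse_b)), rmse_b, 1)[0]
```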
4.2
Results and Discussion
We first discuss the results achieved with the data from the first culture, Chinese (C1). Over all 35 recordings, the average slope of the RMSE sequences of all 70 subjects is −0.07. From Fig. 2, we notice that most of the slopes (54 of 70) are negative, whereas only a few (16 of 70) are positive. This indicates that, during the recordings, the acoustic LLD features of the tested subjects have a smaller reconstruction error as time passes. Considering that the AE is trained with the other subject within the same recording, a smaller reconstruction error may reveal a higher similarity between these two subjects. To sum up, a negative slope implies a decreasing reconstruction error over time and could indicate an increasing similarity between the speakers during the video chat. Interestingly, similar patterns have also been found in all the other five cultures. Nevertheless, the ratio of negative slopes and the average slope differ from culture to culture. Given these results, we calculated the average slopes of all cultures separately, as well as the Pearson correlation coefficients (PCCs) of the two slopes obtained from all recordings within the same culture, with the aim of assessing cultural variation in spontaneous remote conversations.
Fig. 2. Slope of RMSE sequences of 70 Chinese subjects from 35 recordings. In each recording, there are two subjects as denoted with blue and red bars, respectively. (Color figure online)
Results are given in Table 3.

Table 3. Average slope of the RMSE sequences of all subjects within each of the six cultures (upper row); the correlation coefficient, denoted as pcc of pairs, indicates the correlation between the behaviours of the two subjects of each pair and is listed in the lower row (C1: Chinese, C2: Hungarian, C3: German, C4: British, C5: Serbian, C6: Greek).

               C1     C2     C3     C4     C5     C6
Average slope  −0.07  −0.11  −0.10  −0.07  −0.08  −0.12
pcc of pairs   −0.03   0.34   0.15   0.39   0.39  −0.26

Note that a negative slope denotes that a subject's speech behaviour becomes more similar to that of the partner over the course of the conversation; the more a subject comes to sound like his or her partner, the more negative the slope. From Table 3, one may notice that, on average, individuals of all six cultures tend to behave more similarly during the conversation, given that the average slopes are all negative. However, cultural variation remains, as the most negative slope (−0.12) is obtained for the Greek (C6) culture and the smallest slope (−0.07) is seen for Chinese (C1) and British (C4). Moreover, taking the PCC into account, we may see the cultural variation from another view. A positive PCC value demonstrates that subjects of a culture tend to converge to a similar state, either both behaving like or unlike each other, while a negative PCC may indicate that conversations are more likely to be dominated by one subject. For example, no correlation has been seen in
5
Conclusion and Outlook
In this work, we have demonstrated that, an autoencoder has a great potential to recognise the spontaneous and unconscious mimicry in the social interaction, by the observation of the reconstruction error using the acoustic features extracted from the speech of a conversational partner. We have given some insights into the synchronisation of vocal behaviour in dyadic conversations of people from six different cultures. Future work will focus on optimised feature representations, such as bag-of-audio-words [19] or learnt features such as auDeep [1]. Moreover, we are going to exploit also the linguistic domain through state-of-the-art word embeddings, such as word2vec [15]. Lastly, other than the slope of the reconstruction errors, additional evaluation strategies to measure the degree of similarity of similarity between subjects will be explored in the future [6]. Acknowledgments. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under GA No. 645094 (Innovation Action SEWA) and through the EFPIA Innovative Medicines Initiative under GA No. 115902 (RADAR-CNS).
References
1. Amiriparian, S., Freitag, M., Cummins, N., Schuller, B.: Sequence to sequence autoencoders for unsupervised representation learning from audio. In: Proceedings of the DCASE 2017 Workshop, Munich, Germany (2017)
2. Baltrušaitis, T., Robinson, P., Morency, L.P.: OpenFace: an open source facial behavior analysis toolkit. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, pp. 1–10 (2016)
3. Bilakhia, S., Petridis, S., Pantic, M.: Audiovisual detection of behavioural mimicry. In: Proceedings of the Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), Geneva, Switzerland, pp. 123–128 (2013)
4. Burgoon, J.K., Hubbard, A.E.: Cross-cultural and intercultural applications of expectancy violations theory and interaction adaptation theory. In: Gudykunst, W.B. (ed.) Theorizing about Intercultural Communication, pp. 149–171. Sage Publications, Beverly Hills (2005) 5. Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception-behavior link and social interaction. J. Pers. Soc. Psychol. 76(6), 893–910 (1999) 6. Delaherche, E., Chetouani, M., Mahdhaoui, A., Saint-Georges, C., Viaux, S., Cohen, D.: Interpersonal synchrony: a survey of evaluation methods across disciplines. IEEE Trans. Affect. Comput. 3(3), 349–365 (2012) 7. Ekman, P.: Universals and cultural differences in facial expressions of emotion. In: Nebraska Symposium on Motivation. University of Nebraska Press (1971) 8. Ekman, P., Friesen, W.V.: Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Ishk, Ujjain (2003) 9. Eyben, F.: Real-Time Speech and Music Classification by Large Audio Feature Space Extraction. Springer, Switzerland (2016). https://doi.org/10.1007/978-3319-27299-3 10. Eyben, F., Weninger, F., Groß, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia (MM), Barcelona, Spain, pp. 835–838 (2013) 11. Friesen, E., Ekman, P.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto (1978) 12. Gueguen, N., Jacob, C., Martin, A.: Mimicry in social interaction: its effect on human judgment and behavior. Eur. J. Soc. Sci. 8(2), 253–259 (2009) 13. Hess, U., Fischer, A.: Emotional mimicry as social regulation. Pers. Soc. Psychol. Rev. 17(2), 142–157 (2013) 14. Hess, U., Fischer, A.: Emotional mimicry: why and when we mimic emotions. Soc. Pers. Psychol. Compass 8(2), 45–57 (2014) 15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, pp. 3111–3119 (2013) 16. Parrill, F., Kimbara, I.: Seeing and hearing double: the influence of mimicry in speech and gesture on observers. J. Nonverbal Behav. 30(4), 157 (2006) 17. Ringeval, F., et al.: AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition. In: Proceedings of the 8th Annual Workshop on Audio/Visual Emotion Challenge, Seoul, Korea (2018, to appear) 18. Ringeval, F., et al.: AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In: Proc. of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, pp. 3–9 (2017) 19. Schmitt, M., Ringeval, F., Schuller, B.: At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech. In: Proceedings of INTERSPEECH, San Francisco, CA, pp. 495–499 (2016) 20. Schuller, B., et al.: The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In: Proceedings of INTERSPEECH, Lyon, France, pp. 148–152 (2013) 21. Scissors, L.E., Gill, A.J., Gergle, D.: Linguistic mimicry and trust in text-based CMC. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work, San Diego, CA, pp. 277–280 (2008) 22. Stivers, T., et al.: Universals and cultural variation in turn-taking in conversation. Proc. Nat. Acad. Sci. U.S.A 106(26), 10587–10592 (2009)
23. Sun, X., Nijholt, A., Truong, K.P., Pantic, M.: Automatic visual mimicry expression analysis in interpersonal interaction. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Colorado Springs, CO, pp. 40–46 (2011) 24. Swaab, R.I., Maddux, W.W., Sinaceur, M.: Early words that work: When and how virtual linguistic mimicry facilitates negotiation outcomes. J. Exper. Soc. Psychol. 47(3), 616–621 (2011)
TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation
François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève
Ubiqus, Paris, France ([email protected]); LIUM, University of Le Mans, Le Mans, France ({sahar.ghannay,natalia.tomashenko,yannick.esteve}@univ-lemans.fr)
https://www.ubiqus.com - https://lium.univ-lemans.fr/
Abstract. In this paper, we present the TED-LIUM release 3 corpus (TED-LIUM 3 is available at https://lium.univ-lemans.fr/ted-lium3/) dedicated to speech recognition in English, which multiplies the data available to train acoustic models by more than a factor of two in comparison with TED-LIUM 2. We present recent developments of Automatic Speech Recognition (ASR) systems in comparison with the two previous releases of the TED-LIUM corpus from 2012 and 2014. We demonstrate that passing from 207 to 452 h of transcribed speech training data is considerably more useful for end-to-end ASR systems than for HMM-based state-of-the-art ones. This is the case even if the HMM-based ASR system still outperforms the end-to-end ASR system when the size of the audio training data is 452 h, with a Word Error Rate (WER) of 6.7% and 13.7%, respectively. Finally, we propose two repartitions of the TED-LIUM release 3 corpus: the legacy repartition that is the same as the one existing in release 2, and a new repartition, calibrated and designed for experiments on speaker adaptation. Like the first two releases, the TED-LIUM 3 corpus will be freely available for the research community.

Keywords: Speech recognition · Open-source corpus · Deep learning · Speaker adaptation · TED-LIUM

1 Introduction
In May 2012 and May 2014, the LIUM team released two versions (with 118 h and 207 h of audio, respectively) built from TED conference videos, which have since been widely used by the ASR community for research purposes. These corpora were called TED-LIUM release 1 and release 2, presented respectively in [10,11]. Ubiqus joined these efforts to pursue the improvements, both from an increased-data standpoint and from a technical-achievement one. We believe that
this corpus has become a reference and will continue to be used by the community to improve further the results. In this paper, we present some enhancements regarding the dataset, by using a new engine to realign the original data, leading to an increased amount of audio/text, and by adding new TED talks, which combined with the new alignment process, gives us 452 h of aligned audio. A new data distribution is also proposed that is more suitable for experimenting with speaker adaptation techniques, in addition to the legacy distribution already used on TED-LIUM release 1 and 2. Section 2 gives details about the new TED-LIUM 3 corpus. We present experimental results with different ASR architectures, by using Time Delay Neural Network (TDNN) [5] and Factored TDNN (TDNN-F) acoustic models [7] on the legacy distribution of TED-LIUM 3 in Sect. 3, and also exploring the use of a pure neural end-to-end system in Sect. 4. In Sect. 5, we report experimental results obtained on the speaker adaptation distribution by exploiting GMM-HMM and TDNN-Long Short-Term Memory (TDNN-LSTM) [6] acoustic models and two standard adaptation techniques (ivectors and feature space maximum linear regression (fMLLR)). The final section is dedicated to discussion and conclusion.
2 TED-LIUM 3 Corpus Description

2.1 Data, Alignment and Filtering
All raw data (acoustic signals and closed captions) were extracted from the TED website. For each talk, we built a sphere audio file, and its corresponding transcript in stm format. The text from each .stm file was automatically aligned to the corresponding .sph file using the Kaldi toolkit [8]. This consists of the adaptation of existing scripts1 , intended to first decode the audio files with a biased language model, and then align the obtained .ctm file with the reference transcript. To maximize the quality of alignments, we used our best model (at the time of corpus preparation) trained on the previous release of the TEDLIUM corpus. This model achieved a WER of 9.2% on both development and test sets without any rescoring. This means the ratio of aligned speech versus audio from the original 1,495 talks of releases 1 and 2 has changed, as well as the quantity of words retained. It increased the amount of usable data from the same basis files by around 40% (Table 1). In the previous release, aligned speech represented only around 58.9% of the total audio duration (351 h). With these new alignments, we now cover around 83.0% of audio. A first set of experiments was conducted to compare equivalent systems trained on the two sets of data (Table 2). With strictly equivalent models, there is no clear improvement of results for the proposed new alignments. Yet, there is no degradation of performance either. We will show in further experiments that the increased amount of data will not just be harmless, but also useful. 1
https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/ segment long utterances.sh.
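A small, hedged sketch of the kind of bookkeeping behind the reported alignment coverage (e.g. the 58.9% vs. 83.0% figures above): summing the aligned speech segments of an .stm transcript and relating them to the total audio duration. The field layout follows the standard STM convention; the exact filtering used for the corpus is not reproduced here.

```python
def stm_speech_coverage(stm_path, audio_duration_s):
    """Fraction of the audio duration covered by aligned speech segments in an
    STM file. STM lines have the form:
    <file> <channel> <speaker> <start> <end> [<label>] <transcript...>
    Overlapping segments are not merged here, which slightly overestimates
    coverage for overlapping annotations."""
    covered = 0.0
    with open(stm_path, encoding="utf-8") as stm:
        for line in stm:
            line = line.strip()
            if not line or line.startswith(";;"):   # skip comments and empty lines
                continue
            fields = line.split()
            start, end = float(fields[3]), float(fields[4])
            covered += max(0.0, end - start)
    return covered / audio_duration_s
```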
Table 1. Maximizing alignments - TED-LIUM release 2 talks.

Characteristic   Original alignments   New alignments   Evolution
Speech           207 h                 290 h            40.1%
Words            2.2M                  3.2M             43.1%
Table 2. Comparison of training on original and new alignments for TED-LIUM release 2 data (experiments conducted with the Kaldi toolkit - details in Sect. 3).

Model (rescoring)      Original - 207 h        New - 290 h
                       Dev       Test          Dev       Test
HMM-GMM (none)         19.0%     17.6%         18.7%     17.2%
HMM-GMM (Ngram)        17.8%     16.5%         17.7%     16.1%
HMM-TDNN-F (none)      8.5%      8.3%          8.2%      8.3%
HMM-TDNN-F (Ngram)     7.8%      7.8%          7.7%      7.9%
HMM-TDNN-F (RNN)       6.8%      6.8%          6.6%      6.7%

2.2 Corpus Distribution: Legacy Version
The whole corpus is released as what we call a legacy version, for which we keep the same development and test sets as the first releases. Table 3 summarizes the characteristics of the text and audio data of the new release of the TED-LIUM corpus. Statistics from the previous and new releases are presented, as well as the evolution between the two. Additionally, we mention that aligned speech (including some noises and silences) represents around 82.6% of the audio duration (540 h).

Table 3. TED-LIUM 3 corpus characteristics.

Characteristic              Corpus v2     Corpus v3     Evolution
Total duration              207 h         452 h         118.4%
- Male                      141 h         316 h         124.1%
- Female                    66 h          134 h         103.0%
Mean duration               10 min 12 s   11 min 30 s   12.7%
Number of unique speakers   1242          2028          63.3%
Number of talks             1495          2351          57.3%
Number of segments          92976         268231        188.5%
Number of words             2.2M          4.9M          122.7%
2.3 Corpus Distribution: Speaker Adaptation Version
Speaker adaptation of acoustic models (AMs) is an important mechanism to reduce the mismatch between the AMs and test data from a particular speaker, and today it is still a very active research area. In order to design a suitable corpus for exploring speaker adaptation algorithms, additional factors and dataset characteristics, such as the number of speakers, the amount of pure speech data per speaker, and others, should be taken into account. In this paper, we also propose and describe training, development and test datasets specially designed for the speaker adaptation task. These datasets are obtained from the proposed TED-LIUM 3 training corpus, but the development and test sets are more balanced and representative in their characteristics (number of speakers, gender, duration) than the original sets and more suitable for speaker adaptation experiments. In addition, for the development and test datasets we chose only speakers who are not present in the training data set in other talks. The statistics for the proposed data sets are given in Table 4.

Table 4. Data set statistics for the speaker adaptation task. Unlike the other tables, the duration is calculated only for pure speech (excluding silence, noise, etc.).

Characteristic                                         Train    Dev.    Test
Duration of speech, hours                Total         346.17   3.73    3.76
                                         Male          242.22   2.34    2.34
                                         Female        104.0    1.39    1.41
Duration of speech per speaker, minutes  Mean          10.7     14.0    14.1
                                         Min.          1.0      13.6    13.6
                                         Max.          25.6     14.4    14.5
Number of speakers                       Total         1938     16      16
                                         Male          1303     10      10
                                         Female        635      6       6
Number of words                          Total         4437K    47753   43931
Number of talks                          Total         2281     16      16
3 Experiments with State-of-the-Art HMM-Based ASR System
We conducted a first set of experiments on the TED-LIUM release 2 and 3 corpora using the Kaldi toolkit. These experiments were based on the existing recipe2 , mainly changing model configurations and rescoring strategies. We also kept the lexicon from the original release, containing 159,848 entries. For this, and all other experiments in this paper, no glm files were applied to deal with equivalences between word spelling (e.g. doctor vs. dr). 2
https://github.com/kaldi-asr/kaldi/tree/master/egs/tedlium/s5 r2.
3.1 Acoustic Models
All experiments were conducted using chain models [9] with the now well-known TDNN architecture [5] as well as the recent TDNN-F architecture [7]. Training audio samples were randomly perturbed in speed and volume during the training process. This approach is commonly called audio augmentation and is known to be beneficial for speech recognition [4].
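As an illustration of the speed and volume perturbation mentioned above, here is a minimal numpy sketch. It is not the Kaldi implementation (which applies fixed speed factors such as 0.9/1.0/1.1 and works on whole data directories); the perturbation ranges below are illustrative assumptions.

```python
import numpy as np

def augment(waveform, speed_range=(0.9, 1.1), gain_db_range=(-6.0, 6.0)):
    """Random tempo (speed) and volume perturbation of a mono waveform given
    as a 1-D numpy array. The speed change is approximated by resampling the
    signal with linear interpolation; the volume change is a random gain."""
    speed = np.random.uniform(*speed_range)
    gain_db = np.random.uniform(*gain_db_range)

    old_idx = np.arange(len(waveform))
    new_len = int(round(len(waveform) / speed))      # speed > 1 -> shorter signal
    new_idx = np.linspace(0, len(waveform) - 1, new_len)
    perturbed = np.interp(new_idx, old_idx, waveform)

    return perturbed * (10.0 ** (gain_db / 20.0))
```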
3.2 Language Model
Two approaches were used, both aiming at rescoring lattices. The first one is an N-gram model of order 4 trained with the pocolm toolkit3, which was pruned to 10 million N-grams. We also considered an RNNLM with letter-based features and importance sampling [15], coupled with a pruned approach to lattice rescoring [14]. The RNNLM we retained was a mixture of three TDNN layers with two interspersed LSTMP layers [12], containing around 10 million parameters. The latter helps to reduce the word error rate drastically. We used the same corpus and vocabulary in both methods, which are those released along with TED-LIUM release 2. These experiments were conducted prior to the full preparation of the new release, so we only appended text from the original alignments of release 2 to this corpus. In total, the textual corpus used to train language models contains approximately 255 million words. These source data are described in [11].
3.3 Experimental Results
In this section, we present recent developments of Automatic Speech Recognition (ASR) systems that can be compared with the two previous releases of the TED-LIUM corpus from 2012 and 2014. While the first version of the corpus achieved a WER of 17.4% at that time, the second version decreased it to 11.1% using additional data and Deep Neural Network (DNN) techniques. TDNN. Our baseline chain-TDNN setup is based on 6 layers with batch normalization and a total context of (−15, 12). Prior tuning experiments on TED-LIUM release 2 showed us that the model did not improve beyond a dimension of 450. More than doubling the training data allows the training of bigger, and better, models of the same architecture, as shown in Table 5. As part of experiments in tuning Kaldi models, it appeared that a form of L2 regularization could allow longer training with less risk of overfitting. This was implemented in Kaldi as the proportional-shrink option. Some tuning on TED-LIUM 2 data gave the best result for a value of 20. All experiments presented in Table 5 were realized with this value to keep a consistent baseline. Aiming to reduce the WER even more, and under time constraints, we chose to retrain the model with dimension 1024 and a proportional-shrink value of 10 (as we approximately doubled the size of the corpus). After RNNLM lattice rescoring, the WER decreased to 6.2% on the dev set and 6.7% on the test set. 3
https://github.com/danpovey/pocolm.
Table 5. Tuning the hidden dimension of the chain-TDNN setup on the TED-LIUM release 3 corpus.

Dimension   WER               WER - Ngram       WER - RNN
            Dev      Test     Dev      Test     Dev      Test
450         9.0%     9.1%     8.0%     8.4%     6.9%     7.3%
600         8.7%     8.9%     8.0%     8.4%     6.6%     7.3%
768         8.3%     8.6%     7.6%     8.1%     6.5%     7.0%
1024        8.3%     8.5%     7.5%     8.0%     6.4%     6.9%
TDNN-F. As a final set of experiments, we tried the recently-introduced factorized TDNN approach, which again resulted in significant improvements in WER for both the TED-LIUM release 2 and 3 corpora (Table 6).

Table 6. Factorized TDNN experiments on TED-LIUM release 2 and 3 corpora.

Corpus   Model                                  WER               WER - Ngram       WER - RNN
                                                Dev      Test     Dev      Test     Dev      Test
r2       TDNN-F - 11 layers 1280/256 - ps20     8.5%     8.3%     7.8%     7.8%     6.8%     6.8%
r3       TDNN-F - 11 layers 1280/256 - ps10     7.9%     8.1%     7.4%     7.7%     6.2%     6.7%
4 Experiments with Fully Neural End-to-End ASR System
We also conducted experiments to evaluate the impact of adding data to the training corpus in order to build a neural end-to-end ASR. The system with which we experimented does not use a vocabulary to produce words, since it emits sequences of characters.
4.1 Model Architecture
The fully end-to-end architecture used in this study is similar to the Deep Speech 2 neural ASR system proposed by Baidu in [1]. This architecture is composed of $n_c$ convolution layers (CNN), followed by $n_r$ uni- or bidirectional recurrent layers, a lookahead convolution layer [13], and one fully connected layer just before the softmax layer, as shown in Fig. 1. The system is trained end-to-end by using the CTC loss function [2], in order to predict a sequence of characters from the input audio. In our experiments, we used two CNN layers and six bidirectional recurrent layers with batch normalization as mentioned in [1]. Given an utterance $x^{(i)}$ and label $y^{(i)}$ sampled from a training set $X = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$,
the RNN architecture has to be trained to convert an input sequence $x^{(i)}$ into a final transcription $y^{(i)}$. For notational convenience, we drop the superscripts and use $x$ to denote a chosen utterance and $y$ the corresponding label. The RNN takes as input an utterance $x$ represented by a sequence of log-spectrograms of power-normalized audio clips, calculated on 20 ms windows. As output, all the characters $l$ of a language alphabet may be emitted, in addition to the space character used to segment character sequences into word sequences (space denotes word boundaries) and a blank character useful to absorb the difference in time-series length between input and output in the CTC framework. The RNN makes a prediction $p(l_t \mid x)$ at each output time step $t$. At test time, the CTC model can be coupled with a language model trained on a large textual corpus. A specialized beam search CTC decoder [3] is used to find the transcription $y$ that maximizes:

$$Q(y) = \log(p(y \mid x)) + \alpha \log(p_{LM}(y)) + \beta\, wc(y) \qquad (1)$$
where $wc(y)$ is the number of words in the transcription $y$. The weight $\alpha$ controls the relative contributions of the language model and the CTC network. The weight $\beta$ controls the number of words in the transcription.
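To make the architecture above concrete, the following is a minimal PyTorch sketch of a Deep Speech 2-like network trained with the CTC loss. It is not the deepspeech.pytorch implementation used in the paper; the kernel sizes, channel counts, alphabet size, and the use of GRU layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """Two 2-D convolutions over (frequency, time), a stack of bidirectional
    recurrent layers, and a linear layer producing per-frame character
    log-probabilities for CTC training."""

    def __init__(self, n_features=161, n_classes=29, rnn_hidden=512, n_rnn_layers=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        # frequency axis size after the two strided convolutions
        f1 = (n_features + 2 * 20 - 41) // 2 + 1
        f2 = (f1 + 2 * 10 - 21) // 2 + 1
        self.rnn = nn.GRU(32 * f2, rnn_hidden, num_layers=n_rnn_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, spect):                      # spect: (batch, 1, freq, time)
        x = self.conv(spect)                       # (batch, 32, freq', time')
        b, c, f, t = x.size()
        x = x.view(b, c * f, t).transpose(1, 2)    # (batch, time', features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)      # (batch, time', n_classes)

# Usage sketch with dummy data: spectrogram batch -> CTC loss against characters.
model = DeepSpeech2Like()
spect = torch.randn(4, 1, 161, 300)                # (batch, channel, freq bins, frames)
log_probs = model(spect)                           # (batch, T', n_classes)
targets = torch.randint(1, 29, (4, 50))            # dummy character indices (0 = blank)
input_lengths = torch.full((4,), log_probs.size(1), dtype=torch.long)
target_lengths = torch.full((4,), 50, dtype=torch.long)
loss = nn.CTCLoss(blank=0, zero_infinity=True)(
    log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```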
4.2 Experimental Results
Experiments were made on the legacy distribution of the TED-LIUM 3 corpus in order to evaluate the impact on WER of training data size for an end-to-end speech recognition system inspired by Deep Speech 2. In these experiments, we used an open source Pytorch implementation4 .
Fig. 1. Deep Speech 2 -like end-to-end architecture for speech recognition. 4
https://github.com/SeanNaren/deepspeech.pytorch.
Three training datasets were used: TED-LIUM 2 with original alignment (207 h of speech), TED-LIUM 2 with new alignment (290 h), and TED-LIUM 3 (452 h), as presented in Sects. 2.1 and 2.2. They correspond to the three possible abscissa values (207, 290, 452) in Fig. 2. For each training dataset, the ASR tuning and the evaluation were respectively made on the TED-LIUM release 2 development and test dataset, similar to the experiments presented in Sect. 3.3. Figure 2 presents results in both WER (left side), and Character Error Rate (CER, right side) on the test dataset. Evaluation in CER is interesting because the end-to-end ASR system is trained to produce sequences of characters, instead of sequences of words.
Fig. 2. Word error rate (left) and character error rate (right) on the TED-LIUM 3 legacy test data for three end-to-end configurations according to the training data size. (Color figure online)
For each training dataset, three configurations have been tested:
– the Greedy configuration, in blue in Fig. 2, which consists of evaluating sequences of characters directly emitted by the neural network by gluing all the characters together (including spaces to delimit words); a decoding sketch follows this list;
– the Greedy+augmentation configuration, in red, which is similar to the Greedy one, but in which each training audio sample is randomly perturbed in gain and tempo at each iteration [4];
– the Beam+augmentation configuration, in brown, achieved by applying a language model through beam search decoding on top of the neural network hypotheses of the Greedy+augmentation configuration. This language model is the cantab-TEDLIUM-pruned.lm3 provided with the Kaldi TEDLIUM recipe.
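The Greedy (best-path) decoding referred to in the first item can be sketched as follows; the toy alphabet is an illustrative assumption.

```python
import numpy as np

def ctc_greedy_decode(log_probs, alphabet, blank=0):
    """Best-path ("greedy") CTC decoding: take the arg-max symbol at every
    frame, merge consecutive repeats, then drop the blank symbol."""
    best = np.argmax(log_probs, axis=-1)      # (T,) frame-wise arg-max indices
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)

# Example with a toy alphabet; index 0 is the CTC blank.
alphabet = ["<blank>", " ", "a", "b", "c"]
frame_log_probs = np.log(np.random.dirichlet(np.ones(len(alphabet)), size=20))
print(ctc_greedy_decode(frame_log_probs, alphabet))
```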
As expected, the best results in WER and CER are achieved by the Beam+augmentation configuration, with a WER of 13.7% and a CER of 6.1%. Regardless of the configuration, increasing the training data size significantly improves the transcription quality: for instance, while the Greedy mode reached a WER of 28.1% with the original TED-LIUM 2 data, it reaches 20.3% with TED-LIUM 3. We can also observe that the Greedy+augmentation configuration trained on TED-LIUM 3 obtains a lower WER than the Beam+augmentation configuration trained on the original TED-LIUM 2 data. This shows that increasing the training data size for the pure end-to-end architecture offers a higher potential for WER reduction than using an external language model in a beam search decoding.
5 Experiments with the Speaker Adaptation Distribution
In this section, we present results of speaker adaptation experiments on the adaptation version of the corpus described in Sect. 2.3. In this series of experiments, we trained three pairs of AMs. In each pair, we trained a speaker-independent (SI) AM and a corresponding speaker adaptive trained (SAT) AM. We explore two standard adaptation techniques: (1) i-vectors for a TDNN-LSTM and (2) feature-space maximum likelihood linear regression (fMLLR) for a GMM-HMM and a TDNN-LSTM. The Kaldi toolkit [8] was used for these experiments. First, we trained two GMM-HMM AMs on 39-dimensional MFCC-39 features (13-dimensional Mel-frequency cepstral coefficients (MFCCs) with Δ and ΔΔ): (1) a SI AM and (2) a SAT model with fMLLR. Then, we trained four TDNN-LSTM AMs. All TDNN-LSTM AMs have the same topology, described in [6], and differ only in the input features. They were trained using the LF-MMI criterion [9] and a 3-fold reduced frame rate. For the first SI TDNN-LSTM AM, 40-dimensional MFCCs without cepstral truncation (hires MFCC-40) were used as the input to the neural network. For the corresponding SAT model, i-vectors were used (as in the standard Kaldi recipe). For the second SI TDNN-LSTM AM, MFCC-39 features (the same as for the GMM-HMM) were used, and the corresponding SAT model was trained using fMLLR adaptation. The 4-gram pruned LM was used for the evaluation5. Results in terms of WER are presented in Table 7.

Table 7. Speaker adaptation results on the speaker adaptation distribution described in Sect. 2.3 (MFCC-39 denotes 13-dimensional MFCCs appended with Δ and ΔΔ; hires MFCC-40 denotes 40-dimensional MFCCs without cepstral truncation).

Model            Features                    WER, % - Dev.   WER, % - Test
GMM SI           MFCC-39                     20.69           18.02
GMM SAT          MFCC-39 - fMLLR             16.47           15.08
TDNN-LSTM SI     hires MFCC-40               7.69            7.25
TDNN-LSTM SAT    hires MFCC-40 ⊕ i-vectors   7.12            7.10
TDNN-LSTM SI     MFCC-39                     8.19            7.54
TDNN-LSTM SAT    MFCC-39 - fMLLR             7.68            7.34
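As a reminder of what fMLLR amounts to at decoding time, the sketch below applies a per-speaker affine transform to the features. The estimation of the transform from adaptation data (EM over the model statistics) is not shown, and the [A | b] matrix layout is stated here as an assumption.

```python
import numpy as np

def apply_fmllr(features, transform):
    """Apply a per-speaker fMLLR transform to a (frames x dim) feature matrix.
    `transform` is a (dim x (dim + 1)) affine matrix [A | b], so every frame
    x is mapped to A @ x + b before being passed to the acoustic model."""
    A = transform[:, :-1]
    b = transform[:, -1]
    return features @ A.T + b

# Example: 100 frames of 39-dimensional MFCCs, identity transform (no change).
feats = np.random.randn(100, 39)
W = np.hstack([np.eye(39), np.zeros((39, 1))])
assert np.allclose(apply_fmllr(feats, W), feats)
```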
6 Discussion and Conclusion
In this paper, we proposed a new release of the TED-LIUM corpus, which doubles the quantity of audio with aligned text for acoustic model training. We showed that increasing this training data reduces the word error rate obtained by a state-of-the-art HMM-based ASR system only very slightly, passing from 6.8% (release 2) to 6.7% (release 3) on the legacy test data (and from 6.8% to 6.2% on the legacy dev data). To measure the recent advances realized in ASR technology, this word error rate can be compared to the 11.1% reached by such a state-of-the-art system in 2014 [10]. We were also interested in the emergent neural end-to-end ASR technology, known to be very data-hungry. We noticed that without external knowledge, i.e. by using only aligned audio from TED-LIUM 3, such technology reaches a WER of 17.4%, which is exactly the WER reached by state-of-the-art ASR technology in 2012 with the TED-LIUM 1 training data. Assisted by a classical 3-gram language model used in a beam search on top of the end-to-end architecture, this WER decreases to 13.7% with the TED-LIUM 3 training data, while with the TED-LIUM 2 training data the same system reached a WER of 20.3%. Increasing the amount of training audio with aligned text still seems very important for this kind of ASR architecture, in comparison to the HMM-based ASR architecture, which reaches a plateau on such TED data with a low WER of 6.7%. Finally, we propose a new data distribution dedicated to experiments on speaker adaptation, and report results that can be considered as a baseline for future work.

Acknowledgments. This work was partially funded by the French ANR Agency through the CHIST-ERA M2CR project, under the contract number ANR-15-CHR20006-01, and by the Google Digital News Innovation Fund through the news.bridge project.
References
1. Amodei, D., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin. In: International Conference on Machine Learning, pp. 173–182 (2016)
2. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006)
5
This LM is similar to the “small” LM trained with the pocolm toolkit, which is used in the Kaldi tedlium s5 r2 recipe. The only difference is that we modified a training set by adding text data from TED-LIUM 3 and removing from it those data, that present in our test and development sets (from the adaptation corpus).
3. Hannun, A.Y., Maas, A.L., Jurafsky, D., Ng, A.Y.: First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873 (2014) 4. Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for speech recognition. In: INTERSPEECH (2015) 5. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: INTERSPEECH (2015) 6. Peddinti, V., Wang, Y., Povey, D., Khudanpur, S.: Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 25(3), 373–377 (2018) 7. Povey, D., et al.: Semi-orthogonal low-rank matrix factorization for deep neural networks. In: INTERSPEECH (2018, submitted) 8. Povey, D., et al.: The Kaldi speech recognition toolkit. In: ASRU. IEEE Signal Processing Society, December 2011 9. Povey, D., et al.: Purely sequence-trained neural networks for ASR based on latticefree MMI. In: INTERSPEECH (2016) 10. Rousseau, A., Del´eglise, P., Est`eve, Y.: TED-LIUM: an automatic speech recognition dedicated corpus. In: LREC, pp. 125–129 (2012) 11. Rousseau, A., Del´eglise, P., Est`eve, Y.: Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In: LREC, pp. 3935– 3939 (2014) 12. Sak, H., Senior, A.W., Beaufays, F.: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: INTERSPEECH (2014) 13. Wang, C., Yogatama, D., Coates, A., Han, T., Hannun, A., Xiao, B.: Lookahead convolution layer for unidirectional recurrent neural networks. In: ICLR 2016 (2016) 14. Xu, H., et al.: A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition. In: ICASSP (2017) 15. Xu, H., et al.: Neural network language modeling with letter-based features and importance sampling. In: ICASSP (2017)
LipsID Using 3D Convolutional Neural Networks
Miroslav Hlaváč, Ivan Gruber, Miloš Železný, and Alexey Karpov
Faculty of Applied Sciences, Department of Cybernetics, UWB, Pilsen, Czech Republic; Faculty of Applied Sciences, NTIS, UWB, Pilsen, Czech Republic; ITMO University, St. Petersburg, Russia; SPIIRAS, St. Petersburg, Russia
{mhlavac,zelezny}@kky.zcu.cz, [email protected]
Abstract. This paper proposes a method, inspired by i-vectors, for improving visual speech recognition in a way similar to how i-vectors are used to improve the recognition rate of audio speech recognition. A neural network for feature extraction is presented together with its training parameters and evaluation. The network is trained as a classifier for a closed set of 64 speakers from the UWB-HSCAVC dataset, and the last softmax fully connected layer is then removed to obtain a feature vector of size 256. The network is provided with sequences of 15 frames and outputs a softmax classification over 64 classes. The training data consist of approximately 20000 sequences of grayscale images from the first 50 sentences, which are common to every speaker. The network is then evaluated on 60000 sequences created from the 150 remaining sentences of each speaker, which differ from speaker to speaker.
Keywords: Visual speech · Neural network · 3D convolution · Deep features

1 Introduction
The field of visual speech recognition lags behind the field of audio speech recognition in terms of recognition accuracy. Current methods [5,6] usually employ end-to-end systems based on neural networks. The networks use joint learning of audio and video inputs to gain as much information as possible and improve the recognition rate. The learning process is based on an analysis of video sequences employing either Long Short-Term Memory (LSTM) [8] or 3D convolutions [10] to learn the dynamic features of visual speech. Connectionist temporal classification [9] is then used as the output and loss function of the neural network [2].
These networks achieve accuracy of around 60% for the visual speech recognition on an open set of words. There is still a lot of space for improvement of these results by providing additional information in the training process. In the field of audio speech recognition, various methods are utilized to improve the results of the automatic speech recognition(ASR) algorithms. One of the methods named iVectors [11] originally developed for speaker identification proved to be a useful additional information source for adaptation of the audio speech recognition to different speakers. The problem of classification is well documented in neural networks and thus our idea is to adapt the iVectors method to provide additional information about the speaker to improve the recognition rate of visual speech. This paper is proposing a method for obtaining the deep features from the input sequences of visual speech by employing a neural network composed of 3D Convolutional layers trained for the task of classification of the speakers. We have named our method LipsID because it is based on speaker identification based on lips images and the networks are trained in the task of classification. The paper is organized as follows: in Sect. 2 we introduce used dataset and processing method; in Sect. 3 we describe the experiment, specify implementation details, show obtained results and compare them with a chosen baseline approach; and finally in Sect. 4 we make a conclusion and outline our future research.
2 Methods and Datasets
In this section, we present the data used for training our neural networks and also the background for sequence analysis with neural networks. The first part discusses the dataset we have used to create the training data for our experiments and describes exactly how they were created. The second part is focused on the tested sequence analysis approaches.

2.1 UWB-HSCAVC Dataset
The UWB-HSCAVC [7] dataset was created at the University of West Bohemia in Pilsen to provide a speech recognition dataset for the Czech language. It contains both visual and audio data of 100 different speakers (39 males, 61 females). It was recorded in a laboratory environment under controlled light conditions. Each speaker recorded 200 sentences, with 50 common to everyone, while the remaining 150 differ from speaker to speaker. A clapperboard was used as a synchronization mechanism for audio and video. The sentences were chosen with care to provide an equal representation of the phonemes occurring in the Czech language. The videos are recorded at a resolution of 720 × 576 with 25 fps. The dataset is also preprocessed by providing manual speech transcriptions, speaker head detection, and lip-corner position detection, and it provides skin texture samples for the regions of the nose and cheeks. A sample texture of both eyes of every speaker is also provided (Fig. 1).
Fig. 1. Recording conditions of the UWB-HSCAVC dataset [7].
2.2 Training Data
The training data were created from the UWB-HSCAVC dataset as follows. Only the data of 64 speakers were available. The videos were first processed with the Chehra [3] tracker to detect facial keypoints. Then the regions around the lip keypoints were extracted and processed to provide an image 40 × 60 pixels in size. The order of the frames in each sentence was preserved to provide a suitable source for the creation of visual speech sequences. We chose the length of a sequence to be 15 frames. This number was chosen based on the size of the data and the hardware available for training the networks. The sentences were then cut into sequences of the chosen length without overlap. The training dataset was created from the first fifty sentences, which are common to each speaker; this produced 20740 training sequences for 64 speakers. The remaining 150 sentences from each speaker were then used as testing data, producing a testing set of 61709 sequences. All of the images were converted to grayscale for the purpose of this work. An example of the training data is shown in Fig. 2. For the initial experiments, the frames were randomly shuffled and provided with one-hot labels per frame. Then, for sequence classification, the sequences were shuffled and provided with a single one-hot label for the whole sequence.
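A minimal sketch of the sequence-building step described above (frame extraction and lip tracking are assumed to have happened already; the label handling is an illustrative assumption):

```python
import numpy as np

def make_sequences(frames, speaker_id, seq_len=15):
    """Cut the ordered lip-region frames of one sentence (each a 40x60
    grayscale array) into non-overlapping sequences of `seq_len` frames,
    each labelled with the speaker identity."""
    sequences, labels = [], []
    for start in range(0, len(frames) - seq_len + 1, seq_len):
        sequences.append(np.stack(frames[start:start + seq_len]))
        labels.append(speaker_id)
    return np.asarray(sequences), np.asarray(labels)

# Example: an 80-frame sentence yields 5 sequences of 15 frames (5 frames unused).
frames = [np.zeros((40, 60), dtype=np.uint8) for _ in range(80)]
X, y = make_sequences(frames, speaker_id=3)
print(X.shape)   # (5, 15, 40, 60)
```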
2.3 Sequence Analysis
We have chosen two approaches to analyze the speaker data from the input sequences. At first, we tried to create a neural network based on LSTM [8], but we were unable to find a topology that gave good results; we would like to return to this problem in the future. After that, we created several testing topologies based on 3D convolutions [10], which after some adjustments provided the results discussed in the next section. These two approaches were mainly selected because current visual speech recognition systems also use this type of sequence analysis, which should make it easier to integrate our speaker adaptation into these systems.

Fig. 2. Example of the data used for training the networks.
3 Experiments
The experiments are composed of initial tests with single image classification on a closed set of speakers as our baseline approach, and of sequence classification using 3D convolutions, also on a closed set of speakers. The experiments were programmed in the Keras [4] neural network framework with a Tensorflow [1] backend in version 1.7.

3.1 Single Image Classification
The initial experiments were designed with 2D convolutions to test the recognition rate on the source dataset UWB-HSCAVC [7]. The training data were composed of 83327 images and the testing data of 245398 images. The experiment involved a VGG-like [12] CNN in the task of per-frame classification of the speakers included in the dataset. The neural network was trained for 15 epochs with mini-batch size 32 and an initial learning rate of 0.01. The recognition rate reached 99.1% on the test data. The topology is described in Table 1, where DO means dropout, FC means fully connected layer, and ReLU activation functions are applied if not specified otherwise. We used standard SGD as the optimizer with momentum = 0.9, weight decay = 1e−6, and categorical cross-entropy loss. The strides of the convolutional layers were set to one, while the strides of the maxpooling layers were set to two.

3.2 Sequence Classification
To further improve the recognition, we redesigned the topology with 3D convolutions [10]. The network takes sequences of frames as input and produces a single speaker classification for the whole sequence. The last but one fully connected layer produces a feature vector of size 256. This vector will serve as the LipsID features in our further experiments. The network was trained with the SGD optimizer (with the same parameters as in the single image classification) and the categorical cross-entropy loss. The neural network was again trained for 15 epochs with mini-batch size 32 and an initial learning rate of 0.01. The training finished with a 99.98% recognition rate on the training data and a 99.29% recognition rate on the test data. This is a significant improvement over the single image LipsID classification: we decrease the recognition error by 0.19%, which is a relative decrease of 21%. The stride of the 3D convolutions is set to one in every dimension and the stride of the 3D maxpooling is set to two. FC means fully connected layer (Table 2).

Table 1. LipsID - single image topology (each column is one block; blocks are applied left to right).

Block 1                Block 2                Block 3                Classifier
Conv2D(64, 3×3)        Conv2D(128, 3×3)       Conv2D(256, 3×3)       FC(4096)
Conv2D(64, 3×3)        Conv2D(128, 3×3)       Conv2D(256, 3×3)       Dropout(0.5)
Batch normalization    Batch normalization    Batch normalization    FC(4096)
Maxpooling(2×2)        Maxpooling(2×2)        Maxpooling(2×2)        FC(64, softmax)

Table 2. LipsID - sequences topology (each column is one block; blocks are applied left to right).

Block 1                    Block 2                    Block 3                     Classifier
Conv3D(32, 3×3×3, ReLU)    Conv3D(64, 3×3×3, ReLU)    Conv3D(128, 3×3×3, ReLU)    FC(256)
Conv3D(32, 3×3×3, ReLU)    Conv3D(64, 3×3×3, ReLU)    Conv3D(128, 3×3×3, ReLU)    FC(64, softmax)
Batch normalization        Batch normalization        Batch normalization
MaxPool3D(3×3×2)           MaxPool3D(3×3×2)
Conclusion and Future Work
This paper has presented a method for producing LipsID feature vector from sequences of visual speech. The method was tested on our own dataset UWBHSCAVC and produced good results in speaker classification based on lips only images. The sequence classification shows improvement over single frame classification. With the usage of sequence classification instead of single image classification, we reached relative decrease of recognition error by 21%. In the future, we would like to add training data from other datasets and also data captured in different light conditions. Then we will implement this method to existing visual speech recognition systems to assess the contribution of the LipsID features to visual speech recognition accuracy. We also would like to test LipsID detection with LSTM networks with which we hopefully reach similar or even better results. Acknowledgments. This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017. The work has been also supported by the grant of the University of West Bohemia, project No. SGS-2016-039. This work was supported by the Government of the Russian Federation (grant No. 08-08) and the Russian
214
M. Hlav´ aˇc et al.
Foundation for Basic Research (project No. 18-07-01407) too. Moreover, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
References 1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). http://tensorflow.org/. software available from tensorflow.org 2. Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: Lipnet: Sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016) 3. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Incremental face alignment in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1859–1866 (2014) 4. Chollet, F., et al.: Keras: Deep learning library for theano and tensorflow, vol. 7, p. 8 (2015). https://keras.io/k 5. Chung, J.S., Senior, A.W., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. CoRR abs/1611.05358 (2016). http://arxiv.org/abs/1611.05358 6. Chung, J., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (2016) ˇ 7. C´ısaˇr, P., Zelezn` y, M., Krˇ noul, Z., Kanis, J., Zelinka, J., M¨ uller, L.: Design and recording of czech speech corpus for audio-visual continuous speech recognition. In: Proceedings of the Auditory-Visual Speech Processing International Conference 2005 (2005) 8. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999) 9. Graves, A., Fern´ andez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006) 10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013) 11. Saon, G., Soltau, H., Nahamoo, D., Picheny, M.: Speaker adaptation of neural network acoustic models using i-vectors. In: ASRU, pp. 55–59 (2013) 12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
From Kratzenstein to the Soviet Vocoder: Some Results of a Historic Research Project in Speech Technology R¨ udiger Hoffmann(B) , Peter Birkholz, Falk Gabriel, and Rainer J¨ ackel Institut f¨ ur Akustik und Sprachkommunikation, Technische Universit¨ at Dresden, Dresden, Germany {ruediger.hoffmann,peter.birkholz,falk.gabriel, rainer.jaeckel}@tu-dresden.de
Abstract. This paper demonstrates by means of an example, how historic collections of universities can be utilized in modern research and teaching. The project refers to the Historic Acoustic-phonetic Collection (HAPS) of the TU Dresden. Two “guiding fossils” from the history of speech technology are selected to present a selection of results. Keywords: History of speech communication research Mechanical speech synthesis · Vocoder
1
Introduction
Experimental phonetics and speech technology show continuing interest in their own history. Prominent examples in the literature date back to PanconcelliCalzia [1], Dudley and Tarnoczy [2], and Ohala et al. [3], followed by numerous other papers and the foundation of the Special Interest Group on the History of Speech Communication Sciences of the ISCA and IPA in 2011. The literature is supported and complemented by collections of historic items not only in scientific museums, but also in the different historic collections of the universities. University collections are scientifically important, but endangered, because they are no “real” museums. The best way to take care of a collection is to include it in the processes of research and teaching at the university. It was the aim of a call for proposals of the German Federal Ministry of Education and Research (BMBF) in 2015, to support the collections in this sometimes difficult process [4]. The TU Dresden was successful with the proposal “Sprechmaschine” (speaking machine), which aimed to investigate the exhibits on the history of speech synthesis in their Historic Acoustic-phonetic Collection (HAPS). Five research groups from the TU Dresden and one of the State Art Collections Dresden (as external partner) cooperate in the project. In this paper, we merely present two partial aspects of the ongoing research: the study of Kratzenstein’s “vowel organ” as the starting point of the mechanical speech synthesis (Sect. 3), and the investigation of the history of the vocoder as guiding fossil of the electronic speech synthesis (Sect. 4). c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 215–225, 2018. https://doi.org/10.1007/978-3-319-99579-3_23
216
2
R. Hoffmann et al.
History of the HAPS Collection
Research in electronic speech processing started at the TU Dresden with the development of a vocoder in the 1950s. Walter Tscheschner (1927–2004) started the work in speech synthesis and recognition, which continues until today. Many devices, which were developed during this long time span, were preserved and form the core of the historic collection. There was a close cooperation with the Institute of Phonetics of the Humboldt University in Berlin, which had its origin in the laboratory of the renowned speech therapists Hermann Gutzmann and Franz Wethlo. Dieter Mehnert, the last phonetician on this chair, collected numerous items which demonstrated the history of experimental phonetics in Berlin and at other places. When the institute was closed in 1996, this collection was transferred to Dresden. In this way, the development of experimental phonetics as well as electronic speech technology could be demonstrated as a whole. The fusion of the collections was completed in 1999, when the name HAPS was introduced. The most important place in the development of experimental phonetics in Germany was the Phonetic Laboratory of Giulio Panconcelli-Calzia (1878– 1966) in Hamburg, founded 1910 at the “Colonial Institute”, since 1919 at the Hamburg University. When the successional Institute of Phonetics was closed in 2006, the very important collection of devices from the era of Panconcelli-Calzia in Hamburg was united with the HAPS in Dresden, which is a really important special collection since that time [5]. The exploitation of the HAPS started with cataloging the exhibits from the field of experimental phonetics [6]. A second catalogue volume is planned with the title “Historic devices of speech acoustics”. The BMBF project “Sprechmaschine” (Speaking machine) requires the development of those parts of this catalogue, which are focused on the exhibits from the field of synthetic speech. The history of experimental phonetics starts at the end of the 19th century with the development of the colonial system. However, there are predecessors like the automata constructors of the late Baroque (Kratzenstein, Kempelen, Mical) and the great physiologists of the 19th century (M¨ uller, Ludwig, Helmholtz). Of course, the HAPS is not able to demonstrate original items from these periods, but there are some useful and rare replicas. In the following section, we will focus on one of them: Kratzenstein’s vowel organ.
3 3.1
Kratzenstein’s Vowel Organ – Guiding Fossil of Mechanical Speech Synthesis Kratzenstein’s Revival
Christian Gottlieb Kratzenstein (1723–1795) was the first, who experimentally demonstrated the source-filter theory of speech production. In his “vowel organ”, which he presented at the occasion of a contest of the Imperial Academy of Sciences in St. Petersburg in 1780, he applied a reed pipe as source and different
From Kratzenstein to the Soviet vocoder
217
Fig. 1. Left: Replicas of Kratzenstein’s vowel resonators, designed and manufactured by C. Korpiun. – Right: Replicas of the vowel resonators from Chiba und Kajiyama, c TU Dresden, HAPS. designed and manufactured by T. Arai. Photographs
resonators for the basic vowels as filters. Because the shape of the resonators was found in an empirical way, later scientists did not value his invention. This situation lasted until 2006 (!), when the German linguist Christian Korpiun (1948–2017) proved, that there is enough information in the work of Kratzenstein to make real replicas of his resonators (Fig. 1 left). The replicas are now in the HAPS as a gift of C. Korpiun. Furthermore, Korpiun published a commented German translation of Kratzenstein’s Tentamen [7], which will be complemented by an English version as soon as possible. In the succession of Kratzenstein’s source-filter idea, several improvements have been developed: – With regard to the source, the reed pipes have for example been replaced by arrangements similar to the vocal cords. The most successful attempt was the cushion pipe of Wethlo [8], published in 1913, which is contained as an original in the HAPS. – With regard to the filter, the empirically defined shapes of the vowel resonators have, for example, been replaced by straight tube models, which represent the human articulation tract more precisely. The measurement-based models of Chiba und Kajiyama [9] from 1941 formed the starting point of the contemporary acoustic phonetics. The HAPS owns a replica of these resonators as a gift of T. Arai (Fig. 1 right) [10]. Today, the models for the source and for the vocal tract can be improved even further by new measurement methods and/or new materials. The following subsections briefly sketch, how this is performed in the framework of the project “Sprechmaschine”. 3.2
Vocal Fold Models Using Modern Materials
One goal of the project “Sprechmaschine” is the development of synthetic physical vocal fold models with characteristics as similar as possible to human vocal folds. Human vocal folds have a layered structure: The outermost layer is a thin
218
R. Hoffmann et al.
Fig. 2. (a) 3D view of a vocal fold and its casing; (b) schematic view of the layered structure in the coronal plane; (c) screwed casing and its negative; (d) oblique view of a finished vocal fold model.
skin with a thickness of 0.05–0.1 mm (the epithelum), and the innermost layer is the vocalis muscle. Between the epithelium and the muscle is the lamina propria, a soft, water-like system of nonmuscular tissue. The challenge in the creation of synthetic vocal folds is the reproduction of this layered structure with appropriate materials such that the oscillations of the synthetic vocal folds become similar to those of real vocal folds. In our ongoing study, we use two-composite silicon with different amounts of added silicone oil to recreate the different physical properties of the three layers [11,12]. Figure 2a and b show the general layered geometry of our vocal fold models. The outer shape is based on the geometry by Scherer [13]. The production of the vocal folds is based on 3D-printed casings and moulds (Fig. 2c and d), somewhat similar to [14]. Recently, we investigated the acoustics and vibration patterns of different models to examine the dependencies between the behavior and the geometrical and mechanical properties of the vocal folds. To this end, the vocal fold thickness, the angle of the conus elasticus, and the stiffness of the vocalis muscle were varied. As an example, a “soft” and “hard” vocalis muscle was used and the thickness was varied between 2 mm (T2), 3 mm (T3), and 4 mm (T4) at a constant 40◦ angle of the conus elasticus, resulting in six different synthetic models (two stiffness values × 3 thickness values). There exist some characteristic parameters of the glottal area function (glottal area as a function of time during an oscillation cycle) to describe the vocal fold vibration pattern; cf. [15]. The glottal area function was measured by a high-speed camera during the vibration of the models. The parameters of the glottal area function were extracted and examined using appropriate software tools. As an example, Fig. 3a shows the maximum of the glottal area during an oscillation cycle as a function of the subglottal pressure. A typical acoustical property to make assertions of the voice is the difference between the first and second harmonic in the spectrum of the source signal; cf. [16]. A microphone in a distance of 20 cm a little diagonal to the vocal fold models measured the pressure variations and so the voice of the models. The harmonics were extracted with the tool “VoiceSauce”, and H1-H2 as a function of subglottal pressure is shown in Fig. 3b for the six models. Eventually, measurements like the examples in Fig. 3 need to be compared to real phonation to assess the suitability of certain synthetic model geometries and materials.
From Kratzenstein to the Soviet vocoder
219
Fig. 3. (a) Maximum glottal area of six different vocal fold models. – (b) Difference between the first and second harmonic of the voice spectra of the models.
3.3
Towards a Database of Physical Vocal Tract Models with Realistic Geometries
With recent advances of Magnetic Resonance Imaging (MRI), 3D scanning, and 3D-printing technology, it is now possible to create (static) physical models of the vocal tract with very realistic geometries. Such models have gained increasing interest as research tools in speech science and can be created as follows (also see Fig. 4): First, MRI is used to capture the complete 3D shape of the vocal tract of the speech sound(s) of interest in high detail. Because the scanning takes a few seconds per sound, only sustainable sounds can be captured in 3D (e.g. [17]). Furthermore, because teeth are not visible in MRI data, plaster models of the subject’s teeth must be made and scanned using a 3D scanner. The wireframe models of the teeth are then merged with the MRI data ([18]). Based on the merged data set, the vocal tract is segmented in terms of a triangle mesh that represents the inner vocal tract walls, using freely available software tools (e.g., ITK-SNAP [19]). This surface mesh is then extruded to obtain a vocal tract model that has a certain wall thickness and can be printed as a physical 3D object (Fig. 4). Compared to the vocal tracts of living humans, the 3D-printed counterparts have the main advantage that both their acoustic and aerodynamic properties can be precisely measured. For example, a method to measure the volume velocity transfer function between the glottis and the lips for such models was recently presented by Fleischer et al. [18]. The 3D-printed vocal tract models can be used for research in multiple ways: – The physical models, along with their measured transfer functions, can be used to validate computational models that simulate vocal tract acoustics in one, two, or three dimensions. For example, for a one-dimensional acoustic simulation based on a 2D or 3D vocal tract shape, the vocal tract area function needs to be estimated. Multiple methods have been proposed for this purpose, e.g. [20–22]. However, so far it is not clear, which of these methods generates
220
R. Hoffmann et al.
Fig. 4. Processing steps to obtain a 3D-printable physical model of the vocal tract from volumetric MRI data and 3D scans of plaster models of the upper and lower teeth.
–
–
Given this range of applications, we are currently preparing a database that contains the detailed MRI-based 3D geometries of the vocal tract (with inserted teeth) for 22 sustained German speech sounds, uttered by one male and one female speaker each. All of these 44 vocal tract shapes are printed using a 3D printer (type Ultimaker 3). For each vocal tract shape, the volume velocity transfer function is measured according to the method by Fleischer et al. [18], and the
radiated noise spectra are determined for a range of stationary airflows injected through the glottis. The transfer functions and the noise spectra will be provided in the database along with the corresponding 3D geometries and the files needed for 3D-printing the models.
4 The Vocoder – Guiding Fossil of Electronic Speech Synthesis

4.1 Problems of the History of the Early Vocoders
The vocoder was invented for bandwidth reduction in voice transmission at a time when its implementation was still very complex, so that its use was limited to a few cases. However, it has provided many new insights into the analysis and synthesis of the speech signal, making it the most important guiding fossil of electronic speech technology today. For this reason, the vocoder also plays an important role in our study on the history of speech synthesis [26]. Since the vocoder was also used in security-relevant applications, there are still gaps in the presentation of its history. Even with regard to Germany, where the first patent for an apparatus similar to the later American vocoder was granted, these gaps are only partially closed [27,28]. This finding applies in particular to the development in the Soviet Union, which outside the Russian-language literature has so far been perceived mainly through its fictional treatment in the novel "In the First Circle" (1968, uncut edition 1978 [29]) by Aleksandr I. Solzhenitsyn (1918–2008). After studying mathematics and physics, Solzhenitsyn was drafted into war service and from 1943 served as the commander of a sound-ranging battery. In 1945, he was sentenced to eight years in a detention camp for criticizing Stalin. He spent the period from 1948 to 1950 at a secret telecommunications institute in Marfino near Moscow. He described this time in his aforementioned novel, which also contains some details about the work carried out in Marfino on speech analysis and speech coding. This description served as the main source of information for the statements on the history of the Soviet vocoder in the monographs on the development of speech technology by Schroeder [30] and on the history of the vocoder by Tompkins [31]. After the end of the Cold War, the accessibility of many documents in the former Soviet Union improved, and some of the scientists involved have published their memoirs. That this material is still hardly known is probably due to the language barrier. We have therefore set ourselves the task of gaining a better overview in the context of a literature study, and we report here on the status achieved so far. Most important were the biographic notes on Kotel'nikov [32] and the history of the Marfino laboratory by one of its leading engineers, Kalachev [33], which in turn led us to numerous papers in Soviet journals of that time.
4.2 A Literature Review on the Soviet Vocoder
For our project, we studied a large number of Russian documents that are little known outside the former Soviet Union. A snapshot of this work was published in [34] and may be summarized in the following theses:
– The famous mathematician Vladimir A. Kotel'nikov (1908–2005), who first formulated the sampling theorem in an engineering context at the age of just 25, worked on various telecommunications projects, for which he developed solutions to the associated encryption tasks. The work on the encryption apparatus Sobol-P led him to a parametric speech coding approach analogous to the vocoder. In his memoirs, Kotel'nikov notes that in late 1940 he learned of the article by H. Dudley on the vocoder, which confirmed his approach. At the beginning of 1941, the first vocoder in the USSR began to work in his laboratory [32].
– At the same time, the acoustician Lev L. Myasnikov (1905–1972) worked in Leningrad. He is considered the inventor of the first "objective" recognition of speech sounds in 1937 [35]. He completed his habilitation on technical phonetics in 1942. A patent filed in 1940 describes a parallel filter bank of the kind that is also suitable for the analysis part of a vocoder.
– From 1943, a working group of the Ministry of State Security (MGB) under the direction of Andrey P. Peterson (*1915) dealt with the improvement of encryption technology. Following a memorandum, the above-mentioned laboratory in Marfino was founded in 1948; the most important specialists in telecommunications and cryptology were to be brought together in one facility. From the beginning, vocoder technology played the most important role. A first version of the vocoder-based encryption system M-803 was tested on the communication line Moscow–Kiev in November 1949, but with insufficient signal quality. As a remedy, A. P. Peterson proposed a new approach that integrated the concept of "clipped speech" and that of the vocoder. In April 1950, the improved speech encryption system M-803 was approved by an evaluation committee that included Kotel'nikov [33].
– The development of the Marfino vocoder yielded three remarkable results:
• The modification of the vocoder in which a part of the speech signal was left in the time domain, while the signal energy in the frequency bands was transmitted parametrically, was later known as the semi-vocoder or voice-excited vocoder [36], which accordingly was invented in Marfino.
• For the further improvement of the voice quality, several suggestions were examined in 1950/51, among them the variant M-803M by Anton M. Vassilyev (1899–1965). If we interpret his proposal correctly, the principle of the formant vocoder was suggested here in a form similar to that described by Munson and Montgomery in 1950 [37].
• Part of the assessment of the transmission system is that it was probably the world's first system for the digital transmission of encrypted vocoder signals.
– From the mid-1950s, open publications on speech compression and vocoder applications appeared, e.g. the remarkable textbook [38].
5 Conclusion
This paper describes selected results from the authors' work in the project "Sprechmaschine". Finally, it should be mentioned that other project groups (from linguistics, design, and computer science) are working on additional parts of the project, which will result in a "virtual collection" of typical instruments and devices from the history of synthetic speech.
Acknowledgments. Supported by the German Federal Ministry of Education and Research (BMBF) in the project "Sprechmaschine", FKZ 01UQ1601A.
References
1. Panconcelli-Calzia, G.: Geschichtszahlen der Phonetik (1941)/Quellenatlas der Phonetik (1940), New edition by K. Koerner. Benjamins, Amsterdam (1994)
2. Dudley, H., Tarnoczy, T.H.: The speaking machine of Wolfgang von Kempelen. JASA 22(2), 151–166 (1950)
3. Ohala, J.J. (ed.): A Guide to the History of the Phonetic Sciences in the United States. University of California, Berkeley (1999)
4. Bekanntmachung von Förderrichtlinien "Vernetzen - Erschließen - Forschen. Allianz für universitäre Sammlungen" (2015). BMBF Homepage https://www.bmbf.de/foerderungen/bekanntmachung-1029.html. Accessed 22 Apr 2018
5. Hoffmann, R., Mehnert, D.: Early experimental phonetics in Germany - historic traces in the collection of the TU Dresden. In: Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS 2007), Saarbrücken, pp. 881–884 (2007)
6. Mehnert, D.: Historische phonetische Geräte. Katalog der historischen akustisch-phonetischen Sammlung der TU Dresden, 1. Teil. TUDpress, Dresden (2012)
7. Kratzenstein, C.G.: Tentamen resolvendi problema, Petersburg 1781. Übersetzt und kommentiert von Christian Korpiun. TUDpress, Dresden (2016)
8. Wethlo, F.: Versuche mit Polsterpfeifen. Passow-Schaefers Beiträge für die gesamte Physiologie 6(3), 268–280 (1913)
9. Chiba, T., Kajiyama, M.: The Vowel: Its Nature and Structure. Tokyo-Kaiseikan Pub. Co., Tokyo (1941)
10. Arai, T.: Education in acoustics and speech science using vocal-tract models. JASA 131(3), 2444–2454 (2012)
11. Chhetri, D.K., Zhang, Z., Neubauer, J.: Measurement of Young's modulus of vocal folds by indentation. J. Voice 25(1), 1–7 (2011)
12. Alipour, F., Vigmostad, S.: Measurement of vocal folds elastic properties for continuum modeling. J. Voice 26, 816.e21–816.e29 (2012)
13. Scherer, R.C., et al.: Intraglottal pressure profiles for a symmetric and oblique glottis with a divergence angle of 10 degrees. JASA 109(4), 1616–1630 (2001)
14. Murray, P.R., Thomson, S.L.: Synthetic, multi-layer, self-oscillating vocal fold model fabrication. J. Vis. Exp. (JoVE) 58 (2011)
15. Chen, G., et al.: Development of a glottal area index that integrates glottal gap size and open quotient. JASA 133(3), 1656–1666 (2013)
16. Kreiman, J., et al.: Variability in the relationships among voice quality, harmonic amplitudes, open quotient, and glottal area waveform shape in sustained phonation. JASA 132(4), 2625–2632 (2012)
17. Stone, S., Marxen, M., Birkholz, P.: Construction and evaluation of a parametric one-dimensional vocal tract model. IEEE Trans. Audio Speech Lang. Process. 26(8), 1381–1392 (2018)
18. Fleischer, M., Mainka, A., Kürbis, S., Birkholz, P.: How to precisely measure the volume velocity transfer function of physical vocal tract models by external excitation. PLoS ONE 13(3), e0193708 (2018). https://doi.org/10.1371/journal.pone.0193708
19. Yushkevich, P.A., et al.: User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31(3), 1116–1128 (2006)
20. Birkholz, P.: Enhanced area functions for noise source modeling in the vocal tract. In: Proceedings of the 10th International Seminar on Speech Production (ISSP 2014), Cologne, pp. 37–40 (2014)
21. Beautemps, D., Badin, P., Bailly, G.: Linear degrees of freedom in speech production: analysis of cineradio- and labio-film data and articulatory-acoustic modeling. JASA 109(5), 2165–2180 (2001)
22. Laprie, Y., Loosvelt, M., Maeda, S., Sock, R., Hirsch, F.: Articulatory copy synthesis from cine X-ray films. In: Proceedings of the Interspeech, Lyon, France (2013)
23. Dang, J., Honda, K.: Acoustic characteristics of the piriform fossa in models and humans. JASA 101(1), 456–465 (1997)
24. Delvaux, B., Howard, D.: A new method to explore the spectral impact of the piriform fossae on the singing voice: benchmarking using MRI-based 3D-printed vocal tracts. PLOS ONE 9(7), e102680 (2014)
25. Echternach, M., et al.: Articulation and vocal tract acoustics at soprano subject's high fundamental frequencies. JASA 137(5), 2586–2595 (2015)
26. Hoffmann, R.: On the development of early vocoders. In: Proceedings of the 2nd IEEE Histelcon 2010, Madrid, pp. 359–364, 3–5 November 2010
27. Hoffmann, R.: Zur Entwicklung des Vocoders in Deutschland. In: 37. Jahrestagung für Akustik, DAGA 2011, Düsseldorf, pp. 149–150, 21–24 March 2011
28. Hoffmann, R., Gramm, G.: The Sennheiser vocoder goes digital: On a German R&D project in the 1970s. In: Proceedings of the 2nd International Workshop on the History of Speech Communication Research (HSCR 2017), Helsinki, 18–19 August 2017, pp. 35–44. TUDpress, Dresden (2017)
29. Solschenizyn, A.: Im ersten Kreis. Aus dem Russ. übersetzt und zusammengetragen von S. Geier. Vollständige Ausgabe der wiederhergestellten Urfassung. S. Fischer Verlag, Frankfurt am Main (1982)
30. Schroeder, M.R.: Computer Speech: Recognition, Compression, Synthesis. Springer Series in Information Sciences, vol. 35. Springer, Heidelberg (1999). https://doi.org/10.1007/978-3-662-06384-2
31. Tompkins, D.: How to Wreck a Nice Beach: The Vocoder from World War II to Hip-Hop. Melville House/Chicago: Stop Smiling Media, Brooklyn (2010)
32. Kotel'nikov, V.A.: Sud'ba, ochvativšaja vek. Tom 2: N. V. Kotel'nikova ob otce. Fizmatlit, Moskva (2011)
33. Kalačev, K.F.: V kruge tret'em. Vospominanija i razmyšlenija o rabote Marfinskoj laboratorii v 1948–1951 godach. Moskva (1999)
34. Hoffmann, R., Jäckel, R.: Zur Geschichte des Vocoders in der Sowjetunion. In: 44. Jahrestagung für Akustik, DAGA 2018, München, pp. 840–843, 19–22 March 2018
35. Mjasnikov, L.L.: Ob-ektivnoe raspoznavanie zvukov reči. Žurnal Techničeskoj Fiziki 13(3), 109–115 (1943)
36. Schroeder, M.R., David, E.E.: A vocoder for transmitting 10 kc/s speech over a 3.5 kc/s channel. Acustica 10, 35–43 (1960)
37. Munson, W.A., Montgomery, H.C.: A speech analyzer and synthesizer. JASA 22(5), 678 (1950)
38. Sapožkov, M.A.: Rečevoj signal v kibernetike i svjazi. Svjaz'izdat, Moskva (1963)
LSTM Neural Network for Speaker Change Detection in Telephone Conversations

Marek Hrúz1 and Miroslav Hlaváč1,2,3

1 Faculty of Applied Sciences, NTIS, UWB, Pilsen, Czech Republic
[email protected]
2 Faculty of Applied Sciences, Department of Cybernetics, UWB, Pilsen, Czech Republic
[email protected]
3 ITMO University, St. Petersburg, Russia
Abstract. In this paper, we analyze an approach to speaker change detection in telephone conversations based on recurrent Long Short-Term Memory Neural Networks. We compare this approach to speaker change detection via Convolutional Neural Networks. We show that by fine-tuning the architecture and using suitable input data in the form of spectrograms, we obtain results that are better by 2% relative. We have discovered that a smaller architecture performs better on unseen data. We also found that using stateful LSTM layers that try to remember whole conversations performs much worse than using recurrent networks that memorize only short sequences of speech.

Keywords: Speaker change · Diarization · Stateful LSTM

1 Introduction
Speaker change detection (SCD) is the task of finding boundaries of speech segments of two different speakers. Usually, the final goal is to use these segments for the task of diarization [4], where the segments are clustered and labelled, leading to the solution of the problem "who speaks when". In our previous paper [3], we extended the definition of SCD to the detection of time instances in an audio stream when a change of audio source occurs, which yields segments with constant audio sources. This is important for the real-world scenario where the speech is produced naturally, e.g. a telephone conversation. There are frequent speech overlaps, loud noise may be present, and silence also plays its role. If we look at SCD as a preliminary step to speaker diarization, it is reasonable to detect the time in the audio stream when a second speaker starts speaking into the speech of the first speaker, producing overlapped speech, and also the time when the overlap ends. Such a segment has a constant audio source and can be handled by the diarization system as an outlier. Another issue is silence. When a speaker speaks, makes a long pause, and then continues to speak, the boundaries of the silence should also be detected although no speaker change occurred. This is reasonable since the diarization system will model one segment as one speaker, and a long silence can affect the acoustic properties of the speaker, leading to a model that is not correct. One can argue that this can be handled by a voice activity detection system, but our definition explicitly handles this problem and we believe that the SCD system learns this naturally from the data. The open problem is the length of the silence – what is considered a long pause?
2 Related Work
SCD has been addressed by many researchers in the past. There are classical methods based on the comparison of neighbouring segments of audio – the Bayesian Information Criterion [5] and other distance measures. More recently, some papers have applied Deep Neural Networks (DNN) to the task of SCD: Convolutional Neural Networks (CNN) [3], standard fully connected DNNs [2], and recurrent DNNs with Long Short-Term Memory (LSTM) cells [6]. The biggest difference is that the CNN uses the spectrogram of the audio while the other approaches use hand-crafted features like MFCC. In this paper, we experimented with different architectures of LSTM networks to find out whether they outperform the baseline CNN approach we published earlier [3]. We also compared whether the DNNs perform better when hand-crafted features are presented to them or whether we can use the raw audio signal in the form of a spectrogram.
3 Dataset
For the training and testing purposes, we used a fraction of the telephone conversation data from the CallHome [1] corpus. The data are sampled at 8 kHz and are in English. We consider only the conversations where two speakers are present. In total, we obtain 109 conversations, each approximately 10 min long. We used 71 conversations for training and 38 conversations for testing. If spectrograms are used, we compute each sample using a Hamming window of length 64 ms with a stride of 10 ms. Each sample represents 256 frequencies of the speech micro-segment. When we use MFCCs, the setup is as follows: we use a Hamming window of length 32 ms and shift it by 16 ms. We use 25 triangular filter banks that are spread non-linearly (Mel) and extract 11 cepstral coefficients. We use the deltas and delta-deltas of the coefficients and, furthermore, the deltas and delta-deltas of the signal energy. This setup follows the setup in [6].
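As an illustration of the spectrogram setup described above, the following Python sketch frames an 8 kHz signal with a 64 ms Hamming window and a 10 ms stride and keeps 256 frequency bins per frame. The function and variable names are our own, and the snippet is only an approximation of the authors' pipeline.

```python
import numpy as np

def spectrogram_frames(signal, fs=8000, win_ms=64, hop_ms=10, n_bins=256):
    """Log-magnitude spectrogram: one row per 10 ms step, n_bins frequencies."""
    win = int(fs * win_ms / 1000)          # 512 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)          # 80 samples at 8 kHz
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.empty((n_frames, n_bins), dtype=np.float32)
    for i in range(n_frames):
        chunk = signal[i * hop:i * hop + win] * window
        spec = np.abs(np.fft.rfft(chunk))[:n_bins]   # keep the first 256 bins
        frames[i] = np.log(spec + 1e-8)
    return frames

# usage (placeholder signal):
# audio = np.random.randn(8000 * 10)      # 10 s at 8 kHz
# feats = spectrogram_frames(audio)       # shape: (n_frames, 256)
```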
4 Network Architectures
In this paper, we use the CNN published earlier [3] as a baseline and compare the performance of different LSTM DNN architectures with it. The CNN architecture is summarized in Table 1. The convolutional layers use ReLU activation functions and the fully connected dense layers use sigmoid activation functions.
Table 1. Summary of the architecture of the CNN.

Layer        | Kernels | Size   | Shift
Convolution  | 50      | 16 × 8 | 2 × 2
Max pooling  |         | 2 × 2  | 2 × 2
Batch norm   |         |        |
Convolution  | 200     | 4 × 4  | 1 × 1
Max pooling  |         | 2 × 2  | 2 × 2
Batch norm   |         |        |
Convolution  | 300     | 3 × 3  | 1 × 1
Max pooling  |         | 2 × 2  | 2 × 2
Batch norm   |         |        |
Dense        | 4000    |        |
Dense        | 1       |        |
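A possible Keras rendering of this CNN is sketched below (our own illustration based on Table 1; the input size, padding, and optimizer are assumptions not stated in the paper).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline_cnn(input_shape=(256, 141, 1)):
    # assumed input: 256 frequency bins x ~141 frames (1.4 s at a 10 ms step)
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(50, (16, 8), strides=(2, 2), activation="relu"),
        layers.MaxPooling2D((2, 2), strides=(2, 2)),
        layers.BatchNormalization(),
        layers.Conv2D(200, (4, 4), strides=(1, 1), activation="relu"),
        layers.MaxPooling2D((2, 2), strides=(2, 2)),
        layers.BatchNormalization(),
        layers.Conv2D(300, (3, 3), strides=(1, 1), activation="relu"),
        layers.MaxPooling2D((2, 2), strides=(2, 2)),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(4000, activation="sigmoid"),
        layers.Dense(1, activation="sigmoid"),   # speaker-change probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```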
The LSTMs were used in two different ways. First, we trained stateful LSTMs, where we train the models on whole conversations. This means that the net remembers what was said, how it was said, and who was speaking. Second, we trained the LSTMs on short sequences, and after each sequence the state of the network was reset. This means that the memory of the net is limited to the length of the sequence. Both approaches share the same architectures, but the style of training is changed. The architectures are summarized in Tables 2 and 3.

Table 2. Summary of the architectures of the LSTMs. Each layer is listed as Layer (Cells, Activation).

Architecture 1: LSTM (512, tanh), LSTM (512, tanh), LSTM (512, tanh), Dense (1024, relu), Dense (512, relu), Dense (256, relu), Dense (1, sigmoid)
Architecture 2: Dense (1024, linear), LSTM (512, tanh), LSTM (512, tanh), LSTM (512, tanh), Dense (1024, relu), Dense (512, relu), Dense (256, relu), Dense (1, sigmoid)
Architecture 3: LSTM (512, tanh), LSTM (512, tanh), LSTM (512, tanh), Dense (1024, tanh), Dense (512, tanh), Dense (256, tanh), Dense (1, sigmoid)
The first set of architectures in Table 2 uses a larger number of parameters (circa 7–8 million trainable parameters). The architecture in Table 3 follows the work of Yin et al. [6] and is more lightweight (circa 100k trainable parameters). Some architectures were also tested in a bi-directional scenario, in which the LSTM layers observe the sequences from both directions. This effectively doubles the number of parameters of the LSTM layers. The last layer in all networks consists of only one neuron with a sigmoid activation function; it represents the probability for a given input. In the case of the CNN, the input is a spectrogram of 1.4 s of audio signal and the output is the probability of a speaker change in the middle of the signal. For the LSTMs, the input is either a sequence of spectrogram samples or a sequence of MFCCs.

Table 3. Summary of the architecture of the lightweight LSTMs. Each layer is listed as Layer (Cells, Activation).

Architecture 4: LSTM (32, tanh), LSTM (40, tanh), Dense (40, tanh), Dense (10, tanh), Dense (1, sigmoid)
Architecture 5: LSTM (64, tanh), LSTM (64, tanh), LSTM (128, tanh), Dense (128, tanh), Dense (64, tanh), Dense (1, sigmoid)
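For concreteness, a possible realization of the lightweight Architecture 4 in Keras is sketched below (our own assumption of how such a model could be built; the paper does not give training code). It returns a per-frame speaker-change probability for an input sequence of feature vectors.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_architecture_4(n_features):
    # sequence-to-sequence model: one change probability per input frame
    model = models.Sequential([
        layers.Input(shape=(None, n_features)),
        layers.LSTM(32, activation="tanh", return_sequences=True),
        layers.LSTM(40, activation="tanh", return_sequences=True),
        layers.TimeDistributed(layers.Dense(40, activation="tanh")),
        layers.TimeDistributed(layers.Dense(10, activation="tanh")),
        layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# e.g. 256 spectrogram bins per frame, as in Sect. 3
model = build_architecture_4(n_features=256)
model.summary()
```

Wrapping the LSTM layers in layers.Bidirectional(...) would give the bi-directional variants mentioned above.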
5 Experiments
All the networks were trained to minimize the loss between the predicted probability signal and the labelling signal. The labelling signal is created according to the work [4]. It is a linear fuzzy labelling, where the probability of the speaker change is given by a normalized distance from an annotated value as in Eq. 1:

$L(t) = \max\left(0,\; 1 - \frac{\min_i(|t - s_i|)}{\tau}\right)$,  (1)

where $s_i$ is the time of the $i$-th annotated speaker change and $\tau = 0.6$ is the tolerance. The annotated speaker changes were filtered so that a pause in one speaker's utterance is limited to 0.5 s. The loss function is defined as the binary cross-entropy function. For all setups, optimal hyperparameters have been found and the networks are trained until convergence of the loss function. The results are presented in the form of coverage-purity curves (Eq. 2) and equal coverage-purity values:

$\mathrm{coverage}(R, H) = \frac{\sum_{r \in R} \max_{h \in H} |r \cap h|}{\sum_{r \in R} |r|}$,  (2)

where $R$ are the reference speaker segments, $H$ are the predicted speaker segments, $|s|$ is the duration of segment $s$, and $r \cap h$ is the intersection of segments $r$ and $h$. Purity is the dual metric where the roles of $R$ and $H$ are interchanged. The coverage represents how well we divided the audio signal according to the annotations. Low coverage means that the signal was oversegmented; on the other hand, undersegmentation results in high coverage. That is why the overall quality of the segmentation has to be analyzed dually by the purity measurement. Low purity means that the signal was undersegmented, while oversegmentation results in high purity. That is why the best result is achieved when both coverage and purity are high.
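A minimal Python sketch of the coverage computation in Eq. 2 follows; purity is obtained by swapping the roles of the reference and hypothesis segments. Segments are represented here as (start, end) tuples in seconds, which is our own convention.

```python
def overlap(a, b):
    """Duration of the intersection of two (start, end) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def coverage(reference, hypothesis):
    """Eq. 2: best-matching overlap per reference segment, normalized
    by the total reference duration."""
    num = sum(max((overlap(r, h) for h in hypothesis), default=0.0)
              for r in reference)
    den = sum(r[1] - r[0] for r in reference)
    return num / den if den > 0 else 0.0

def purity(reference, hypothesis):
    # dual metric: roles of reference and hypothesis are interchanged
    return coverage(hypothesis, reference)

# toy example with hypothetical segment boundaries
ref = [(0.0, 4.0), (4.0, 9.0)]
hyp = [(0.0, 3.5), (3.5, 9.0)]
print(coverage(ref, hyp), purity(ref, hyp))
```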
5.1 Baseline CNN
The system is trained according to the work [4]. Each spectrogram representing 1.4 s of the audio signal is regressed into the probability value of a speaker change in the middle of the audio segment. The training samples are randomly selected from the training set. The testing audio signals are analyzed window by window with a shift of 0.1 s.

5.2 Stateful LSTMs
Stateful LSTMs are trained on sequences covering whole conversations. The LSTMs are trained on batches of shorter sequences, but the internal states of the networks are remembered across the whole conversation. The experiment should show whether the networks are able to model the speakers present in the conversation and/or whether they are able to learn from generally longer sequences. The LSTMs return sequences, which allows us to predict frame-based probabilities of the speaker change. In the case of spectrograms on the input of the network, the frame is 0.01 s long; in the case of MFCCs, the frame is 0.016 s long. The testing audio signals are analyzed conversation by conversation. After each testing conversation, the internal states of the network are reset.

5.3 Short Sequence LSTMs
These networks were trained with "forgetting" after each short sequence. According to [6], we used sequences 3.2 s long. The networks were presented with batches of randomly selected sequences from the training set. The testing sequences were obtained from individual conversations and shifted by 0.8 s, resulting in overlapping probability signals. The resulting probability for a given time was computed as the average of the overlapping values at that time.
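The local averaging of overlapping per-frame predictions described above can be sketched as follows (our own illustration; the frame rate, window length, and shift follow the values given in the text):

```python
import numpy as np

def average_overlapping(prob_windows, frame_hop_s=0.01, win_s=3.2, shift_s=0.8):
    """Average per-frame probabilities of 3.2 s windows shifted by 0.8 s.
    prob_windows: list of 1-D arrays, one per window, at a 100 Hz frame rate."""
    frames_per_win = int(round(win_s / frame_hop_s))
    frames_per_shift = int(round(shift_s / frame_hop_s))
    total = frames_per_shift * (len(prob_windows) - 1) + frames_per_win
    acc = np.zeros(total)
    cnt = np.zeros(total)
    for i, p in enumerate(prob_windows):
        start = i * frames_per_shift
        acc[start:start + frames_per_win] += p[:frames_per_win]
        cnt[start:start + frames_per_win] += 1
    return acc / np.maximum(cnt, 1)
```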
6 Results
In Fig. 1, we show the coverage-purity curves for the stateful LSTM DNNs. The different networks are summarized in Table 4. The best result, with an equal coverage-purity (ECP) of 0.7257, was achieved by the network that uses the raw spectrogram as input. The hand-crafted MFCC features worsen the results. The network denoted net03 uses a dense layer to simulate a feature extraction step; this supersedes the MFCC features but is still not as good as when the raw spectrogram is used. Still, the best result is far behind the ECP of the CNN, which is equal to 0.7955. This indicates that there are not enough data to train the stateful LSTMs to obtain a model that is general and works well on unseen testing data. We tried lowering the number of trainable parameters to address this issue by using the lightweight architecture summarized in Table 3. Our experiments showed that using the bidirectional LSTM layers is beneficial by a small margin. The ECP achieved by the network denoted net04 in Fig. 2 was equal to 0.7675, which is
Fig. 1. Coverage-purity curves for stateful large LSTMs.
Fig. 2. Coverage-purity curves for lightweight LSTMs.
much better than the larger architecture but still not as good as the CNN. This result supports our theory of not having enough data to train the stateful LSTMs. There may be other reasons, but we were not able to achieve better results. This conclusion led us to new experiments with shorter sequences and "forgetful" LSTM DNNs. During these experiments, we also found that using the hyperbolic tangent as the activation function of the dense layers yields better results than the ReLU function, hence Architecture 3. With this setup, we achieved an ECP of 0.7639, which is again worse than the lightweight architecture. This means that the large architectures are inadequately large for the shorter sequences. When we use the lightweight architecture in this scenario, we achieve an ECP of 0.7807 when using MFCC features and finally an ECP of 0.8121 when using the raw spectrogram. This result is better than the results of the CNN. One last test was with a network with Architecture 5 (Table 3), which has more parameters. The resulting ECP of 0.8027 shows that the smaller architecture performs better on unseen data.

Table 4. Different types of LSTM networks. Spectro indicates that the spectrogram was on the input, while MFCC means MFCC features. Bi means that the LSTM layers were bidirectional and Arch is the type of architecture used.

Name   | Stateful | Spectro | MFCC | Bi | Arch | ECP
net01  | ×        |         | ×    |    | 1    | 0.7160
net02  | ×        | ×       |      |    | 1    | 0.7257
net03  | ×        | ×       |      |    | 2    | 0.7189
net04  | ×        |         | ×    | ×  | 4    | 0.7639
net05  |          |         | ×    | ×  | 4    | 0.7807
net06  |          | ×       |      | ×  | 4    | 0.8121

7 Conclusion
We have conducted experiments with recurrent neural networks for the task of speaker change detection in telephone conversations. We have shown that the recurrent LSTM DNNs are able to outperform the CNN approach of [4] when proper care is put into the selection of the architecture and the form of the input audio data. Smaller architectures have a better chance to generalize the problem and perform well on unseen data. With a larger architecture, much more data would be needed to train good models, even though the problem of speaker change detection does not seem to be a very difficult one when handled with a machine learning approach. Another important conclusion is that using the raw spectrogram input is much better than using hand-crafted MFCC features. This has been observed in many other applications of neural networks, particularly CNNs in computer vision. With the best setup of LSTM DNNs,
we achieve a result of 0.8121, outperforming the baseline CNN by 2% relative. When comparing both approaches as a whole, one has to consider their usage. The CNN is able to decide about any 1.4 s long segment independently of other segments, but it generally has many more parameters than the lightweight LSTM DNN. On the other hand, our LSTM DNN needs to observe 3.2 s long segments and uses local averaging of the prediction probability function. This requires more forward passes through the network, but we obtain a frame-level decision about the speaker change. Given the smaller number of parameters in the network, this might not be an issue.
Acknowledgments. This paper was supported by the project no. P103/12/G084 of the Grant Agency of the Czech Republic. The work has also been supported by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References
1. Canavan, A., Graff, D., Zipperlen, G.: CALLHOME American English speech, LDC97S42. In: LDC Catalog. Linguistic Data Consortium, Philadelphia (1997)
2. Gupta, V.: Speaker change point detection using deep neural nets. In: ICASSP, pp. 4420–4424. Brisbane (2015). https://doi.org/10.1109/ICASSP.2015.7178806
3. Hrúz, M., Kunešová, M.: Convolutional neural network in the task of speaker change detection. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 191–198. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_22
4. Hrúz, M., Zajíc, Z.: Convolutional neural network for speaker change detection in telephone speaker diarization system. In: ICASSP: 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4945–4949 (2017)
5. Shaobing, S., Gopalakrishnan, P.: Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, vol. 8, pp. 127–132 (1998)
6. Yin, R., Bredin, H., Barras, C.: Speaker change detection in broadcast TV using bidirectional long short-term memory networks. In: Interspeech 2017, Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), ISCA, Stockholm, Sweden (2017). https://doi.org/10.21437/Interspeech.2017-65
Noise Suppression Method Based on Modulation Spectrum Analysis

Takuto Isoyama and Masashi Unoki

Japan Advanced Institute of Science and Technology, 1–1 Asahidai, Nomi, Ishikawa 923–1292, Japan
{isoyama-t,unoki}@jaist.ac.jp
Abstract. Conventional methods for noise suppression can successfully reduce stationary noise. However, non-stationary noise such as intermittent and impulsive noise cannot be sufficiently suppressed since these methods do not focus on temporal features of noise. This paper proposes a method for suppressing both stationary and non-stationary noise based on modulation spectrum analysis. Modulation spectra (MS) of the stationary, intermittent, and impulsive noise were investigated by using the time/frequency/modulation analysis techniques to characterize the MS features. These features were then used to suppress the stationary and non-stationary noise components from the observed signals. Using the proposed method, the direct-current components of the MS in the stationary noise, harmonicity of the MS in the intermittent noise, and higher modulation-frequency components of the MS in the impulsive noise were removed. The following advantages of the proposed method were confirmed: (1) sound pressure level of the noise was dramatically reduced, (2) signal-to-noise ratio of the noisy speech was improved, and (3) loudness, sharpness, and roughness of the restored speech were enhanced. These results indicate that the stationary as well as non-stationary noise can be successfully suppressed using the proposed method.

Keywords: Noise suppression · Modulation spectrum · Non-stationary noise · Gammatone filterbank · Psychoacoustical sound-quality index
1 Introduction
We perceive various types of sounds at various sound-pressure levels in our daily life. For example, speech and music are perceived as wanted sound, and background stationary and non-stationary noise as unwanted sound. Heavy noise not only dramatically reduces the intelligibility of speech but also induces hearing loss and hearing fatigue in the case of long-term exposure. Therefore, noise suppression is important for enhancing speech intelligibility as well as for protecting hearing ability. There are many kinds of noise suppression methods. The classical and most commonly used method for suppressing noise is Boll's spectral subtraction method
[1]. It can successfully suppress stationary components of background noise by subtracting the averaged amplitude spectrum from the noisy speech. However, this method cannot sufficiently suppress non-stationary noise such as impulsive noise and intermittent noise. Several methods have been proposed for suppressing non-stationary noise. One of them can sufficiently suppress impulsive noise by using a zero-phase signal [2]. However, the drawback of that method is that the impulsive components of the unvoiced signals (consonants) are also removed. Non-negative spectral decomposition was proposed to suppress both stationary and non-stationary noise [3]. However, this method trains noise properties using a preliminary learning technique, so noise reduction is limited to the training data. It is difficult to reduce both stationary and non-stationary noise simultaneously without prior knowledge of the noise types and preliminary learning. From the knowledge of human auditory perception [4], temporal modulation can be regarded as an important part of speech perception as well as of sound quality assessment. Therefore, our motivation is to mimic noise reduction based on auditory modulation perception. This paper proposes a method for suppressing both stationary and non-stationary noise based on modulation spectrum (MS) analysis. MS features are used to characterize speech signals as well as various types of noise. These features are then used to reduce stationary, impulsive, and intermittent noise components from noisy speech.
Fig. 1. Block diagram of method for suppressing stationary/non-stationary noise.
2 Noise Suppression Method

Fig. 1 shows a block diagram of the suppression method based on MS. First, the observed signal $s(t)$ is decomposed into its frequency components ($k$-channel signals) $x_k(t)$ by the gammatone filterbank [5]. Second, the temporal amplitude envelope $e_k(t)$ and carrier signal $c_k(t)$ are decomposed from $x_k(t)$. The temporal power envelope of the $k$-th channel, $e_k^2(t)$, can be derived by using the Hilbert transform as follows:

$e_k^2(t) = \mathrm{LPF}\left(|x_k(t) + j \cdot \mathrm{Hilbert}(x_k(t))|^2\right)$,  (1)

where $\mathrm{LPF}(\cdot)$ is the low-pass filter with the cut-off frequency of 64 Hz, $|\cdot|$ is the absolute value, and $\mathrm{Hilbert}(\cdot)$ is the Hilbert transform. Third, noise components
on the temporal power envelope of the $k$-th channel are removed by using the following steps: (1) removal of the stationary noise component, (2) removal of the intermittent noise component, and (3) removal of the impulsive noise component. This is referred to as a "noise-suppressed" power envelope. Fourth, the restored amplitude envelope is derived from the noise-suppressed power envelope by a square-root operation, and the stored carrier is multiplied to resynthesize the noise-suppressed channel signal. Finally, the noise-suppressed signal, $y(t)$, is obtained by using the inverse gammatone filterbank. The blocks in Fig. 1 indicate the following: $(\cdot)^2$ is the square operation, $\mathrm{Mean}(\cdot)$ is a mean operation in the time domain, $\mathrm{HWR}(\cdot)$ is a half-wave rectification, $\mathrm{BSF}(\cdot)$ is a band-stop filter, and $\mathrm{LPF}_S(\cdot)$ is a low-pass filter whose pass-band approximates the MS shape of speech. Figure 2 shows an example of how the power envelope of an observed signal is processed to suppress stationary and non-stationary noise components. Figure 2(a) shows the temporal power envelope derived from the outputs of the gammatone filterbank. Figure 2(b) shows the power envelope after removing the direct-current (DC) component of the MS in Fig. 2(a). Figure 2(c) shows the power envelope after removing the harmonics in the MS in Fig. 2(b). Figure 2(d) shows the power envelope after removing the higher modulation-frequency components of the MS in Fig. 2(c). The temporal power envelopes in Figs. 2(a)–(d) are obtained from the outputs of the corresponding processing blocks in Figs. 1(a)–(d).
Fig. 2. Examples of noise suppression by the proposed method: (a) power envelope of the observed noisy signal; (b) suppressed power envelope by DC-removal from (a); (c) suppressed power envelope by band-stop filtering of (b), and (d) suppressed power envelope by low-pass filtering of (c).
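The derivation of the power envelope in Eq. 1 can be sketched in Python as follows; the MS of Eq. 2 in the next section is then simply the DFT magnitude of this envelope. The 64 Hz cut-off is taken from the text, while the filter order, function names, and the assumption that the gammatone filterbank stage has already been applied are our own.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def power_envelope(x_k, fs, cutoff_hz=64.0, order=4):
    """Eq. 1: temporal power envelope of one gammatone channel x_k(t)."""
    analytic = hilbert(x_k)                    # x_k(t) + j*Hilbert(x_k(t))
    squared = np.abs(analytic) ** 2
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")   # 64 Hz low-pass
    return filtfilt(b, a, squared)

def modulation_spectrum(env2):
    """Eq. 2 (Sect. 3): magnitude of the DFT of the power envelope."""
    return np.abs(np.fft.rfft(env2))
```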
3 Modulation Spectrum Analysis

Modulation spectrum analysis (MSA) is used in Fig. 1(a) to investigate the modulation features of various types of sounds. The MS, $E_k(f_m)$, can then be derived
from the power envelope, $e_k^2(t)$, by using the discrete Fourier transform (DFT) as follows:

$E_k(f_m) = |\mathrm{DFT}(e_k^2(t))|$,  (2)

where $f_m$ is the modulation frequency in Hz.

3.1 Database
Table 1 shows the database information of the sound sources in detail. Here, $f_s$ is the sampling frequency in Hz. This frequency has different values in each dataset, so the sampling frequency used in the proposed method and the MSA is set to 44.1 kHz by using a resampling technique. Speech stimuli, including male and female speech signals with four-mora words from the familiarity-controlled word lists (FW07) [6], were used to analyze the MS features of speech signals. Noise stimuli, including stationary noise (white noise, pink noise, and babble noise) and non-stationary noise (machine-gun noise as intermittent noise and impulses as impulsive noise) from NOISEX-92 [7], were used to analyze the MS features of stationary and non-stationary noise.
Table 1. Sound sources used for modulation spectrum analysis.
Sound source       | fs [Hz] | Number of stimuli | Duration [sec]
White noise        | 19,980  | 1                 | 235
Pink noise         | 19,980  | 1                 | 235
Babble noise       | 19,980  | 1                 | 235
Machine-gun noise  | 19,980  | 1                 | 235
Impulse noise      | 19,980  | 4                 | 1
Male speech        | 48,000  | 400               | 1
Female speech      | 48,000  | 400               | 1

3.2 Feature Analysis and Results
The MSA was used to analyze all stimuli of the various sound sources shown in Table 1. Figure 3 shows the results of the MSA for the various sound sources. The horizontal axis indicates the modulation frequency in Hz and the vertical axis indicates the normalized MS in dB. Here, normalization was done such that the level of the MS at 0 Hz is 0 dB. It was reconfirmed that the modulation spectra of the speech stimuli have a unique peak around the modulation frequency of 4 Hz, as shown in Fig. 3 [8]. It was found that the modulation spectra of stationary noise such as white noise, pink noise, and babble noise appear in the lower modulation frequencies. It was also found that the MS of machine-gun noise as intermittent noise appears as harmonics. From the analyses of the datasets, it was found that the fundamental
Fig. 3. Modulation spectra Ek (fm ) of various signals in the case of (a) k=11, (b) k=16, (c) k=21, and (d) k=27.
modulation frequency of the machine-gun noise was 8 Hz, while the MS of the impulsive noise appears as a flat shape with a dynamic range of 5 dB over all modulation frequencies. These values of $f_m$ and the dynamic range depend on the datasets, so they should be automatically determined by the auto-correlation technique. From these findings, the MS features of the various types of noise may be used to suppress both stationary and non-stationary noise simultaneously in the MS domain.
4 Algorithms of Suppression Processing

This section presents the three algorithms of noise suppression processing in Fig. 1 used to remove stationary and non-stationary noise components.

4.1 Removal of Stationary Noise Component

From the results in Sect. 3, it is found that the MS of stationary noise appears in the lower modulation frequencies. Thus, to obtain the power envelope $q_k^2(t)$
with the stationary noise removed, the DC component of the MS in Fig. 1(a) was cancelled out by using the following processing:

$q_k^2(t) = \begin{cases} e_k^2(t) - \mu_k & \text{if } e_k^2(t) \ge \mu_k \\ 0 & \text{otherwise} \end{cases}$,  (3)

$\mu_k = \frac{1}{T_N} \int_0^{T_N} e_k^2(t)\,dt$,  (4)

where $T_N$ is the time length of the non-speech section. In this paper, the speech and non-speech sections were determined by the voice activity detection (VAD) method [9].

4.2 Removal of Intermittent Noise Component

From the results in Sect. 3, it is found that the MS of intermittent noise appears as harmonics with a fundamental modulation frequency of 8 Hz. Thus, to remove the intermittent noise component, these harmonics of the MS in Fig. 1(b) are cancelled out by the following finite impulse response (FIR) band-stop filtering:

$H(z) = b_0 - r^L z^{-L}$,  (5)

where $b_0 = 1$, $r = 0.995$, $f_c$ is the fundamental modulation frequency, and $L = \mathrm{round}(f_s / f_c)$. In this paper, $f_c$ was determined from the rectified signal in Fig. 1(b) by using the auto-correlation method. Figure 4(a) shows an example of a band-stop filter (BSF) with $f_c = 8$ Hz.

4.3 Removal of Impulsive Noise Component

From the results in Sect. 3, it is found that the MS of impulsive noise appears over the entire modulation frequency domain as a flat shape. Thus, to remove the impulsive noise component, the MS shape in Fig. 1(c) is attenuated by using the following low-pass filter ($\mathrm{LPF}_S$):

$H(z) = \frac{b_0 + b_1 z^{-1}}{1 + a_1 z^{-1}}$,  (6)

where $H(z)$ was designed as an infinite impulse response (IIR) Butterworth filter. For example, when the cut-off frequency is 5 Hz, the coefficients are $b_0 = 0.07$, $b_1 = 0.07$, and $a_1 = 0.85$. Figure 4(b) shows an example of an $\mathrm{LPF}_S$ with a cut-off frequency of 5 Hz.
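A possible Python sketch of the three removal steps in Eqs. (3)–(6) is given below. The comb band-stop filter follows Eq. (5) with b0 = 1 and r = 0.995; the low-pass stage is designed here with scipy's Butterworth routine instead of the hard-coded coefficients; and the non-speech mask is assumed to come from an external VAD. Function names and the envelope sampling rate are our own assumptions.

```python
import numpy as np
from scipy.signal import lfilter, butter

def remove_stationary(env2, nonspeech_mask):
    """Eqs. (3)-(4): subtract the mean of the non-speech power envelope."""
    mu = env2[nonspeech_mask].mean()
    return np.where(env2 >= mu, env2 - mu, 0.0)

def remove_intermittent(env2, fs_env, fc, r=0.995):
    """Eq. (5): FIR comb band-stop H(z) = 1 - r^L z^-L at the fundamental
    modulation frequency fc (e.g. 8 Hz for the machine-gun noise)."""
    L = int(round(fs_env / fc))
    b = np.zeros(L + 1)
    b[0], b[L] = 1.0, -(r ** L)
    return lfilter(b, [1.0], env2)

def remove_impulsive(env2, fs_env, cutoff_hz=5.0, order=1):
    """Eq. (6): low-order IIR Butterworth low-pass over the power envelope."""
    b, a = butter(order, cutoff_hz / (fs_env / 2.0), btype="low")
    return lfilter(b, a, env2)
```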
5 Evaluations

5.1 Evaluation Measures
Five types of objective measures were used to evaluate the proposed method.
Fig. 4. Frequency responses of (a) band-stop filter and (b) low-pass filter.
The first two measures evaluated the efficiency of the proposed method in suppressing the noise components. One of them was used to evaluate the suppression level ($S_L$) with regard to the noise itself, defined as:

$S_L = 10 \log_{10} \frac{\int_0^T s^2(t)\,dt}{\int_0^T y^2(t)\,dt}$,  (7)

where $s(t)$ is the observed signal before suppression and $y(t)$ is the noise signal after suppression. Another measure was used to evaluate the relative suppression level with regard to noisy speech, defined as:

$N_S = \mathrm{SNR} - 10 \log_{10} \frac{\int_0^T s_s^2(t)\,dt}{\int_0^T (s_s(t) - y(t))^2\,dt}$,  (8)

where SNR is the signal-to-noise ratio with regard to the noise condition, $s_s(t)$ is the original speech, $y(t)$ is the noisy speech, and $T$ is the signal duration. The last three measures were the psychoacoustical sound-quality indices loudness, sharpness, and roughness [10]. These measures were used to objectively evaluate the sound quality after noise suppression. Loudness indicates the attribute of a sound that determines the magnitude of the auditory sensation produced. Sharpness and roughness indicate complex effects that quantify the subjective perception of rapid and sharp sound. Thus, heavy noise, in general, induces increasing loudness, increasing sharpness, and increasing roughness.
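The two suppression measures in Eqs. (7) and (8) reduce to ratios of signal energies and can be sketched as follows (a discrete-time approximation of the integrals; variable names are our own):

```python
import numpy as np

def suppression_level(s, y):
    """Eq. (7): S_L = 10*log10( sum(s^2) / sum(y^2) ) for the noise
    before (s) and after (y) suppression."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(y ** 2))

def relative_suppression(snr_db, s_clean, y_processed):
    """Eq. (8): N_S = SNR - 10*log10( sum(s_s^2) / sum((s_s - y)^2) )."""
    num = np.sum(s_clean ** 2)
    den = np.sum((s_clean - y_processed) ** 2)
    return snr_db - 10.0 * np.log10(num / den)
```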
5.2 Results
The proposed method was evaluated by using five types of noise signals presented in Table 1 and five evaluation measures presented in Sect. 5.1 to confirm whether
the proposed method can sufficiently suppress stationary and non-stationary noise, as well as reduce the perceptual effects due to noise exposure. Figure 5 shows the results of the noise suppression level as a function of the sound pressure level (SPL) of the noise from 60 dB to 100 dB. These results were obtained for the five types of noise signals. They indicate that the proposed method can sufficiently suppress the noise level regardless of the SPL and noise type. They also indicate that the three noise-suppression algorithms together have a sufficient suppression effect in comparison with each algorithm applied individually. Figure 6 shows the results of the relative suppression level as a function of SNR at three specific SPLs of noise from 60 dB to 100 dB in heavy noise conditions. The relative suppression level was calculated for the five types of noise by using Eq. (8). It was found that the noise suppression level under speech presentation by the proposed method in the cases of SPLs of 100 dB and 80 dB exceeds 5 dB, while the suppression level decreases from 5 dB as the SNR increases in the case of an SPL of 60 dB.
Fig. 5. Noise suppression level by the proposed method for five types of noisy signals for (a) sound pressure level of 60 dB, (b) 80 dB, and (c) 100 dB.
Psychoacoustical sound-quality indices were calculated from the five types of noisy speech and the noise-suppressed speech signals. Figure 7 shows the relative improvement of these indices when using the proposed method. Figure 7(a) shows the improvement in loudness when using the proposed method, that is, the reduced loudness, $L_R$. This was calculated by $L_R = L_{org} - L_{sup}$, where $L_{org}$ is the loudness of the original noisy speech and $L_{sup}$ is the loudness of the noise-suppressed speech. It was found that when the SPL of noise is 100 dB, the $L_R$ of white, pink, and babble noise is 50 sone, while for the intermittent noise and impulsive noise it is 20 sone. In addition, it was found that the reduced loudness, $L_R$, increases as the SPL of noise increases. Figure 7(b) shows the reduced sharpness, $K_R$. This was calculated by $K_R = K_{org} - K_{sup}$, where $K_{org}$ is the sharpness of the original noisy speech and $K_{sup}$ is the sharpness of the noise-suppressed speech. It was found that when the SPL of noise is 100 dB, the $K_R$ of white, pink, and babble noise is 0.1 acum, while for the
Fig. 6. Relative suppression level for five types of noisy signals for (a) sound pressure level of 100 dB, (b) 80 dB, and (c) 60 dB.
Fig. 7. Evaluations by psychoacoustical sound-quality indices: (a) reduced loudness LR , (b) reduced sharpness KR , and (c) reduced roughness RR .
intermittent noise and impulsive noise it is 0 acum. In addition, it was found that the reduced sharpness, $K_R$, remains the same when the SPL of noise increases. Figure 7(c) shows the reduced roughness, $R_R$. This was calculated by $R_R = R_{org} - R_{sup}$, where $R_{org}$ is the roughness of the original noisy speech and $R_{sup}$ is the roughness of the noise-suppressed speech. It was found that when the SPL of noise is 100 dB, the $R_R$ of white, pink, and babble noise is 0.05 asper, that of intermittent noise is 0.73 asper, and that of impulsive noise is 0.25 asper. In addition, it was found that the reduced roughness is sensitive to the non-stationary temporal fluctuations in such non-stationary noise. All of the results confirmed that the proposed method can perceptually reduce the noise effects for speech enhancement, even if the SPL of noise is high.
6 Conclusion
This paper proposed a method for suppressing both stationary and non-stationary noise based on MSA. MSA was used to investigate the unique features of stationary and non-stationary noise in the MS domain and to derive methods for cancelling out these features. The proposed method was evaluated on various types of noisy speech signals by using five types of evaluations (two suppression levels and three psychoacoustical sound-quality indices) to verify whether or not the noise level can be sufficiently suppressed and the perceptual effects of stationary and non-stationary noise can be reduced. It was found that the proposed method can suppress stationary noise by 8 dB, intermittent noise by 6 dB, and impulsive noise by 8 dB in terms of the suppression level. It was also found that the proposed method can sufficiently suppress the noise effects from noisy speech by 8 dB at SNRs from −20 to −60 dB in terms of the relative suppression level. It was found that when the SPL of noise is 100 dB, the $L_R$ of stationary noise is 50 sone, while that of intermittent noise and impulsive noise is 20 sone. In addition, it was found that $L_R$ increases as the SPL of noise increases. It was found that when the SPL of noise is 100 dB, the $K_R$ of stationary noise is 0.1 acum while that of intermittent noise and impulsive noise is 0 acum. It was found that when the SPL of noise is 100 dB, the $R_R$ of stationary noise is 0.05 asper, that of intermittent noise is 0.73 asper, and that of impulsive noise is 0.25 asper. This confirms that the proposed method can not only sufficiently suppress stationary and non-stationary noise but can also reduce the perceptual effects due to noise exposure.
Acknowledgments. This work was supported by the Secom Science and Technology Foundation, by the Suzuki Foundation, and by a Grant-in-Aid for Innovative Areas (No. 16H01669 and 18H05004) from MEXT, Japan.
References
1. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27, 113–120 (1979)
2. Takehara, R., Kawamura, A., Iiguni, Y.: Impulsive noise suppression using interpolated zero phase signal. In: APSIPA2017, pp. 1382–1389 (2017)
3. Zhiyao, D., Gautham, J.M., Paris, S.: Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments. In: Proceedings of Interspeech 2012, pp. 595–598 (2012)
4. Stephan, D.E., Torsten, D.: Characterizing frequency selectivity for envelope fluctuations. J. Acoust. Soc. Am. 108, 1181 (2000)
5. Patterson, R., Nimmo-Smith, L., Holdsworth, J., Rice, P.: An auditory filter bank based on the gammatone function. Paper Presented at a Meeting of the IOC Speech Group on Auditory Modelling at RSRE, pp. 14–15 (1987)
6. Kondo, T., Amano, S., Sakamoto, S., Susuki, Y.: Development of familiarity-controlled word-lists (FW07). IEICE Tech. Rep. 107(436), 43–48 (2008)
7. Varga, A., Steeneken, J.M.H.: Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(13), 247–251 (1993) 8. Atlas, L., Greenberg, S., Hermansky, H.: The Modulation Spectrum and Its Application to Speech Science and Technology. Interspeech Tutorial, Antwerp (2007) 9. Kanai, Y., Morita, S., Unoki, M.: Concurrent processing of voice activity detection and noise reduction using empirical mode decomposition and modulation spectrum analysis. In: Proceedings of INTERSPEECH, pp. 742–746 (2013) 10. Zwicker, F.: Psychoacoustics: Facts and Models. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-68888-4
Designing Advanced Geometric Features for Automatic Russian Visual Speech Recognition

Denis Ivanko1, Dmitry Ryumin1, Alexandr Axyonov1, and Miloš Železný2

1 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
[email protected], [email protected], [email protected]
2 University of West Bohemia, Pilsen, Czech Republic
[email protected]
Abstract. The use of video information plays an increasingly important role in automatic speech recognition. Nowadays, audio-only based systems have reached a certain accuracy threshold, and many researchers see a solution to the problem in the use of the visual modality to obtain better results. Despite the fact that the audio modality of speech is much more representative than the video, their proper fusion can improve both the quality and the robustness of the entire recognition system, which has been proved in practice by many researchers. However, no agreement between researchers on the optimal set of visual features has been reached. In this paper, we investigate this issue in more detail and propose advanced geometry-based visual features for an automatic Russian lip-reading system. The experiments were conducted using the collected HAVRUS audio-visual speech database. The average viseme recognition accuracy of our system trained on the entire corpus is 40.62%. We also tested the main state-of-the-art methods for visual speech recognition, applying them to continuous Russian speech with high-speed recordings (200 frames per second).

Keywords: Lip-reading · Automatic speech recognition · Visual speech decoding · Visual features · Geometric features · Russian speech
1 Introduction
Nowadays, automatic speech recognition is one of the most rapidly developing areas of computer science. This fact is confirmed by the large number of practical applications appearing almost every day. At the moment, the most popular of the existing applications are Google "Speech API", Apple "Siri", Microsoft "Cortana", Amazon "Alexa", and Yandex "Alisa" from the giants of the global IT industry, which have earned the recognition of millions of users. Along with them, there are thousands of practical applications that have spread into many areas of human life: in automatic processing of incoming calls in telephone call-centers, in voice control for home appliances and car navigation systems, in social services for people with disabilities, in healthcare, military, education, etc.
The idea of using voice input for converting speech into text or for organizing human-machine interaction (HMI) is very convenient and more natural for users compared to keyboard input. In recent years, the widespread use of machine learning techniques and artificial neural networks (ANN) has made it possible to raise the accuracy and reliability of speech recognition systems to a new level. Usually, the maximization of recognition accuracy is achieved through the use of cascades of ANNs trained on various combinations of acoustic features [1]. However, despite sometimes satisfactory results obtained with proper training of the system for solving a particular task, the recognition accuracy for spontaneous and continuous speech in real-life conditions is still far from human capabilities. At the same time, audio-based practical applications have a number of significant drawbacks. When acoustic noises occur, the recognition accuracy of such systems rapidly deteriorates. To date, the main approach to solving this problem is the use of certain pre-processing techniques for noise reduction in the incoming signal. However, in real-life conditions there are many different types of noise: from stationary wideband noise in the telephone channel to crowd noise ("cocktail party" noise) in a room along with reverberation. Thus, it is not always possible to conduct proper noise reduction, and this is still an open and currently unresolved issue for automatic speech recognition. Obviously, in noisy conditions the recognition accuracy of automatic systems is also far from human capabilities. On the other hand, it is worth taking into account that human speech is bimodal by its nature, and people themselves pay attention to the lip movements of the interlocutor during a conversation [2]. Therefore, it is not entirely correct to expect from an automatic speech recognition system the same high result as from a human if it receives significantly less information. Because of this, many researchers have begun to use visual information about speech in their studies. First, it allows creating more robust systems (since video information is invariant to acoustic noise). Second, the correct fusion of modalities makes it possible to obtain the advantages of both and, at the same time, eliminate their shortcomings, giving the best recognition results. In this paper, we focus on the study of visual Russian speech and present a method for extracting an advanced set of geometric features. We also present the results of experiments obtained with the developed lip-reading system. The remainder of this paper is organized as follows. A review of the state of the research field is presented in Sect. 2; in Sect. 3, we describe the basic methodology, including region of interest (ROI) localization and the proposed geometry-based visual features; in Sect. 4, we describe the setup and the results of the experiments; some conclusions are given in Sect. 5.
2 Related Work
One of the first works in which researchers tried to systematize the existing knowledge about audio-visual speech recognition was [3]. The paper showed that, despite the fact that the visual modality of natural speech is much less informative than the audio one, the information received from it is often enough to solve simple tasks (e.g.
isolated word recognition). A description of the current state of the field can be found in the works [4, 5] dedicated to visual-only speech recognition, and also in the works [6, 7] dedicated to audio-visual speech recognition. In the framework of the statistical approach to speech recognition, a representative database for model training is an indispensable element. For English speech, multiple databases are publicly available, such as AVICAR [8], AVLetters [9], CUAVE [10], AVTimit [11], IBMSR [12], the PRAV Corpus [13], etc. However, the situation with Russian speech is much more complicated, since there are very few existing Russian visual speech databases. In our work, we used our own database of continuous Russian speech with high-speed recordings – HAVRUS, collected in 2016–2017 at SPIIRAS [14]. The next important step in the construction of a lip-reading system is to locate the region of interest that contains the mouth motion relevant to speech. It is important since the quality of the ROI has a significant influence on the recognition accuracy. To extract ROIs, many researchers relied on the active appearance model (AAM) [15, 16], a Haar-like feature based boosted classification framework [17, 18], skin color thresholding [19], etc. Despite numerous studies, researchers have not been able to find a best feature set universally accepted for representing visual speech (e.g., in comparison with the well-known MFCC features for acoustic speech). To date, there are several basic types of features which can be found in the literature. The most frequently used of them are: pixel-based features [20] – raw pixel data used directly or after some image transformation; geometry-based features [21] – geometric information about the talking lips is extracted as features; motion-based features [22] – features designed to describe the motion; model-based features [23] – a model of the visible articulators is built and the compact model parameters are used as visual features; or a combination of the above-mentioned features [24, 25]. There are also several state-of-the-art methods for model training. Initially, the most widespread methods were based on the use of Hidden Markov Models (HMM) for lip-reading and their coupled or multistream versions for audio-visual speech recognition [26]. However, at present, approaches based on the use of neural networks of different architectures have become increasingly popular [27]. In this research, we used an AAM-based algorithm for ROI localization, the developed geometric features to extract information about the uttered speech, and a multilayer neural network for classification.
3 Methodology
3.1 Region of Interest Localization

Since the most valuable information about the pronounced speech is contained in the mouth area, the first important step is the preprocessing of raw video frames and ROI localization. For this purpose, we used an AAM-based algorithm implemented in the Dlib open source computer vision library [28]. The main idea of the algorithm is to match a statistical model of object shape and appearance, containing a set of facial landmarks, to a new image. The face detector we use is made using the classic Histogram of Oriented
Gradients (HOG) feature combined with a linear classifier, an image pyramid, and sliding window detection scheme. Figure 1 shows how to find facial landmarks in an image using this method. These are points on the face such as the corners of the mouth, along the eyebrows, on the eyes, and so forth.
Fig. 1. Full 2D face shape model used in [29] (left) and the face landmarks localization algorithm results (right).
Thus, on each frame where a face was found, we get the coordinates of the facial landmarks, 20 of which are located in the mouth region (12 on the external and 8 on the internal borders of the lips). This method works very well on frontal face images and, since the HAVRUS database contains frontal video recordings, we managed to obtain very precise coordinates of the lip landmarks.

3.2 Feature Extraction Method

The general structure of the proposed method for extracting geometric features is shown in Fig. 2 and includes the sequential execution of the following 5 steps (a code sketch of steps 2–4 is given after the list):
1. Load a frame from a video file.
2. Use the facial landmarks detection algorithm described in Sect. 3.1 to find 68 facial key points.
3. Normalize the coordinates of the obtained landmarks in order to bring the data to a single format, as it was done in the work [30].
4. Calculate a number of Euclidean distances [31] between landmarks in accordance with Table 1.
5. Save the feature vector.
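A minimal sketch of steps 2–4, assuming the dlib frontal-face detector with the standard 68-point landmark model; the model file name, the simple mean/std normalization and the LANDMARK_PAIRS list are illustrative assumptions rather than details taken from the paper:

import cv2
import dlib
import numpy as np

# The 24 landmark pairs from Table 1 (1-based numbering of the 68-point scheme).
LANDMARK_PAIRS = [(49, 61), (61, 60), (60, 68), (68, 59), (59, 67), (67, 58),
                  (67, 57), (57, 66), (66, 56), (56, 65), (65, 55), (65, 54),
                  (54, 64), (64, 53), (64, 52), (52, 63), (52, 62), (62, 51),
                  (62, 50), (50, 61), (62, 68), (63, 67), (64, 66), (61, 65)]

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def geometric_features(frame):
    """Return the 24-dimensional distance vector for one video frame, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)],
                   dtype=np.float64)
    # A simple shift-and-scale normalization, standing in for the scheme of [30].
    pts = (pts - pts.mean(axis=0)) / (pts.std(axis=0) + 1e-8)
    # Euclidean distances between the selected pairs (converted to 0-based indices).
    return np.array([np.linalg.norm(pts[a - 1] - pts[b - 1])
                     for a, b in LANDMARK_PAIRS])

The 24 returned distances correspond to the pairs listed in Table 1, with the 1-based landmark numbers converted to dlib's 0-based indexing.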
Fig. 2. General diagram of the feature extraction method.
Table 1. Relationships between facial landmarks used for feature extraction.

#    Distance between landmarks (№)      #    Distance between landmarks (№)
1    49–61                               13   54–64
2    61–60                               14   64–53
3    60–68                               15   64–52
4    68–59                               16   52–63
5    59–67                               17   52–62
6    67–58                               18   62–51
7    67–57                               19   62–50
8    57–66                               20   50–61
9    66–56                               21   62–68
10   56–65                               22   63–67
11   65–55                               23   64–66
12   65–54                               24   61–65
In this work, we attempted to determine an optimal set of geometric features to maximize the recognition accuracy. Figure 3 shows 24 pairs of key points (highlighted in green) that have been selected experimentally and convey the most valuable information about the uttered speech. Table 1 lists these 24 selected features. The columns indicate the landmark numbers, in accordance with the map (Fig. 1, left), between which the Euclidean distance was taken as a feature; e.g. feature #24 is the width of the internal borders of the lips (landmarks 61 to 65). The results given in the experimental section were obtained using this feature set.
Fig. 3. Examples of the detected ROIs with 20 landmarks in the mouth region. (Color figure online)
3.3 MLP Training

For viseme classification we used Multi-layer Perceptrons (MLPs) trained with the Scikit-learn free software machine learning library [32]. The MLP is a supervised learning algorithm that learns a function f(·): R^m → R^o by training on a dataset, where m is the number of dimensions of the input and o is the number of dimensions of the output. Given a set of features X = x1, x2, . . . , xm and a target y, it can learn a non-linear function approximator for either classification or regression. Figure 4 shows an MLP with one hidden layer and scalar output [32].
Fig. 4. MLP with one hidden layer [32].
The leftmost layer, known as the input layer, consists of a set of neurons xi | x1 , x2 , … , xm representing the input features (24 in our case). Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation 𝜔1 x1 + 𝜔2 x2 + … + 𝜔m xm, followed by a non-linear activation function g(⋅):R → R. The output layer receives the values from the last hidden layer and trans‐ forms them into output values. In this work, we used the following basic parameters of a neural network: as an activation function we used the rectified linear unit function, returns f (x) = max(0, x). The number of neurons in the hidden layer ranged from 1 to 100. Batch size was calcu‐ lated by the formula batch_size = min(200, n_samples). Maximum number of iterations was 200 with 1e–4 tolerance for the optimization.
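Assuming scikit-learn's MLPClassifier (the paper names the library but not the exact training script, so the snippet below is only a sketch with placeholder data X_train and y_train):

from sklearn.neural_network import MLPClassifier

# One hidden layer; the number of hidden units was varied from 1 to 100
# in the experiments (85 gave the best result reported in Sect. 4.2).
clf = MLPClassifier(hidden_layer_sizes=(85,),
                    activation="relu",   # f(x) = max(0, x)
                    batch_size="auto",   # min(200, n_samples)
                    max_iter=200,
                    tol=1e-4)
clf.fit(X_train, y_train)                # X_train: N x 24 geometric feature vectors
probabilities = clf.predict_proba(X_test)

As described in Sect. 4.1, 48 such MLPs are trained (one per viseme class), and at test time the class whose MLP returns the highest probability wins.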
4 Evaluation Experiments
4.1 Experimental Setup

The experiments were carried out using the HAVRUS corpus, consisting of high-speed (200 fps) video recordings of 20 speakers. Each of them uttered 200 Russian phrases taken from phonetically rich texts. The resolution of the video data is 640 × 480 pixels. The database also contains phoneme and viseme labeling. In this work, we solved the so-called phoneme/viseme recognition task (the input of the system is an image; the output is the recognized phoneme/viseme). According to the existing HAVRUS labeling, we divided the available data into 48 classes, according to the number of phonemes in the Russian language. After that, the data for each class was divided with a ratio of training data to test data of 75:25%. Then, 48 MLPs were trained (one for each viseme class) according to Sect. 3.3. Thus, when an input image from the test set is fed in, we get the probability of its belonging to a certain class from each MLP, and the MLP with the highest probability wins. Accuracy in this case means correct
recognition of visemes on the test set, in comparison with the viseme labeling in the speech database.

4.2 Experimental Results

The best recognition result of the system trained in this way is 40.62%, obtained with an 85-neuron MLP. Figure 5 shows the dependence of the viseme recognition accuracy on the number of hidden neurons. Of course, the main goal of this work was not finding the best configuration of the neural network. However, we can also observe a trend of increasing accuracy with an increasing number of neurons, up to a certain limit.
Fig. 5. Viseme recognition accuracy trained on the HAVRUS corpus.
The main task of this study was to find the optimal set of geometric features. We can say that the preliminary results of this work are a necessary intermediate step to improve the existing baseline of audio-visual Russian speech recognition [26, 33] and will be used for this purpose in our future research.
5 Conclusions and Future Work
In this paper, we present an advanced set of geometric features designed to improve the accuracy of a lip-reading system for Russian and also report our preliminary experimental results. The experiments were conducted using the developed MLP-based lip-reading system, trained with the Scikit-learn machine learning library. The average recognition accuracy of the system trained on the HAVRUS database reaches 40.62%. The results of this work will be used in future studies to improve the audio-visual baseline for continuous Russian speech recognition [26, 33]. The fusion of different types of visual features is also of great practical interest for future research.
Acknowledgments. This research is financially supported by the Ministry of Education and Science of the Russian Federation, agreement No. 14.616.21.0095 (reference RFMEFI616 18X0095) and by the Ministry of Education of the Czech Republic, project No. LTARF18017.
References 1. Yu, D., Deng, L.: Automatic Speech Recognition. SCT. Springer, London (2015). https:// doi.org/10.1007/978-1-4471-5779-3 2. McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264, 746–748 (1976) 3. Potamianos, G., Neti, C., Matthews, I.: Audio-visual automatic speech recognition: an overview. Issues Audio Vis. Speech Process. 22, 23 (2004) 4. Zhou, Z., Zhao, G., Hong, X., Pietikainen, M.: A review of recent advances in visual speech decoding. Image Vis. Comput. 32, 590–605 (2014) 5. Bowden, R., et al.: Recent developments in automated lip-reading. In: Proceedings of SPIE, Optics and Photonics for Counterterrorism, Crime Fighting and Defence IX, vol. 8901, p. 13 (2013) 6. Katsaggelos, K., Bahaadini, S., Molina, R.: Audiovisual fusion: challenges and new approaches. Proc. IEEE 103(9), 1635–1653 (2015) 7. Seong, T.W., Ibrahim, M.Z.: A review of audio-visual speech recognition. J. Telecommun. Electron. Comput. Eng. 10(1–4), 35–40 (2018) 8. Lee, B., et al.: AVICAR: audio-visual speech corpus in a car environment. In: Proceedings of Interspeech 2004, pp. 380–383 (2004) 9. Cox, S., Harvey, R., Lan, Y., Newmann, J., Theobald, B.: The challenge of multispeaker lipreading. In: Proceedings of the International Conference Auditory-Visual Speech Process (AVSP), pp. 179–184 (2008) 10. Patterson, E., Gurbuz, E., Tufekci, Z., Gowdy, J.: CUAVE: a new audio-visual database for multimodal human-computer interface research. In: Proceedings of the IEEE ICASSP 2002, vol. 2, pp. 2017–2020 (2002) 11. Hazen, T., Saenko, K., La, C., Glass, J.: A segment-base audio-visual speech recognizer: data collection, development, and initial experiments. In: Proceedings of the International Conference Multimodal Interfaces, pp. 235–242 (2004) 12. Lucey, P., Potaminanos, G., Sridharan, S.: Patch-based analysis of visual speech from multiple views. In: Proceedings of AVSP 2008, pp. 69–74 (2008) 13. Abhishek, N., Prasanta, K.G.: PRAV: a phonetically rich audio visual corpus. In: Proceedings of Interspeech 2017, pp. 3747–3751 (2017) 14. Verkhodanova, V., Ronzhin, A., Kipyatkova, I., Ivanko, D., Karpov, A., Železný, M.: HAVRUS corpus: high-speed recordings of audio-visual Russian speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 338–345. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_40 15. Newman, J., Cox, S.: Language identification using visual features. Proc. IEEE Audio Speech Lang. Process. 20(7), 1936–1947 (2012) 16. Lan, Y., Theobald, B., Harvey, R.: View independent computer lip-reading. In: Proceedings of the International Conference Multimedia Expo (ICME), pp. 432–437 (2012) 17. Zhou, Z., Hong, X., Zhao, G., Pietikainen, M.: A compact representation of visual speech data using latent variables. Proc. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 181–187 (2014) 18. Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. Proc. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)
19. Estellers, V., Thiran, J.: Multi-pose lipreading and audio-visual speech recognition. EURALISP J. Adv. Signal Process. 51 (2012) 20. Hong, X., Yao, H., Wan, Y., Chen, R.: A PCA Based visual DCT feature extraction method for lip-reading. In: Proceedings of the International Conference Intelligent Information Hiding Multimedia, Signal Process, pp. 321–326 (2006) 21. Cetingul, H., Yemez, Y., Erzin, E., Tekalp, A.: Discriminative analysis of lip motion features for speaker identification and speech-reading. Proc. IEEE Trans. Image Process. 15(10), 2879–2891 (2006) 22. Yoshinaga, T., Tamura, S., Iwano, K., Furui, S.: Audio-visual speech recognition using lipmovement extracted from side-face images. In: Proceedings of the International Conference Auditory-Visual Speech Processing (AVSP), pp. 117–120 (2003) 23. Lan, Y., Theobald, B., Harvey, R., Ong, E., Bowden, R.: Improving visual features for lipreading. In: Proceedings of the International Conference Auditory Visual Speech Processing (AVSP), pp. 142–147 (2010) 24. Radha, N., Shahina, A., Khan, A.: An improved visual speech recognition of isolated words using combined pixel and geometric features. Proc. J. Sci. Technol. 9(44), 7 (2016) 25. Rahmani, M.H., Alamsganj, F.: Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In: 3D International Conference on Pattern Recognition and Image Analysis, pp. 195–199 (2017) 26. Ivanko, D., et al.: Using a high-speed video camera for robust audio-visual speech recognition in acoustically noisy conditions. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 757–766. Springer, Cham (2017). https://doi.org/ 10.1007/978-3-319-66429-3_76 27. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 87–103. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-54184-6_6 28. Implementation of Computer Vision Library. https://github.com/davisking/dlib. Accessed 30 Apr 2018 29. Baltrusaitis, T., Deravi, F., Morency, L.: 3D constrained local model for rigid and non-rigid facial tracking. In: Computer Vision and Pattern Recognition (CVPR), pp. 2610–2617 (2012) 30. Howell, D., Cox, S., Theobald, B.: Visual units and confusion modelling for automatic lipreading. Image Vis. Comput. 51, 1–12 (2016) 31. Description of Euclidean Distance Calculation. https://en.wikipedia.org/wiki/Euclidean_distance. Accessed 30 Apr 2018 32. Machine Learning Toolkit. http://scikit-learn.org/stable/. Accessed 30 Apr 2018 33. Ivanko, D., et al.: Multimodal speech recognition: increasing accuracy using high-speed video data. J. Multimodal User Interfaces (JMUI) (2018, in press)
On the Comparison of Different Phrase Boundary Detection Approaches Trained on Czech TTS Speech Corpora Markéta Jůzová, Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic, [email protected]
Abstract. Phrasing is a very important issue in the process of speech synthesis since it ensures higher naturalness and intelligibility of synthesized sentences. There are many different approaches to phrase boundary detection, including simple classification-based, HMM-based and CRF-based approaches; however, different types of neural networks are used for this task as well. The paper compares representative methods for the phrasing of Czech sentences using large-scale TTS speech corpora as training data, taking only the speaker-dependent phrasing issue into consideration. Keywords: Phrase boundary · Speech corpus · Classification · Conditional random fields · Neural networks
1 Introduction
The natural splitting of a sentence during speech into smaller parts, audibly separated usually by a pause, is called "phrasing" [22]. The main reason for phrasing is definitely the better intelligibility of the message passed in speech. However, one of the other reasons why people divide a sentence into phrases lies in the need to take a breath. Thus, speech without any pauses sounds unnatural and robotic. In spite of the fact that TTS systems do not need to breathe, it is common to deal with the phrasing issue as a part of the text normalization sub-system. In general, it is not a simple task – the position of phrase breaks is not clearly defined in the speech, and the pauses highly depend on the particular speaker, the speech rate and the particular situation or purpose of the speech [24]. The phrase boundary detection task can be defined as a sequence-to-sequence problem [32]: a list of words (or tokens) w0, w1, . . . , wn should be assigned a list of juncture types j0, j1, . . . , jn, where ji = 1 if a phrase break follows the word wi and ji = 0 otherwise. There have been many different approaches to this natural language processing (NLP) task, usually reported for English. Besides deterministic approaches based on punctuation marks or function/content words [33], there are many classification-based approaches
[5,9,21,25] using different features. However, the main disadvantage of these algorithms is that the decision about each juncture type is made separately. The authors of [33] first used an HMM model for phrase break prediction; a similar approach was also presented e.g. in [31]. Other techniques used are conditional random fields (CRF) [3,10,13] and all kinds of neural networks, e.g. [2,30]. The reason for testing neural networks (and CRF and HMM) for the purpose of phrasing lies in the nature of the phrase break detection task – it is a sequence-to-sequence modelling problem, so it seems reasonable to use a sequence modelling framework, as these methods may be better suited for it compared to the "classical" classification-based approaches. Phrases, in general, usually contain only several words (see Fig. 1), the phrase boundaries are inserted at certain intervals, and they depend on the other breaks in the sentence as well – as described e.g. in [32,33].
Fig. 1. A histogram of phrase and sentence lengths in the Czech speech corpus [18] (marked as corpus1 in Sect. 3).
In the TTS system ARTIC [17,35], developed at the author's department, the appropriate phrasing of recorded and input text sentences has emerged as very important. That is why the phrase boundaries in the recorded sentences are thoroughly detected by the automatic segmentation process [14] and are still being corrected [4] to ensure the most accurate possible description of the speech data. On the other hand, the input text sentences (those to be synthesized) are still split into phrases by a simple algorithm based on commas. Note that in Czech texts the commas are much more frequent compared e.g. to English texts, so they are a good indicator for phrase boundary detection. However, phrasing based only on commas can produce extremely long phrases, for example in the case of a long compound sentence containing several simple
sentences joined with a coordinate conjunction (e.g. a, EN: and) where no comma is written in Czech. Afterwards, the created text phrases are synthesized using the formal prosody grammar features (so-called "prosodemes", see [8,26,27,34] for more details) to ensure (in unit selection) the selection of appropriate unit candidates using Viterbi search [36] – the symbolic prosody feature (prosodemes) ensures a correct behaviour of the F0 contour at phrase-final words to keep the required communication function; the prosodeme agreement is one of the components of the target cost in TTS ARTIC [15]. There have not been many other approaches to the phrasing of the text sentences for TTS ARTIC (except punctuation-based ones), e.g. [29]. However, recently, different classification-based approaches were compared in [7], and a CRF-based boundary detector was trained [6], which proved to be the best option among the tested methods. In the last decade, neural networks (NN) have been used more and more often for various NLP tasks, so it was decided to try new approaches to phrase boundary detection in Czech sentences. Their overall comparison is the main scope of the presented paper.
2 Data Acquisition
Gathering data for phrase boundary detection can be a demanding task – the annotator agreement (both for text and speech data) is quite low, as shown in [6,28]. And, as proved e.g. in [6,24] and mentioned in Sect. 1, phrase boundary detection is a speaker-dependent task. For these reasons, the author decided to use data from speech corpora recorded by professional speakers for the purposes of the Czech TTS ARTIC [17,35]. All the recorded sentences had been manually checked by human annotators and also automatically corrected [19] (to reveal slips of the tongue or swapped short words – the most frequent speaker errors), and then the automatic pitch-synchronous segmentation process [11,12,14,16] was performed. The resulting segmented speech corpora contain information about the positions of pauses and breaths in the speech, and this information (together with commas, see Sect. 1) is used for the presented experiment. As all the speakers were professionals, the speech breaks are expected to occur in reasonable places in the read sentences. The "true" phrase boundary is set (ji = 1) after every word wi which
– is followed by a comma in the text sentence, or
– is followed by a pause or a breath in the spoken sentence.
Note that it was decided to use both commas and speech pauses/breaths as phrase breaks in the presented study, contrary e.g. to [23], which considers only speech pauses to be phrase boundaries, since commas are, especially in the Czech language, good indicators of phrase breaks in speech. However, the breaths and speech pauses in the corpora (not associated with any comma) represent a "value added" and, hopefully, might ensure more accurate phrasing.
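A minimal sketch of this labeling rule (the variable names are illustrative, not taken from the paper):

def juncture_labels(words, has_comma, followed_by_pause_or_breath):
    """Return j_i = 1 if a phrase break follows word w_i, else 0."""
    return [1 if (has_comma[i] or followed_by_pause_or_breath[i]) else 0
            for i in range(len(words))]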
2.1 Features

The compared approaches (except LSTM emb and Bi emb; see Sect. 3) use the following set of features for a word wi, inspired e.g. by [5,31] and used also in the previous studies [6,7]:
– the word wi,
– whether the word wi has a comma (i.e. is followed by a comma),
– the following word wi+1,
– the morphological tag ti of the word wi,
– the morphological tag ti+1 of the word wi+1,
– the bigram ti + ti+1,
– the trigram ti−1 + ti + ti+1,
– the sentence length N,
– the position Ni of the word wi in the sentence,
– the distance from the preceding word followed by a comma, i − iLC (iLC ≤ i; iLC = 0 if none of the words w0 . . . wi−1 has a comma),
– the distance to the next word followed by a comma, iNC − i (iNC ≥ i; iNC = N − 1 if none of the words wi+1 . . . wN−1 has a comma).
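As an illustration, this feature set can be encoded as one dictionary per word, which is the input format of common CRF toolkits such as sklearn-crfsuite; the paper does not name the CRF implementation used, so the snippet below is only a sketch with placeholder training data X_train and y_train:

import sklearn_crfsuite

def word_features(words, tags, comma_after, i):
    """Feature dictionary for word i; comma_after[i] is True if a comma follows w_i."""
    n = len(words)
    i_lc = max((j for j in range(i) if comma_after[j]), default=0)
    i_nc = next((j for j in range(i, n) if comma_after[j]), n - 1)
    return {
        "word": words[i],
        "comma": comma_after[i],
        "next_word": words[i + 1] if i + 1 < n else "",
        "tag": tags[i],
        "next_tag": tags[i + 1] if i + 1 < n else "",
        "bigram": tags[i] + "+" + (tags[i + 1] if i + 1 < n else ""),
        "trigram": (tags[i - 1] if i > 0 else "") + "+" + tags[i] + "+"
                   + (tags[i + 1] if i + 1 < n else ""),
        "sent_len": n,
        "position": i,
        "dist_from_comma": i - i_lc,
        "dist_to_comma": i_nc - i,
    }

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)   # X_train: list of sentences, each a list of feature dicts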
Besides the features listed above, some of the presented methods use only word embeddings, as they have proved to be powerful input representations in many NLP tasks [1,37], including phrasing; these high-dimensional vectors, as shown e.g. in [20], are able to capture general syntactic and semantic properties of words, as well as their relations – so it is tested whether they are able to substitute for the list of features.
3 Compared Approaches
The scope of this paper is to compare different phrase boundary detection approaches. The author chose representatives among simple deterministic phrasing methods, classification-based approaches, and different types of neural networks. The full list of the methods used is given below:
– Comma – a simple approach which splits the given sentence after every comma (currently used in TTS ARTIC)
– LogReg – Logistic Regression classifier¹
– SVC – Support Vector Machines with a linear kernel¹
– CRF – Conditional Random Fields
– MLP – Multi-layer Perceptron (MLP) with a 30-dimensional input layer and a 100-dimensional hidden layer (all layers are fully connected), with dropout set to 0.2; trained for 100 epochs

¹ Note that no cross-validation results for the classifiers' parameters are presented in this paper since they were a part of the previous study in [7], and the parameters were set according to the best results shown in the aforementioned paper.
– LSTM – a neural network with two long short-term memory (LSTM) layers (each with 200 units), an output fully connected layer and dropout set to 0.2; trained for 100 epochs
– LSTM emb – equal to LSTM, but with an input embedding layer with 200 units
– Bi emb – a bidirectional neural network (with 100 LSTM units in each layer) with an input embedding layer (200 units), an output fully connected layer and dropout set to 0.1

Note that padding was performed on all sentences for all of the above approaches for a fair comparison, as some NN approaches require input data of the same length. The last two models do not use the set of features listed in Sect. 2.1, only word embeddings; a sketch of such an architecture is given below.
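The paper does not state which deep-learning toolkit was used for the NN models; purely as an illustration, the Bi emb architecture could be expressed in Keras roughly as follows (vocab_size, max_len and the padded arrays X_pad, y_pad are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     Dropout, TimeDistributed, Dense)

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=200, input_length=max_len),
    Bidirectional(LSTM(100, return_sequences=True)),   # 100 LSTM units per direction
    Dropout(0.1),
    TimeDistributed(Dense(2, activation="softmax")),   # break / no break per word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_pad, y_pad, epochs=100)   # padded word-index sequences and juncture labels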
Table 1. Comparison of speaker-dependent phrase boundary detection for 3 Czech speech corpora (the results of simple classification-based approaches could slightly differ from the results presented in [6] due to padding of corpus sentences to the same length).

Corpus          Classifier   tp     fn    fp    tn      A      R      P      F1
corpus1-male    Comma        2407   781   0     75472   0.990  0.755  1.000  0.860
                LogReg       2772   416   116   75356   0.993  0.870  0.960  0.912
                SVC-lin      2784   404   134   75338   0.993  0.873  0.954  0.912
                CRF          2857   331   170   75302   0.994  0.896  0.944  0.919
                MLP          2521   667   95    75377   0.990  0.791  0.964  0.869
                LSTM         2344   844   269   75203   0.986  0.735  0.897  0.808
                LSTM emb     2337   851   284   75188   0.986  0.733  0.892  0.805
                Bi emb       2365   823   246   75226   0.986  0.742  0.906  0.816
corpus2-male    Comma        2319   109   0     62534   0.998  0.955  1.000  0.977
                LogReg       2335   93    1     62533   0.999  0.962  1.000  0.980
                SVC-lin      2336   92    1     62533   0.999  0.962  1.000  0.980
                CRF          2343   85    3     62531   0.999  0.965  0.999  0.982
                MLP          2361   67    0     62534   0.999  0.972  1.000  0.986
                LSTM         2189   239   152   62382   0.994  0.902  0.935  0.918
                LSTM emb     2177   251   167   62367   0.994  0.897  0.929  0.912
                Bi emb       2305   123   85    62449   0.997  0.949  0.964  0.957
corpus3-female  Comma        2377   155   0     65403   0.998  0.939  1.000  0.968
                LogReg       2397   135   8     65395   0.998  0.947  0.997  0.971
                SVC-lin      2396   136   9     65394   0.998  0.946  0.996  0.971
                CRF          2407   123   16    65387   0.998  0.951  0.993  0.972
                MLP          2387   145   1     65403   0.998  0.943  1.000  0.970
                LSTM         2231   301   133   65270   0.994  0.881  0.944  0.911
                LSTM emb     1832   700   217   65186   0.987  0.724  0.894  0.800
                Bi emb       2237   295   97    65306   0.994  0.883  0.958  0.919
4 Results
The overall comparison of the tested approaches is shown in Table 1 using 4 standard evaluation measures – accuracy (A), recall (R), precision (P) and F1-score (F1) – and the numbers of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). All the results are calculated at the word level. The results show that the CRF-based approach outperformed the other methods for two of the tested corpora, even those using different neural networks. For the corpus2-male data, the MLP approach provided the best results. In the author's opinion, that is because of a slightly different nature of this corpus – the male speaker made almost all pauses/breaths at the comma punctuation, compared to the others. In any case, the results are quite surprising, as many studies on the phrasing of English sentences have indicated that LSTM or bidirectional networks (sometimes using embeddings instead of a set of features) achieve better results compared to other approaches, e.g. [30]. The highest results for our Czech data among the NN-based approaches were achieved by MLP; it was also shown that the bidirectional NNs are more powerful compared to plain LSTM. It is also obvious that the word embeddings are not able to fully substitute for the set of features. The problem might be the number of vector dimensions – the Czech language is much more complex compared to English, so smaller word embeddings may not be able to cover all the semantic and syntactic properties of words (to compare, only several tens of part-of-speech tags are defined for English but about 3000 for Czech).
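For reference, the four measures in Table 1 follow from the confusion counts in the standard way:

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# The CRF row of corpus1-male rounds to (0.994, 0.896, 0.944, 0.919):
print(metrics(2857, 331, 170, 75302))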
5 Conclusion
The testing and comparison of different speaker-dependent phrase boundary detection approaches on Czech speech corpora showed that, in general, the CRF model is able to outperform the others. However, the widespread use of neural networks motivates the author to test more NN approaches for the Czech phrasing task. Some results are promising, which indicates that more experiments (with different settings) should be performed to find the optimal solution. As future work, it is also planned to apply the presented methods to other large-scale corpora built for the purposes of TTS ARTIC – both Czech and English. Acknowledgments. The work has been supported by the grant of the University of West Bohemia, project No. SGS-2016-039, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References 1. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 999888, 2493–2537 (2011) 2. Fernandez, R., Rendel, A., Ramabhadran, B., Hoory, R.: Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In: Proceedings of Interspeech 2014, pp. 2268–2272. ISCA, September 2014 3. Gregory, M.L.: Using conditional random fields to predict pitch accents in conversational speech. In: Proceedings of ACL 2004. ACL, East Stroudsburg, pp. 677–684 (2004) 4. Hanzl´ıˇcek, Z.: Correction of prosodic phrases in large speech corpora. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 408–417. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-455105 47 5. Hirschberg, J., Prieto, P.: Training intonational phrasing rules automatically for English and Spanish text-to-speech. Speech Commun. 18(3), 281–290 (1996) 6. J˚ uzov´ a, M.: CRF-based phrase boundary detection trained on large-scale TTS speech corpora. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 272–281. Springer, Cham (2017). https://doi.org/ 10.1007/978-3-319-66429-3 26 7. J˚ uzov´ a, M.: Prosodic phrase boundary classification based on Czech speech corpora. In: Ekˇstein, K., Matouˇsek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 165–173. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-642062 19 8. J˚ uzov´ a, M., Tihelka, D., Vol´ın, J.: On the extension of the formal prosody model for TTS. In: Text, Speech and Dialogue. Lecture Notes in Computer Science, Springer, Heidelberg (2018) 9. Koehn, P., Abney, S., Hirschberg, J., Collins, M.: Improving intonational phrasing with syntactic information. In: Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1289–1290 (2000) 10. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001) 11. Leg´ at, M., Matouˇsek, J., Tihelka, D.: A robust multi-phase pitch-mark detection algorithm. Proc. Interspeech 2007, 1641–1644 (2007) 12. Leg´ at, M., Matouˇsek, J., Tihelka, D.: On the detection of pitch marks using a robust multi-phase algorithm. Speech Commun. 53(4), 552–566 (2011) 13. Louw, A., Moodley, A.: Speaker specific phrase break modeling with conditional random fields for text-to-speech. In: Proceedings of 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), pp. 1–6 (2016) 14. Matouˇsek, J., Romportl, J.: Automatic pitch-synchronous phonetic segmentation. In: Proceedings of Interspeech 2008, pp. 1626–1629. ISCA (2008) 15. Matouˇsek, J., Leg´ at, M.: Is unit selection aware of audible artifacts? SSW 2013. In: Proceedings of the 8th Speech Synthesis Workshop, pp. 267–271. ISCA, Barcelona (2013) 16. Matouˇsek, J., Tihelka, D.: Classification-based detection of glottal closure instants from speech signals. In: Proceedings of Interspeech 2017, pp. 3053–3057. ISCA (2017)
17. Matouˇsek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006). https://doi.org/10. 1007/11846406 55 18. Matouˇsek, J., Romportl, J.: Recording and annotation of speech corpus for Czech unit selection speech synthesis. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 326–333. Springer, Heidelberg (2007). https://doi. org/10.1007/978-3-540-74628-7 43 19. Matouˇsek, J., Tihelka, D.: Annotation errors detection in TTS corpora. In: Proceedings of Interspeech 2013, pp. 1511–1515. ISCA (2013) 20. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of 2013 NAACL HLT, pp. 746–751 (2013) 21. Mishra, T., Jun Kim, Y., Bangalore, S.: Intonational phrase break prediction for text-to-speech synthesis using dependency relations. In: Proceedings of ICASSP 2015, pp. 4919–4923 (2015) ˇ 22. Palkov´ a, Z.: Rytmick´ a v´ ystavba prozaick´eho textu. Studia CSAV; ˇcis. 13/1974, Academia (1974) 23. Parlikar, A., Black, A.W.: Data-driven phrasing for speech synthesis in low-resource languages. Proc. ICASSP 2012, 4013–4016 (2012) 24. Prahallad, K., Raghavendra, E.V., Black, A.W.: Learning speaker-specific phrase breaks for text-to-speech systems. In: The 7th ISCA Tutorial and Research Workshop on Speech Synthesis, pp. 162–166 (2010) 25. Read, I., Cox, S.: Stochastic and syntactic techniques for predicting phrase breaks. Comput. Speech Lang. 21(3), 519–542 (2007) 26. Romportl, J., Matouˇsek, J.: Formal prosodic structures and their application in NLP. In: Matouˇsek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 371–378. Springer, Heidelberg (2005). https://doi.org/10.1007/ 11551874 48 27. Romportl, J.: Structural data-driven prosody model for TTS synthesis. In: Proceedings of Speech Prosody 2006, pp. 549–552. TUDpress, Dresden (2006) 28. Romportl, J.: Automatic prosodic phrase annotation in a corpus for speech synthesis. In: Proceedings of Speech Prosody 2010. University of Illionois, Chicago (2010) 29. Romportl, J., Matouˇsek, J.: Several aspects of machine-driven phrasing in text-tospeech systems. Prague Bull. Math. Linguist. 95, 51–61 (2011) 30. Rosenberg, A., Fernandez, R., Ramabhadran, B.: Modeling phrasing and prominence using deep recurrent learning. In: Proceedings of Interspeech 2015, pp. 3066– 3070. ISCA (2015) 31. Sun, X., Applebaum, T.H.: Intonational phrase break prediction using decision tree and n-gram model. Proc. Eurospeech 2001, 3–7 (2001) 32. Taylor, P.: Text-to-Speech Synthesis, 1st edn. Cambridge University Press, New York (2009) 33. Taylor, P., Black, A.W.: Assigning phrase breaks from part-of-speech sequences. Comput. Speech Lang. 12(2), 99–117 (1998) 34. Tihelka, D.: Symbolic prosody driven unit selection for highly natural synthetic speech. In: Proceedings of Interspeech 2005 - Eurospeech, pp. 2525–2528. ISCA (2005) 35. Tihelka, D., Hanzl´ıˇcek, Z., J˚ uzov´ a, M., V´ıt, J., Matouˇsek, J., Gr˚ uber, M.: Current state of text-to-speech system ARTIC: A decade of research on the field of speech technologies. In: Text, Speech and Dialogue. Lecture Notes in Computer Science, Springer, Heidelberg (2018)
36. Tihelka, D., Kala, J., Matouˇsek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: Proceedings of Interspeech 2010, pp. 174–177. ISCA (2010) 37. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semisupervised learning. In: Proceedings of ACL 2010, pp. 384–394. ACL (2010)
Word-Initial Consonant Lengthening in Stressed and Unstressed Syllables in Russian Tatiana Kachkovskaia(B) and Mayya Nurislamova Saint Petersburg State University, St. Petersburg, Russia [email protected], [email protected]
Abstract. This paper deals with consonant lengthening effects caused by word-initial position in interaction with stress-induced lengthening. Experiment 1, based on a 30-h speech corpus, showed that in general word-initial lengthening is more pronounced in stressed syllables than in unstressed. The lengthening effect is also stronger for consonants in CV syllables compared with CCV syllables. Additionally, it was shown that consonant duration serves to signal word stress, and the reduction pattern for consonants is similar to that for vowels. Experiment 2, based on controlled laboratory data, showed that not all the speakers choose the strategy of signaling word boundaries and word stress with consonant lengthening; presumably, it depends on the speaking style. It was also shown that in CCV syllables the first consonant might be responsible for signaling word boundary, while the second–lexical stress. Keywords: Consonant duration · Word-initial lengthening Word stress · Prosodic boundaries · Russian
1 Introduction
Recent studies for various languages show that phrase boundaries are marked at both ends–initially and finally. In terms of tone, the beginning of an intonational phrase is often signalled by declination reset, while the end of an IP might be marked by a specific boundary tone, or contain a complex or wide melodic movement–in cases when the nucleus occurs phrase-finally. Similar effects are observed for duration, intensity and spectral characteristics, although the latter two are considered weaker cues [1–3]. Lengthening at ends of utterances and intonational phrases–final lengthening–is considered a universal phenomenon [4]. For Russian, however, it is known that the lengthening effect is highly dependent on whether the phrase is followed by a pause or not [5]. The phenomenon of lengthening at the other end–the beginning of a phrase–is called “initial strengthening” [8,9]. For IPs and utterances it is not yet considered universal, as very few languages have been analysed so far. For Russian it was documented in [6] for utterance-initial vowels. c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 264–273, 2018. https://doi.org/10.1007/978-3-319-99579-3_28
Similar effects for word boundaries have been found in a few languages as well [7]. The aim of this study was to find out whether this is also observed in Russian for word-initial consonants. When dealing with segmental duration in Russian, one should bear in mind the phenomenon of vowel reduction, as it has a great impact on vowel duration. We know of two factors that influence vowel reduction in Russian: the position of the syllable relative to the stressed syllable (1st pretonic vs. other pretonic and all post-tonic), and the position relative to the word boundary (absolute-initial vs. non-absolute-initial vowels)¹ [10,11]. Although we do not have much data on consonant duration in this respect, we might hypothesize that durational changes in vowels may be accompanied by consonant duration changes as well. In most papers on the temporal organization of speech much attention is paid to vowel duration. However, in terms of articulatory mechanisms, consonant lengthening is not at all impossible. Moreover, there is evidence for consonant lengthening in previous studies. For Russian this evidence includes (a) cases of contrastive stress [12] and (b) phrase-final lengthening [13]. Therefore, our hypothesis that in word-initial position we might expect some consonant lengthening is based on the following observations:
1. evidence from other languages;
2. evidence for utterance-initial vowel lengthening in Russian;
3. evidence for consonant lengthening in other prosodically strong contexts.
In order to test this hypothesis, we have taken two steps. First we performed a corpus-based experiment. Then, based on those results, we recorded laboratory material – a set of phrases designed to eliminate other prosodic factors that are beyond this study but may influence consonant duration. At this stage we confined ourselves to the syllables /sa/, /ta/ and /sta/ as the most frequent syllables with a fricative and a plosive, and the respective consonant cluster. The decision to include the syllable /sta/ was also motivated by our interest in the phenomenon of consonant compression in longer syllables. This syllable is easier to interpret on a purely acoustic basis since it does not contain an articulatory overlap, as opposed to syllables such as /pl/, /dr/ etc.
2 Experiment 1

2.1 Method
At this stage we used a 30-h segmented speech corpus, CORPRES [14]. The corpus contains fictional texts recorded from 8 speakers, all native Russians with standard pronunciation. The recordings are manually segmented into sounds and annotated prosodically. In order to obtain a general impression of the processes governing consonant duration, we chose all 3-syllable words² which were produced in a prosodically neutral context – i.e. not under nuclear stress and not in the initial or final position within the intonational phrase. This way we eliminated the influence of other processes: lengthening caused by prosodic prominence and IP-boundary effects, which are beyond the present study.
Absolute duration values are hard to compare across speakers and across different consonant types. This is why we calculated consonant duration in normalized form using the following formula [15]:

d̃(i) = (d(i) − μ_p) / σ_p

where d̃(i) is the normalized duration of segment i, d(i) is its absolute duration, and μ_p and σ_p are the mean and standard deviation of the duration of the corresponding phone p. The mean and standard deviation values were calculated over the whole corpus for each speaker separately.

¹ The case of word-final vowels is more complicated – for a detailed discussion see [11].
² Three syllables is the minimal word length which makes it possible to compare pretonic syllables in word-initial and word-medial position.
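A sketch of this per-speaker, per-phone z-score normalization (the data layout and field names below are illustrative assumptions):

import numpy as np

def normalize_durations(segments):
    """segments: list of dicts with keys 'speaker', 'phone', 'duration' (in seconds)."""
    groups = {}
    for s in segments:
        groups.setdefault((s["speaker"], s["phone"]), []).append(s["duration"])
    stats = {k: (np.mean(v), np.std(v)) for k, v in groups.items()}
    for s in segments:
        mu, sigma = stats[(s["speaker"], s["phone"])]
        # small epsilon guards against degenerate groups with zero variance
        s["norm_duration"] = (s["duration"] - mu) / (sigma + 1e-9)
    return segments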
2.2 Results and Discussion
Figure 1 shows normalized duration values for consonants in CV syllables in four possible conditions:
1. Stressed syllable
(a) in word-initial position (e.g. /p/ from /pa/ in the Russian word meaning "apiary")
(b) in word-medial position (e.g. /d/ from /da/ in the word meaning "task")
2. First pretonic syllable
(a) in word-initial position (e.g. /p/ from /pa/ in the word meaning "bucket hat")
(b) in word-medial position (e.g. /k/ from /ka/ in the word glossed as the imperative "show")
Fig. 1. Normalized consonant duration in CV syllables in 3-syllable words; data are averaged across the whole corpus.
Fig. 2. Normalized duration of the first consonant in CV and CCV syllables in 3-syllable words in four contexts (initial stressed, medial stressed, initial pretonic, medial pretonic); data are averaged across the whole corpus.
For stressed syllables (see left pane in Fig. 1) the average normalized duration values were 0.544 for word-initial syllables and 0.024 for word-medial syllables. The difference was statistically significant (Welch's t-test, p < 0.001), and the sample sizes were 1156 and 2568, respectively. For pretonic syllables (see right pane in Fig. 1) the average normalized duration values were −0.189 for word-initial syllables and −0.326 for word-medial syllables. The difference was statistically significant as well (Welch's t-test).
D>N C>D>N C>N>D C>D>N C>D>N
Note: C – comfort, N – neutral, D – discomfort.
Table 2. Confusion matrices for emotion recognition by listeners, with ASD and MR informants breakdown.

Listeners  Emotional state   ASD informants           MR informants
                             Comf    Neut    Disc     Comf    Neut    Disc
Native     Comfort           20      39      41       37.5    38.5    24
           Neutral           37      37      26       20      56      24
           Discomfort        7       40      53       8       41      51
Foreign    Comfort           23      13      64       43      31      26
           Neutral           35      26      39       20      56      24
           Discomfort        12      18      70       15      45      40
The second task for the listeners was to assign each speech sample, when listening to the speech test sequence, to one of the six basic emotional states "fear – anger – sadness – neutral – joy – surprise", with the additional option "difficult to answer" (Fig. 6). The listeners attributed the greatest number of speech samples of ASD informants to the state of sadness, while the speech of the informants with MR was most often attributed to the neutral state. They identified the states of fear and surprise equally often in the speech of ASD and MR informants. The neutral state of MR informants was recognized by listeners better than the other states, F(2,39) = 3.3487, p < 0.0455 (Wilks' Lambda = 0.85344). In the task of determining the age of the informants from the speech samples, the listeners correctly estimated the age of 42% of ASD and 47% of MR informants, who are in the age range of 17–28 years. Some listeners incorrectly estimated the informants' age as lower than 16 years (for ASD – 6%, for MR – 18%), or higher than 35 years.
Fig. 6. Percentage of listener’s answer to attributed speech samples of ASD (black color) and MR (gray color) informants to emotional six states “fear – anger – sadness – neutral – joy – surprise” and difficult to answer.
The most frequent words in the speech of all informants were words that reflected a positive emotional state: /love – 0.43/ in the speech of ASD informants, 0.21 in the speech of MR informants; /like – 0.8/ in the speech of MR informants. Words reflecting the negative emotional state were absent in the situation of dialogue in informants with ASD, in MR informants were represented by the words: /bad – 0.08/, /unfortunately – 0.08/, /unpleasantly – 0.04/, /fearfully – 0.04/. The most frequent words reflected emotional state in the picture description situation were: /kind – 0.27/, /kind-hearted – 0.27/ in informants with ASD and /fight – 0.22/, /regret - 0.17/, /angry - 0.08/, /love 0.08/ in MR informants speech.
4 Discussion
The results of the study showed a worse level of speech formation in adults with ASD, in comparison with MR ones. In studying the speech features of ASD children we showed their ability to clearly pronounce vowels in words [11] and increase the clarity of articulation in learning with the age of children [5]. The listeners recognized the emotional state of informants with ASD worse than emotional state of children on the base of listening speech samples [14]. It is found that individuals with severe or profound intellectual disabilities may exhibit more subtle facial expressions of internal states, which are poorly interpreted by adults if they do not have experience of caring for, or communicating with this population [15]. The adults with intellectual disabilities may be vulnerable to deficiencies in the awareness and understanding of their emotional experience, problems with adequate relaying this information to others [16]. Studies aimed at identifying the relationship between the state of persons at an early age and in transition to adulthood began to be investigated [17, 18]. In special study about young people with intellectual disability transitioning to adulthood it was found that people with Down syndrome experience less behavioral problems than people with intellectual disability of another cause across all subscales of emotional and behavioral problems,
except for communication disturbance [19]. The transition to adulthood is of greatest concern to the parents of children with autism and the least concern for parents of chil‐ dren with Down syndrome [17]. Our study is the first step towards investigating the problem of the transition from childhood to adulthood for Russian people with atypical development. The findings from this study provide valuable information for health and other professionals working with people with intellectual disability.
5 Conclusions
We revealed specific speech features in informants with ASD and MR. For ASD informants, replies in dialogues are simple, complex replies are absent, "yes–no" answers predominate, they do not use gestures to substitute for or complement verbal answers, and their replies are less adequate than those of MR informants. More phonetic disturbances at the level of the word and the phrase were described for ASD informants than for MR ones. The articulation of unstressed vowels of ASD informants is clearer than the articulation of the stressed vowels, which causes difficulties in recognizing the meaning of the speech samples. Attribution of the emotional speech of the ASD informants to the states "comfort – neutral – discomfort" is difficult for listeners. Acknowledgements. This study is financially supported by the Russian Science Foundation (project 18-18-00063).
References 1. Klein Tasman, B.P., van der Fluit, F., Mervis, C.B.: Autism spectrum symptomatology in children with Williams syndrome who have phrase speech or fluent language. J. Autism Dev. Disord. (2018). https://doi.org/10.1007/s10803-018-3555-4 2. Chen, X., et al.: Speech and language delay in a patient with WDR4 mutations. Eur. J. Med. Genet. (2018). https://doi.org/10.1016/j.ejmg.2018.03.007 3. Walton, K.M., Ingersoll, B.R.: The influence of maternal language responsiveness on the expressive speech production of children with autism spectrum disorders: a microanalysis of mother-child play interactions. Autism 19(4), 421–432 (2005) 4. Fusaroli, R., Lambrechts, A., Bang, D., Bowler, D.M., Gaigg, S.B.: Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis. Autism Res. 10(3), 384– 407 (2017). https://doi.org/10.1002/aur.1678 5. Lyakso, E., Frolova, O., Grigorev, A.: Perception and acoustic features of speech of children with autism spectrum disorders. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 602–612. Springer, Cham (2017). https://doi.org/ 10.1007/978-3-319-66429-3_60 6. Saad, A.G., Goldfeld, M.: Echolalia in the language development of autistic individuals: a bibliographical review. Pro. Fono. 21(3), 255–260 (2009) 7. Lee, L., Rianto, J., Raykar, V., Creasey, H., Waite, L., Berry, A., Xu, J., Chenoweth, B., Kavanagh, S., Naganathan, V.: Health and functional status of adults with intellectual disability referred to the specialist health care setting: a five-year experience. Int. J. Fam. Med., Article ID 312492, 9 (2011). https://doi.org/10.1155/2011/312492
8. Taylor, J.L., Mailick, M.R.: A longitudinal examination of 10-year change in vocational and educational activities for adults with autism spectrum disorders. Dev. Psychol. 50(3), 699– 708 (2014). https://doi.org/10.1037/a0034297. pmid:24001150 9. Autism Spectrum Australia. We Belong: Investigating the experiences, aspirations and needs of adults with Asperger’s disorder and high functioing autism (2012) 10. Lyakso, E., et al.: EmoChildRu: emotional child Russian speech corpus. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 144–152. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7_18 11. Lyakso, E., Frolova, O., Grigorev, A.: A comparison of acoustic features of speech of typically developing children and children with autism spectrum disorders. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 43–50. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_4 12. Lyakso, E.E., Grigor’ev, A.S.: Dynamics of the duration and frequency characteristics of vowels during the first seven years of life in children. Neurosci. Behav. Physiol. 45(5), 558– 567 (2015). https://doi.org/10.1007/s11055-015-0110-z 13. Roy, N., Nissen, S.L., Dromey, C., Sapir, S.: Articulatory changes in muscle tension dysphonia: evidence of vowel space expansion following manual circumlaryngeal therapy. J. Commun. Disord. 42(2), 124–135 (2009). https://doi.org/10.1016/j.jcomdis.2008.10.001 14. Kaya, H., Salah, A.A., Karpov, A., Frolova, O., Grigorev, A., Lyakso, E.: Emotion, age, and gender classification in children’s speech by humans and machines. Comput. Speech Lang. 46, 268–283 (2017). https://doi.org/10.1016/j.csl.2017.06.002 15. Adams, D., Oliver, Ch.: The expression and assessment of emotions and internal states in individuals with severe or profound intellectual disabilities. Clin. Psychol. Rev. 31, 293–306 (2011). https://doi.org/10.1016/j.cpr.2011.01.003 16. McClure, K.S., Halpern, J., Wolper, P.A., Donahue, J.J.: Emotion regulation and intellectual disability. J. Dev. Disabil. 15(2), 38–44 (2009) 17. Blacher, J., Kraemer, B.R., Howell, E.J.: Family expectations and transition experiences for young adults with severe disabilities: does syndrome matter? Adv. Mental Health Learn. Disabil. 4(1), 3–16 (2010). https://doi.org/10.5042/amhld.2010.0052 18. Thompson, C., Bölte, S., Falkmer, T., Girdler, S.: To be understood: transitioning to adult life for people with autism spectrum disorder. PLoS ONE 13(3), e0194758 (2018). https:// doi.org/10.1371/journal.pone.0194758 19. Foley, K.-R., Taffe, J., Bourke, J., Einfeld, S.L., Tonge, B.J., Trollor, J., Leonard, H.: Young people with intellectual disability transitioning to adulthood: do behaviour trajectories differ in those with and without down syndrome? PLoS ONE 11(7), e0157667 (2016). https:// doi.org/10.1371/journal.pone.0157667
Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise Thomas Manzini(B) and Alan Black(B) Carnegie Mellon University, Pittsburgh, PA 15213, USA {tmanzini,awb}@cs.cmu.edu
Abstract. This paper explores how different synthetic speech systems can be understood in a noisy environment that resembles radio noise. This work is motivated by a need for intelligible speech in noisy environments such as emergency response and disaster notification. We discuss prior work done on listening tasks as well as speech in noise. We analyze three different speech synthesizers in three different noise settings. We measure quantitatively the intelligibility of each synthesizer in each noise setting based on human performance on a listening task. Finally, treating the synthesizer and its generated audio as a black box, we present how word level and sentence level input choices can lead to increased or decreased listener error rates for synthesized speech. Keywords: Speech
· Synthesized · Noise · Radio · Intelligibility

1 Introduction
Synthetic speech systems have undergone a great deal of research in the past years. Other research efforts have attempted to predict the intelligibility of different synthesizers in different settings [16,17]. However, to the author’s best knowledge, all work in this area has been done from the perspective of improving synthesized audio [9,11], rather than the synthesizer inputs themselves. This paper aims to determine if intelligibility can be predicted from the content fed to the synthesizer. In this work we explore how to predict if certain words and sentences will be understood by users and how these predictions can be used to formulate or reformulate a sentence for speech in a noisy environment. This is done by treating the synthesizer as a black box and measuring only the inputs and the outputs. Our work is specifically motivated by automated disaster response. Much work has been done using artificial intelligence to handle emergency and disaster situations [7,8]. The integration of speech is a necessary and natural expansion of this research. We foresee speech systems needing to operate in noisy environments where synthetic speech may be broadcast over a radio frequency or near rescue equipment. Both present multiple different issues regarding types of noise. In this work, we use the noisy environment of a radio channel as a test bed for intelligibility. c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 367–376, 2018. https://doi.org/10.1007/978-3-319-99579-3_39
2 Related Work
Several works have explored multiple different types of noise and multiple different types of speech [16,17]. There are two relevant concepts within the field that are at play in this work: the field of speech in noise and the work surrounding listening tests.

2.1 Speech in Noise
The intelligibility of speech in noise is the measure of how well audio - containing either natural or synthetic speech - can be understood in a noisy environment. A noisy environment can range from the chatter of a restaurant [16] to the sounds of helicopters and the battlefield [18]. In all of these environments a listener may confuse or misinterpret speech because of noise. In past works, authors have shown several key concepts. First, when measuring the kinds of errors listeners make, [12] has shown that while keyword error rate (KER) may be a more accurate measure, simple word error rate (WER) follows KER closely and is less time intensive to calculate. As such, we use WER for our measurements. At the same time, [19] has shown that there are instances where there are disparities between WER and other metrics, such as concept error rate. We see in our data that WER tends to follow concept error rate.
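For concreteness, the WER used throughout this paper can be computed as a word-level Levenshtein distance normalized by the reference length. The following is a small illustrative sketch, not the authors' actual scoring code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one word of five is missing from the transcription -> WER = 0.2.
print(word_error_rate("turn on the kitchen light", "turn on the light"))
```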
2.2 Listening Tests
Listening tests are a common way to evaluate the intelligibility of a voice [10,13,15]. Compared to automated methods and metrics, human evaluation is traditionally regarded as the most effective method for evaluation. As a result, listening tests have been used to evaluate synthesizers and the intelligibility of both synthetic and natural speech in noise.
3 Experimentation
We explore the effect of three different types of noise on three different synthesizers. This is an attempt to understand how humans understand different synthesizers generally, as opposed to possibly overfitting to one synthesizer or one noise setting.
3.1 Structure
We asked English-speaking listeners who were over eighteen years old to transcribe audio from a series of thirty different audio files. These audio files were generated by selecting random sentences from the Smart-Home dataset [14], having one of three different synthesizers generate an associated audio file, and then applying one of three different noise levels to the audio. The result was captured and stored for the listening task. This was done thirty times for each listening task,
resulting in thirty unique audio files for each listener per task. While our research is motivated by an emergency/disaster response use case, we selected this dataset because it has a demonstrably diverse vocabulary that would, in theory, lend itself well to determining the intelligibility of various words. During the initial stages of this research we did explore other datasets, including one of radio traffic between emergency medical services (EMS) personnel and their dispatch center, but found that it was not lexically diverse enough for the purposes of this research. Two different types of listening tasks were performed: one to generate training data and another for testing data. In the training task, forty-five listeners listened to 450 different audio files. Each file was labeled by three different listeners. In the testing task, fifty listeners listened to 150 different audio files. Each file was labeled by ten different listeners. These two different tasks were done so that the test data would be the most representative of the behavior of users in a noisy environment.
3.2 Synthesizers
We used three different synthesizers for our experimentation: the E-Speak Synthesizer [5], the Flite Synthesizer [1], and the Google Synthesizer [6]. All synthesizers used an English-speaking male voice, but these three synthesizers each have their own specific settings.

E-Speak. We used the E-Speak Synthesizer with primarily the default settings. We specified two unique settings when generating our sound files: the use of the en voice, which corresponds to an English-speaking male, and the use of a voice speed of level 120 (down from 175). This was done to better align the speeds of the voices of the different synthesizers.

Flite. We used the CMU Flite (Festival Lite) synthesizer with the default settings. We specified that the synthesizer must use the cmu_us_eey.flitevox voice that came prepackaged with the standard release of Flite.

Google. We used the Google Text to Speech system defined within the Python gTTS module. We specified that the synthesizer must use the en-us voice that came with the release of gTTS. All other settings were left at default values.
3.3 Noise
We used three different noise levels, each consisting of three different filters applied with different values. First we impose an ambient noise filter designed to replicate radio static. For this filter, we take the original sound and at each time step sample a value from a random normal distribution centered at the original sound. The standard deviation of this normal distribution was varied at each noise level. Next we perform a low pass filter with a variable threshold. Finally we perform a high pass filter with a variable threshold. Varying the parameters of these three different filters provides several knobs we can turn to increase or decrease the noise within the audio files. We make no claim about how well these different filters replicate the noise present on a radio channel, as that can vary based on the radio manufacturer, the frequency used, and the type of system in use. We only state that this noise is subjectively similar to that of an active radio channel. Further work would be required to determine the best noise filters needed to replicate each specific radio channel. We chose three different noise settings that would be presented to users. These noise settings were not intended to be ranked by difficulty, but were intended to represent three distinct kinds of noise that could cause a listener to make transcription errors. We believe that the reasons why certain noise settings are more likely to cause listening errors are out of the scope of this work and could be the subject of further research.

Noise Level 1. Random noise filter standard deviation: 0.3; low pass frequency cutoff: 300.0 Hz; high pass frequency cutoff: 2500.0 Hz.
Noise Level 2. Random noise filter standard deviation: 0.4; low pass frequency cutoff: 400.0 Hz; high pass frequency cutoff: 2000.0 Hz.
Noise Level 3. Random noise filter standard deviation: 0.5; low pass frequency cutoff: 500.0 Hz; high pass frequency cutoff: 1500.0 Hz.
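The exact filter implementations are not specified beyond these parameters, so the filter type (Butterworth), its order, and the way the two cutoffs are cascaded in the sketch below are our own assumptions; it is only meant to illustrate the structure of the noise chain:

```python
import numpy as np
from scipy.signal import butter, lfilter

def apply_radio_noise(audio, fs, noise_std, low_cut_hz, high_cut_hz, order=4):
    """Radio-like degradation as described above: Gaussian 'static' centered
    on the signal, followed by a low-pass and a high-pass filter."""
    # 1) At each sample, draw from a normal distribution centered at the original signal.
    noisy = np.random.normal(loc=audio, scale=noise_std)
    # 2) Low-pass filter with the given cutoff (assumed Butterworth).
    b, a = butter(order, low_cut_hz / (0.5 * fs), btype="low")
    noisy = lfilter(b, a, noisy)
    # 3) High-pass filter with the given cutoff.
    b, a = butter(order, high_cut_hz / (0.5 * fs), btype="high")
    return lfilter(b, a, noisy)

# Noise Level 1 from above: std 0.3, low-pass cutoff 300 Hz, high-pass cutoff 2500 Hz.
fs = 16000
audio = 0.1 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # placeholder 1-second waveform
degraded = apply_radio_noise(audio, fs, 0.3, 300.0, 2500.0)
```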
4 Listening Test Results
We presented users with different audio files and recorded their precision/word error rate. We include the complete breakdown of user performance below in Table 1. For all experiments we segment the data based on both synthesizer and noise level. We collected approximately fifty different sentences at each noise level and synthesizer combination for training and approximately sixteen different sentences at each noise level for testing.

Table 1. Precision (1.0 - WER) of word level transcription per noise and synthesizer on the training data.

Transcription precision score
          Noise level 1   Noise level 2   Noise level 3   Average
Espeak    0.227           0.196           0.242           0.222
Flite     0.346           0.375           0.343           0.355
Google    0.542           0.639           0.559           0.580
Average   0.372           0.403           0.381
4.1 Listening Test Results Discussion
Table 1 summarizes the transcription precision (1.0 - WER) of the different synthesizers and noise levels. We see a clear trend among the different synthesizers: the highest performing synthesizer was the Google synthesizer, followed by Flite, and finally by ESpeak. These results indicate that the quality of the synthesizer plays a role in its intelligibility in noise. Empirically, it appears that Noise Setting 2 is the most intelligible noise setting, followed by Setting 3, and then Setting 1. It should be noted that the results for settings 1 and 3 are very similar and differ only slightly. At the same time there does appear to be some variation between the different synthesizers. While Noise Setting 2 is the least intelligible for Espeak, it is the most intelligible for Google. From this observation, we can conclude that the intelligibility of a given synthesizer in a given noise setting depends primarily on the synthesizer and not on the noise setting. We make no claim regarding why certain noise settings are more intelligible than others; we believe this to be an avenue for further research.
5 Predictive Results
We make intelligibility predictions at the sentence level and the word level. At the sentence level, a model could estimate which paraphrasings are most likely to be understood by listeners. At the word level, a model could rank synonyms of specific words so that they are more likely to be understood. At both levels of granularity we explore the application of point-wise and pair-wise ranking for estimating intelligibility. While list-wise reranking is an obvious extension of this work, we do not have enough sentence-level or word-level data to make list-wise reranking models feasible. We present the results of this predictive exercise below. Work in this field often uses metrics such as the DAU metric [3] or the Glimpse proportion measure [2] to attempt to model intelligibility. These metrics are based on the audio features of the synthesizer. Since we attempt to predict intelligibility based on non-audio features, these metrics are out of the scope of this work.
5.1 Sentence Level Intelligibility Prediction
At the sentence level, we try to determine if one sentence is more intelligible than another. We explore this in two ways: first, we trained a machine learning model to estimate the average word error rate of a given sentence, and second, we trained a pair-wise reranking model to attempt to determine if one particular sentence is more intelligible than another. Sentence Level Word Error Rate Estimation. At the sentence level we attempt to train a machine learning model to predict the average word error rate of a given sentence. In order to do this, we construct a feature vector that
contains a number of sentence level features. We trained a simple linear model with sigmoid activation. We found that we achieved the best results when using simple models. We used several different features to estimate the word error rate of a sentence but few were effective given the low amount of data. Our features included average word rank, average word length, sentence length, word count, and percent of unique characters. We define word rank as the ranked position of a term, based on how frequently that term appears in the Corpus of Contemporary American English [4]. We define word length as the length of a particular word in characters. We define average word rank and average word length as the average of these respective values. Other features were explored but eventually discarded. The results of this model on the test set are presented in Table 2.

Table 2. Performance of our linear error estimator for sentence level error estimation (point-wise reranking - sentences).

Synthesizer   Noise level   MSE (Test)   Spearman's R (Test)
Espeak        1             0.0227        0.2258
Espeak        2             0.0256       -0.3728
Espeak        3             0.0311       -0.4650
Flite         1             0.0477       -0.1225
Flite         2             0.0451        0.4621
Flite         3             0.0587       -0.1863
Google        1             0.0505        0.1176
Google        2             0.0915       -0.2943
Google        3             0.0701        0.0662
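As an illustration only, the feature vector and point-wise estimator described above could be sketched as follows; the weights are untrained placeholders and the word-rank table is assumed to come from a frequency list such as COCA, so none of the numbers below are the paper's learned values:

```python
import numpy as np

def sentence_features(sentence, word_rank):
    """Features from Sect. 5.1: average word rank, average word length,
    sentence length (characters), word count, percent of unique characters."""
    words = sentence.lower().split()
    ranks = [word_rank.get(w, len(word_rank) + 1) for w in words]  # unseen words get a large rank
    return np.array([
        np.mean(ranks),
        np.mean([len(w) for w in words]),
        len(sentence),
        len(words),
        len(set(sentence)) / max(len(sentence), 1),
    ])

def predict_wer(features, weights, bias):
    """Linear model with sigmoid activation, keeping the WER estimate in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

# Hypothetical usage with placeholder parameters.
word_rank = {"turn": 512, "on": 23, "the": 1, "light": 1450}   # toy rank table
x = sentence_features("Turn on the light", word_rank)
w, b = np.zeros(5), 0.0                                        # would be learned from the training WERs
print(predict_wer(x, w, b))                                    # 0.5 with the zero placeholder weights
```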
Pair-Wise Sentence Reranking. We constructed a linear model with tanh activation to estimate which sentence is the more intelligible one. We do this by feeding one feature representation of each sentence into the linear model. The model then estimates whether the first sentence is more intelligible (labeled +1), the second sentence is more intelligible (labeled -1), or the intelligibility of the sentences is equal (labeled 0). We then train this linear model and evaluate it on the test set. The results of this evaluation are presented in Table 3.

Table 3. Performance of our linear pair-wise sentence level reranking model (pair-wise reranking - sentences).

Synthesizer   Noise level   MSE (Test)   MSE variance (Test)
Espeak        1             0.3441       0.1211
Espeak        2             0.5960       0.1217
Espeak        3             0.4641       0.2591
Flite         1             0.7074       0.1496
Flite         2             0.3979       0.351
Flite         3             0.6095       0.3688
Google        1             0.6679       0.1638
Google        2             0.4076       0.1440
Google        3             0.5080       0.1966

5.2 Word Level Intelligibility Prediction
At the word level we are attempting to estimate which words would be most intelligible, either on their own, or when compared to another word. For use in a real world setting, the models discussed here could be used to estimate the intelligibility of synonyms of different words in a sentence so as to maximize the intelligibility of the sentence overall. As a result, estimating which words are going to be the most intelligible is an obvious initial step to estimating the overall intelligibility of a sentence, phrase, or other unit of speech.

Word Error Rate Estimation. Working at the word level we have access to significantly more data. Here we trained a machine learning model to attempt to estimate the WER of a particular word in a given sentence. This is different from the sentence level task of the same name because we have features for the word, but also features for the context of the word (e.g. the surrounding words). We construct a linear model with sigmoid activation to attempt to estimate the error. We used several different word level features regarding the words themselves and their surrounding contexts. Our word level features included: word rank, percent of vowels in the word, percent of consonants in the word, length of the word, and the percent of unique characters in the word. We define word rank in the same manner described in Sect. 5.1. Our context level features included: the same word level features for both the previous and next word, the total number of words in the sentence, and the number of unique words in the sentence. The results of this evaluation can be found in Table 4.

Pair-Wise Word Reranking. To perform pairwise reranking, we changed the layout of our model slightly. We now pass two times the number of features to our model, one set for the first word and one for the second word. The word level features that are fed to the model are similar to those in the section above, but they have had the features regarding sentence context removed and contain only features regarding the neighboring words. Like the sentence pair-wise reranking schema, the model then has to estimate whether the first word is more intelligible (labeled +1), the second word is more intelligible (labeled -1), or the intelligibility of the two words is equal (labeled 0). We trained this linear model and evaluated it on the test set. The results are presented in Table 5.
Table 4. Performance of our linear error estimator for word level error estimation (point-wise reranking - words).

Synthesizer   Noise level   MSE (Test)   Spearman's R (Test)
Espeak        1             0.0429       -0.2516
Espeak        2             0.0426        0.0676
Espeak        3             0.0318       -0.1120
Flite         1             0.0932       -0.1612
Flite         2             0.1032        0.1346
Flite         3             0.0468       -0.1589
Google        1             0.0902        0.2173
Google        2             0.1182        0.3114
Google        3             0.0801        0.0178
Table 5. Performance of our linear pair-wise word level reranking model (pair-wise reranking - words).

Synthesizer   Noise level   MSE (Test)   MSE variance (Test)
Espeak        1             0.1946       0.1151
Espeak        2             0.1546       0.0901
Espeak        3             0.1610       0.0959
Flite         1             0.1896       0.0972
Flite         2             0.2022       0.1167
Flite         3             0.2009       0.1052
Google        1             0.1783       0.0941
Google        2             0.1794       0.1031
Google        3             0.1840       0.1076
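To make the pair-wise word setup concrete, one possible (hypothetical) realization concatenates two word-level feature vectors and pushes a tanh output towards +1, -1 or 0; the feature set is simplified here and the weights are placeholders rather than fitted values:

```python
import numpy as np

def word_features(word, word_rank):
    """Simplified word-level features: rank, vowel/consonant ratios, length, unique characters."""
    vowels = sum(c in "aeiou" for c in word)
    return np.array([
        word_rank.get(word, len(word_rank) + 1),
        vowels / len(word),
        (len(word) - vowels) / len(word),
        len(word),
        len(set(word)) / len(word),
    ])

def pairwise_score(w1, w2, weights, bias, word_rank):
    """Linear model with tanh activation over the concatenated pair of feature vectors;
    +1 means the first word is predicted to be more intelligible, -1 the second, ~0 equal."""
    x = np.concatenate([word_features(w1, word_rank), word_features(w2, word_rank)])
    return np.tanh(x @ weights + bias)

# Hypothetical usage; the weights would be fitted on pairs labelled from the observed WERs.
word_rank = {"sofa": 3200, "couch": 2100}
weights, bias = np.zeros(10), 0.0
print(pairwise_score("sofa", "couch", weights, bias, word_rank))  # 0.0 with placeholder weights
```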
6 Results Discussion
From our data, we can see that the sentence level error estimation and pair-wise reranking methods are ineffective at the current data scale. For the Spearman's correlation we can see that there is no consistent behavior between the different error models. In the case of the pair-wise reranking for sentences we still see poor performance. Not only is the MSE fairly high for a problem like this, but the variance of the MSE is also much larger than would be anticipated. We believe that these are problems that could be solved with additional data, but at the moment our models are not capable of performing this task at the sentence level. For the word level point-wise reranking we can estimate intelligibility for the Google synthesizer to some extent. This is indicated by the Spearman's correlation, which is either positive or near zero. However, this is not the case for the other synthesizers. The pair-wise word reranking is more stable than the sentence level reranking. For all synthesizers and for all noise levels we see model behavior indicative of estimating the correct word in a reranking context. At the same time, the variances of the MSE are within a reasonable bound and upon closer inspection we do not see many outliers that could skew these results. Based on our results and given the data that we have presented here, we find that we are able to rerank individual terms based on lexical features to estimate their intelligibility. Our results demonstrate that this methodology works best in the pairwise reranking context at this particular data scale. We believe that additional labeled data will improve performance.
Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise
375
level reranking. For all synthesizers and for all noise levels we see model behavior indicative of estimating the correct word in a reranking context. At the same time, the variances of the MSE are within a reasonable bound and upon closer inspection we do not see a many outliers that could skew these results. Based on our results and given the data that we have presented here, we find that we are positively able to rerank individual terms based on lexical features to estimate their intelligibility. Our results demonstrate that this methodology works best in the pairwise reranking context for this particular data scale. We believe that additional labeled data will improve performance.
7 Future Work
The most significant piece of future work is collecting more data with intelligibility labels for different noise settings and synthesizers. Additional evaluations of different features and different models for predicting intelligibility at the lexical level would also be useful.
8 Conclusion
This work has explored the intelligibility of three different synthesizers in three different noise settings. We have evaluated these synthesizers in these noise settings on a human listening task and we have measured performance along metrics that reflect intelligibility. Further, we have explored methods that show some predictive power regarding intelligibility at the lexical level. We show that even with limited data it is possible to rerank words and estimate which word will be more intelligible in a given context.

Acknowledgments. We would like to acknowledge several people for their help and support on this work, particularly Carolyn Penstein, Rajat Kulshreshtha, Abhilasha Ravichander, and the officers of CMU EMS, as well as the several people who helped edit this work, especially Elise Romberger. Finally, we thank the reviewers for reading and examining our experiments, methodology, and submission.
References 1. Black, A.W., Lenzo, K.A.: Flite: a small fast run-time synthesis engine. In: 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis (2001) 2. Cooke, M.: A glimpsing model of speech perception in noise. J. Acoust. Soc. Am. 119(3), 1562–1573 (2006) 3. Dau, T., P¨ uschel, D., Kohlrausch, A.: A quantitative model of the “effective” signal processing in the auditory system. i. model structure. J. Acoust. Soc. Am. 99(6), 3615–3622 (1996) 4. Davies, M.: The corpus of contemporary American English (Coca): 450 million words, 1990–2012. Brigham Young University (2002) 5. Duddington, J.: eSpeak text to speech (2012)
6. Durette, P.N.: gTTS: a python interface for google’s text to speech api (2017). https://github.com/pndurette/gTTS. Accessed 15 Apr 2018 7. Fiedrich, F., Burghardt, P.: Agent-based systems for disaster management. Commun. ACM 50(3), 41–42 (2007) 8. Imran, M., Castillo, C., Lucas, J., Meier, P., Vieweg, S.: AIDR: Artificial intelligence for disaster response. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 159–162. ACM (2014) 9. Kamath, S., Loizou, P.: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In: ICASSP, vol. 4, pp. 44164–44164. Citeseer (2002) 10. Killion, M.C., Niquette, P.A., Gudmundsen, G.I., Revit, L.J., Banerjee, S.: Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 116(4), 2395– 2405 (2004) 11. McAulay, R., Malpass, M.: Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process. 28(2), 137–145 (1980) 12. Park, Y., Patwardhan, S., Visweswariah, K., Gates, S.C.: An empirical analysis of word error rate and keyword error rate. In: Ninth Annual Conference of the International Speech Communication Association (2008) 13. Pichora-Fuller, M.K., Schneider, B.A., Daneman, M.: How young and old adults listen to and remember speech in noise. J. Acoust. Soc. Am. 97(1), 593–608 (1995) 14. Ravichander, A., Manzini, T., Grabmair, M., Neubig, G., Francis, J., Nyberg, E.: How would you say it? eliciting lexically diverse dialogue for supervised semantic parsing. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 374–383 (2017) 15. Schmidt-Nielsen, A.: Intelligibility and acceptability testing for speech technology. Technical report, Naval Research Lab, Washington DC (1992) 16. Valentini-Botinhao, C., Yamagishi, J., King, S.: Can objective measures predict the intelligibility of modified hmm-based synthetic speech in noise? In: Twelfth Annual Conference of the International Speech Communication Association (2011) 17. Valentini-Botinhao, C., Yamagishi, J., King, S.: Evaluation of objective measures for intelligibility prediction of hmm-based synthetic speech in noise. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5112–5115. IEEE (2011) 18. Varga, A., Steeneken, H.J.: Assessment for automatic speech recognition: Ii. noisex92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993) 19. Wang, Y.Y., Acero, A., Chelba, C.: Is word error rate a good indicator for spoken language understanding accuracy. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003, pp. 577–582. IEEE (2003)
End-to-End Speech Recognition in Russian

Nikita Markovnikov, Irina Kipyatkova, and Elena Lyakso
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Saint-Petersburg, Russia
[email protected], [email protected], [email protected]
Abstract. End-to-end speech recognition systems incorporating deep neural networks (DNNs) have achieved good results. We propose applying CTC (Connectionist Temporal Classification) models and an attention-based encoder-decoder to automatic recognition of continuous Russian speech. We used different neural network models, such as Long Short-Term Memory (LSTM), bidirectional LSTM and Residual Networks, in our experiments. We obtained recognition accuracy slightly worse than hybrid models, but our models can work without a large language model and they showed better performance in terms of average decoding speed, which can be helpful in real systems. Experiments are performed with an extra-large vocabulary (more than 150K words) of Russian speech.

Keywords: End-to-end models · Deep learning · Russian speech · Speech recognition

1 Introduction
Automatic speech recognition (ASR) systems are traditionally built from an acoustic model (AM), implemented with hidden Markov models (HMM) and Gaussian mixture models (GMM), and a language model (LM). These models show good recognition accuracy, but they consist of multiple parts that are tuned independently, so errors in one part can cause errors in another. Also, standard recognition scenarios need a large amount of memory and computing capacity, which does not allow such systems to be used locally on devices and requires remote computation on servers. An end-to-end approach has recently been adopted using deep neural networks (DNN). This approach makes it possible to implement models easily, using only one neural network trained with gradient descent and a single loss function. End-to-end models often demonstrate better performance in terms of speed and accuracy. These models potentially require less memory, which allows them to be used locally on mobile devices. However, they need more training data to be trained properly. Our goal was to build end-to-end models for recognition of continuous Russian speech, to tune them and to compare them with hybrid baseline models in terms of recognition accuracy and computing characteristics such as training and decoding speed.
The performance of the models was evaluated in terms of word error rate (WER) and character error rate (CER). The rest of the paper is organized as follows. In Sect. 2, we survey related works. In Sect. 3, we describe the architectures of CTC-models and attention-based encoder-decoder models. In Sect. 4, we discuss our experimental setup and datasets. In Sect. 5, we describe the implementation details of the models. In Sect. 6 we present the results obtained with the trained models. In Sect. 7 we provide a short analysis of the results. Finally, we conclude and discuss future work in Sect. 8.
2 Related Work
Papers [21,23] present an end-to-end system created with the help of the Eesen toolkit [21], where decoding of CTC-models is performed using weighted finite-state transducers (WFST) [22]. The Eesen implementation provides an effective search using the OpenFST library [1]. In paper [23], Eesen bidirectional LSTM [12] (BLSTM) neural networks were used for speech recognition for Serbian, which belongs to the same language group as Russian. They achieved a WER of 14.68%, which is not the best result compared with hybrid systems, but the CER was 3.68%. In paper [26] an end-to-end model using CTC was described. It was shown that such a system is able to work well without a LM. The training dataset [19] was made up of audio tracks of YouTube videos with a duration of more than 650 h. The best WER without using a LM was 13.9%, and with a LM it was 13.4%. In paper [3] an attention-based model using a LM was proposed. A WFST was used to integrate the end-to-end model with the language model, and at the decoding step a beam search over the outputs was performed that combined the scores of the encoder-decoder model and the LM. They obtained a WER of 11.3% and a CER of 4.8%. Independently, a similar attention-based end-to-end model called Listen, Attend and Spell (LAS) was proposed in [4]. The encoder was a pyramidal BLSTM and the decoder used a stack of LSTMs. The model outputs were also rescored using a LM after the decoding step. On a Google Voice Search task, the WER was 10.3%. As we can see, end-to-end models can work well with or without a LM for languages with strict word order (e.g. English). The Russian language is characterized by a high degree of grammatical freedom and a complex mechanism of word formation, so we need to use external LM models to increase accuracy. To our knowledge, end-to-end models have not yet been applied to Russian, so we decided to develop them.
3 Model Architecture

3.1 CTC
CTC [9] is a loss function that allows recurrent neural networks to be trained without an initial alignment of the input and output sequences. The output layer contains one unit per output symbol plus one special blank symbol. The output vector w_m is normalized using the softmax function, which is interpreted as the probability of the k-th symbol appearing at time m:

$$P(k, m \mid x) = \frac{\exp(w_m^k)}{\sum_{k'} \exp(w_m^{k'})},$$

where x denotes the input feature sequence of length T and w_m^k denotes the k-th component of w_m. Let α denote a sequence of indices of blanks and symbols of length T corresponding to x. Then P(α|x) can be written as

$$P(\alpha \mid x) = \prod_{t=1}^{T} P(\alpha_t, t \mid x).$$

Let B be an operator that removes repetitions of symbols and then blanks. The probability of an output sequence w is then

$$P(w \mid x) = \sum_{\alpha \in B^{-1}(w)} P(\alpha \mid x).$$

This sum can be computed using dynamic programming, and the neural network is trained to minimize the CTC loss

$$CTC(x) = -\log P(w^* \mid x),$$

where w^* denotes the target sequence. In [9], a forward-backward algorithm for computing the gradient of the CTC function was also proposed. Decoding can be performed using the approximation

$$\operatorname*{arg\,max}_{w} P(w \mid x) \approx B(\alpha^*),$$

where α^* = arg max_α P(α|x). In [10], a decoding method based on a beam search algorithm that allows integrating a language model was proposed.
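The operator B and the greedy approximation above can be illustrated with a few lines of NumPy; this is a simplification (real decoders use the beam search of [10]) and the toy alphabet below is hypothetical:

```python
import numpy as np

BLANK = 0  # index of the special blank symbol

def B(path, blank=BLANK):
    """Collapse repeated symbols, then remove blanks (the operator B above)."""
    collapsed = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in collapsed if s != blank]

def greedy_ctc_decode(log_probs):
    """Approximate argmax_w P(w|x) by B(alpha*), where alpha* takes the most
    probable symbol at every frame. log_probs has shape (T, num_symbols)."""
    alpha_star = np.argmax(log_probs, axis=1)
    return B(alpha_star.tolist())

# Toy example: 6 frames, 4 symbols (0 = blank, 1..3 = characters).
log_probs = np.log(np.array([
    [0.6, 0.3, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.6, 0.1, 0.2, 0.1],
]))
print(greedy_ctc_decode(log_probs))  # [1, 3]
```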
3.2 Attention-Based Encoder-Decoder Model
Encoder-Decoder. Encoder-decoder networks are used for problems where the lengths of the input and output sequences are variable [6,27]. The encoder is a neural network that transforms the input x = (x_1, ..., x_L) into an intermediate state h = (h_1, ..., h_L) and extracts features. The decoder is usually a recurrent neural network (RNN) that uses the intermediate state for generating the output sequence. The encoder can be any neural network, such as a multilayer perceptron (MLP), LSTM, BLSTM, convolutional network (CNN) [17], etc.

RNN with Attention Mechanism. In paper [7], the use of an Attention-based Recurrent Sequence Generator (ARSG) as a decoder was proposed. An ARSG is an RNN that stochastically generates an output sequence (y_1, ..., y_T) using the input h of length L. It consists of an RNN and a subnetwork called the attention mechanism. The attention mechanism chooses a subsequence of the input and then uses it for updating the hidden states of the RNN and predicting the next output. At the i-th step, the ARSG generates an output y_i focusing on separate components of h:

$$\alpha_i = \mathrm{Attend}(s_{i-1}, \alpha_{i-1}, h),$$
$$g_i = \sum_{j=1}^{L} \alpha_{i,j} h_j,$$
$$y_i \sim \mathrm{Generate}(s_{i-1}, g_i),$$

where s_{i-1} is the (i-1)-th state of the RNN called the Generator, α_i ∈ R^L denotes the attention weights and the vector g_i is called a glimpse. The step finishes with the computation of a new generator state:

$$s_i = \mathrm{Recurrency}(s_{i-1}, g_i, y_i).$$
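As a sketch of what one such step computes, the snippet below uses purely content-based scoring (e_{i,j} = v·tanh(W s_{i-1} + U h_j)); the location-aware Attend of [7] additionally conditions on the previous weights α_{i-1}, which is omitted here, and all matrices are random placeholders rather than trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(s_prev, h, W, U, v):
    """Content-based attention: score every encoder state h_j against s_{i-1}."""
    scores = np.array([v @ np.tanh(W @ s_prev + U @ h_j) for h_j in h])
    return softmax(scores)          # alpha_i, shape (L,)

def attention_step(s_prev, h, W, U, v):
    alpha = attend(s_prev, h, W, U, v)
    g = alpha @ h                   # glimpse g_i = sum_j alpha_{i,j} h_j
    return alpha, g                 # Generate/Recurrency (an RNN cell) would then use s_prev and g

# Toy dimensions: L = 5 encoder states of size 8, decoder state of size 8, attention size 16.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
s_prev = rng.normal(size=8)
W, U, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=16)
alpha, g = attention_step(s_prev, h, W, U, v)
print(alpha.sum(), g.shape)         # 1.0 (the five weights sum to one), (8,)
```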
4 Experimental Setup

4.1 Dataset
In this work, we use the training speech corpus collected at SPIIRAS as in [16]. The corpus consists of three parts:

– recordings of 50 native Russian speakers, each of whom pronounced a set of 327 phrases;
– recordings of 55 native Russian speakers, each of whom pronounced 105 phrases;
– the audio part of the audio-visual speech corpus HAVRUS [28]; 20 native Russian speakers (10 male and 10 female) participated in the recordings, each of them pronouncing 200 Russian phrases.

The total duration of the entire speech corpus is more than 30 h. To test the system, we used a speech database of 500 phrases pronounced by 5 speakers. The phrases were taken from the materials of the Russian online newspaper "Fontanka.ru"1 and were not used in the training data. Our language model was trained on data from Russian news sites [15]. The dataset for training the language model contains approximately 300 million words. As a language model, an n-gram model with Kneser-Ney smoothing [5] with n = 2 and n = 3 was used. The vocabulary size was 150,000 collocations.
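For reference, the interpolated Kneser-Ney estimate for a bigram takes the following standard form (the exact variant and discount d used for our models may differ):

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(c(w_{i-1} w_i) - d,\ 0\right)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i),$$

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{w : c(w_{i-1} w) > 0\}\right|, \qquad P_{cont}(w_i) = \frac{\left|\{w : c(w\, w_i) > 0\}\right|}{\left|\{(w, w') : c(w\, w') > 0\}\right|},$$

where c(·) denotes training counts.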
4.2 Baseline
The baseline consists of hybrid DNN-HMM acoustic models implemented using the Kaldi [24] and CNTK2 toolkits, as in [20]. A bigram language model was used for decoding. The best results, shown by BLSTM, ResNet [11] and RCNN [18], are presented in Table 1.

1 http://www.fontanka.ru/
2 https://docs.microsoft.com/ru-ru/cognitive-toolkit/
Table 1. The best results of baseline models in terms of WER, average speed of training (features per second) and decoding (utterances per second).

Model                                        WER, %   Decode   Train
BLSTM                                        23.08    0.211    450.7
ResNet                                       22.17    0.105    121.4
RCNN                                         22.56    0.162    325.1
RCNN + residual unit + max-pooling + BLSTM   22.07    0.197    502.3
5 Implementation
Firstly, a simple speech recognition system using the CTC loss function was implemented using TensorFlow3. Code and details can be found in a repository4. Secondly, we used the Eesen [21] toolkit on its TensorFlow branch, where Kaldi methods use the TensorFlow neural network implementation. That system allows using language models in the Kaldi format without additional conversion. Thirdly, we used the Tensor2Tensor5 framework to conduct experiments with attention-based models. That framework provides a common approach to building sequence-to-sequence models, in particular speech recognition systems. All models were tuned on a modestly configured machine: experiments were run on an NVIDIA GeForce GT 730M with 2 GB of memory using the CUDA library, and the available CPU memory was 16 GB with 4 cores.
6 Results

6.1 Results Corresponding to CTC-models
Firstly, we will describe the results corresponding to the CTC loss function-based models; they are presented in Table 2. In the experiments we used two types of features (extracted from audio with 1 channel and a 16,000 Hz sampling frequency):

1. 13-dimensional Mel Frequency Cepstral Coefficient (MFCC) [8] features, extracted using a window length of 0.025 and a window step of 0.01, together with 3 additional features representing pitch with their first and second-order derivatives, normalized via mean subtraction and variance normalization;
2. 40-dimensional filterbank [25] features with the same properties.
3 https://www.tensorflow.org/
4 https://github.com/mikita95/asr-e2e
5 https://github.com/tensorflow/tensor2tensor
The whole training data were split into training (95%) and cross-validation (5%) sets. We used several neural network types, such as MLP, LSTM, bidirectional LSTM and ResNet. The settings of the neural networks that provided the best results without using any LM were as follows:

– MLP had 4 hidden layers with 512 nodes using the ReLU activation function, with an initial learning rate of 0.007 and a decay factor of 1.5.
– LSTM had 4 layers with 512 units each and dropout equal to 0.5, with an initial learning rate of 0.001 and a decay factor of 1.5.
– ConvLSTM used convolutional layers before the LSTM described above to simplify the input features. It has one 2D convolutional layer with 8 filters, a 2 × 2 kernel, no padding and the ReLU activation function, followed by dropout with keep probability 0.5; then an LSTM with 4 layers of 128 units each and dropout equal to 0.5.
– BLSTM used 4 layers with 512 units and dropout with keep probability 0.5.
– ResNet had the architecture presented in Fig. 1, with 9 residual blocks and batch normalization [13].
Fig. 1. ResNet.
The Momentum algorithm was used for the optimization with momentum equal to 0.9. The OpenFST library was used to create the WFST for decoding models in the Eesen toolkit. Similarly to [23], the system components, i.e. CTC labels (T), lexicon (L) and language model (G), were transformed into one search graph as follows:

$$TLG = T \circ \min(\mathrm{det}(L \circ G)),$$

where min denotes minimization, det denotes determinization and ◦ denotes composition. For the 3-gram LM it was difficult to perform the composition because of a lack of memory and long computation times, so we did not conduct these experiments with every model.
Table 2. CTC-models results.

Model                              CER, %   WER, %   Decode   Train
Models without using LM, MFCC features, implementation (1)
MLP                                55.42    71.64    0.252    96.7
LSTM                               38.58    52.47    0.266    304.9
Conv+LSTM (+L2-weight delay)       36.92    49.23    0.278    315.2
BLSTM                              36.73    48.86    0.282    391.7
ResNet                             35.69    48.24    0.267    142.6
Models using 2-gram LM, MFCC features, implementation (2)
MLP                                48.49    62.04    0.174    125.3
LSTM                               26.12    38.71    0.181    402.8
Conv+LSTM                          25.77    36.93    0.193    391.1
BLSTM                              22.98    35.21    0.102    407.2
ResNet                             22.35    34.96    0.173    293.2
Models using 3-gram LM, FBANK features, implementation (2)
MLP                                26.57    37.19    0.098    104.7
BLSTM                              15.79    26.17    0.026    381.5
ResNet                             14.96    25.53    0.083    201.9
6.2 Results Corresponding to Attention-Based Models
In this section we describe the results corresponding to the attention-based models that we used. To tune these models, we used the MFCC-type features that we extracted when working with the CTC models. We did not use any language model integration here. The results corresponding to the attention-based models are presented in Table 3. We performed experiments with LSTM and BLSTM models. Our model used 4 layers of 128 units with an initial decreasing dropout keep probability of 0.9 in the encoder. As a decoder we used an attention-based LSTM configured as in the encoder. We used a Bahdanau-style attention mechanism [2] with a hidden layer size of 128. The output at each time step was the attention value, so the attention tensor was propagated to the next time step via the state to the top LSTM output. The batch size was 36 and the learning rate 0.05. The Adam algorithm [14] was used for the optimization with β1 = 0.85, β2 = 0.997 and ε = 10^-6. We initialized the weights randomly from the uniform distribution on the interval [-1; 1] without scaling variance.
Table 3. Attention-based models' results.

Model    CER, %   WER, %   Decode   Train
LSTM     19.15    28.47    0.279    389.2
BLSTM    19.08    27.83    0.285    401.8

7 Discussion

As we expected, we found that the CTC model without a LM works only moderately well for the Russian language. The model makes mistakes in constructing words and sentences from the recognized characters, but the obtained phonemic transcription is quite
similar to the original. The best result is a CER of 14.96% and a WER of 25.53%, achieved with the ResNet model and an external LM. As we mentioned in the introduction, Russian is characterized by a high degree of grammatical freedom and a complex mechanism of word formation, so a LM is an important part of a recognizer for the Russian language. However, we found that using attention-based models for Russian without integrating a LM allows achieving good results: the BLSTM version of the attention-based model showed a CER of 19.08% and a WER of 27.83%. Still, we could not surpass our baseline, which uses hybrid DNN-HMM models with a LM. However, our models showed better performance in terms of decoding speed, which can be helpful in real systems: the attention-based model without a LM showed a decoding speed of 0.285 utterances per second, which is 19% better than the fastest baseline ResNet model.
8 Conclusion
In this work, we considered the task of Russian speech recognition using end-to-end models such as CTC models and an attention-based encoder-decoder. We used various neural network architectures: multilayer perceptrons, LSTMs and their modifications, and residual convolutional networks. The best result was shown by residual convolutional networks. We achieved recognition accuracy slightly worse than the baseline hybrid models, but we showed that end-to-end models can work well for Russian speech without a language model and that they show better performance in terms of average decoding speed. In the future we will experiment with other types of features, integrate a language model into the attention-based system and perform experiments with the transfer learning technique.

Acknowledgments. This research is supported by the Russian Science Foundation (project No. 18-11-00145).
References 1. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: Implementation and Application of Automata, pp. 11–23 (2007) 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409. 0473
3. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949. IEEE (2016) 4. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE (2016) 5. Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. 13(4), 359–394 (1999). https://doi. org/10.1006/csla.1999.0128. http://www.sciencedirect.com/science/article/pii/ S0885230899901286 6. Cho, K., van Merrienboer, B., G¨ ul¸cehre, C ¸ ., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014). http://arxiv.org/abs/1406.1078 7. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015) 8. Ganchev, T., Fakotakis, N., Kokkinakis, G.: Comparative evaluation of various MFCC implementations on the speaker verification task. Proc. SPECOM 1, 191– 194 (2005) 9. Graves, A., Fern´ andez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning, pp. 369– 376. ACM (2006) 10. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1764–1772 (2014) 11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016) 12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 13. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167 (2015). http://arxiv.org/ abs/1502.03167 14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980 15. Kipyatkova, I., Karpov, A.: Lexicon size and language model order optimization ˇ for Russian LVCSR. In: Zelezn´ y, M., Habernal, I., Ronzhin, A. (eds.) Speech and Computer, pp. 219–226. Springer, Cham (2013) 16. Kipyatkova, I., Karpov, A.: DNN-based acoustic modeling for Russian speech recognition using Kaldi. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 246–253. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7 29 17. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Netw. 3361(10), 1995 (1995) 18. Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3367–3375, June 2015. https://doi.org/10.1109/CVPR.2015.7298958
19. Liao, H., McDermott, E., Senior, A.: Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 368–373. IEEE (2013) 20. Markovnikov, N., Kipyatkova, I., Karpov, A., Filchenkov, A.: Deep neural networks ˇ zka, J. (eds.) in Russian speech recognition. In: Filchenkov, A., Pivovarova, L., Ziˇ AINL 2017. CCIS, vol. 789, pp. 54–67. Springer, Cham (2018). https://doi.org/10. 1007/978-3-319-71746-3 5 21. Miao, Y., Gowayyed, M., Metze, F.: EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167–174. IEEE (2015) 22. Mohri, M., Pereira, F., Riley, M.: Weighted finite-state transducers in speech recognition. Comput. Speech Lang. 16(1), 69–88 (2002) 23. Popovi´c, B., Pakoci, E., Pekar, D.: End-to-End large vocabulary speech recognition for the Serbian language. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 343–352. Springer, Cham (2017). https://doi. org/10.1007/978-3-319-66429-3 33 24. Povey, D., et al.: The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, No. EPFL-CONF-192584, IEEE Signal Processing Society (2011) 25. Ravindran, S., Demirogulu, C., Anderson, D.V.: Speech recognition using filterbank features. In: The Thrity-Seventh Asilomar Conference on Signals, Systems Computers, vol. 2, pp. 1900–1903, November 2003. https://doi.org/10.1109/ ACSSC.2003.1292312 26. Soltau, H., Liao, H., Sak, H.: Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition (2016) 27. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp. 3104–3112 (2014) ˇ 28. Verkhodanova, V., Ronzhin, A., Kipyatkova, I., Ivanko, D., Karpov, A., Zelezn´ y, M.: HAVRUS corpus: high-speed recordings of audio-visual Russian speech. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 338–345. Springer, Cham (2016). https://doi.org/10.1007/978-3-31943958-7 40
Correction of Formal Prosodic Structures in Czech Corpora Using Legendre Polynomials

Martin Matura and Markéta Jůzová
Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
{mate221,juzova}@kky.zcu.cz
Abstract. Naturalness is a very important aspect of speech synthesis that is necessary for pleasant and undemanding listening and understanding of synthesized speech. However, in unit selection, unexpected changes in F0 caused by unit transitions can lead to inconsistent prosody. This paper proposes a two-phase classification-based method that improves the overall prosody by correcting the formal prosodic description of speech corpora. For speech data representation, the authors decided to use Legendre polynomials.

Keywords: Anomaly detection · One-class SVM · Multiclass SVC · Formal prosodic grammar · Prosodemes · Unit selection speech synthesis

1 Introduction
In human speech, the fundamental frequency varies within a sentence. The F0 contour, in general, is closely related to the position of stressed syllables and also to the phrasing of the sentence. The F0 movements (increases/decreases), especially at the phrase-final position, have a communication function in the particular language – a mismatch in these movements can cause misunderstanding of the sentence's meaning [15,24]. Therefore, it is evident that a correct prosodic description of speech corpora is one of the crucial issues in text-to-speech synthesis. In general, in the unit selection method, the join and target costs are computed to ensure that the optimal sequence of units is selected. These costs control the smoothness of the concatenated neighbouring units, as well as the unit's suitability for the required position in the synthesized sentence. In our TTS ARTIC [11,20], besides concatenation smoothness, symbolic prosody features, called prosodemes (Sect. 3, [17,18]), are used in the target cost to ensure the synthesized speech keeps the required communication function (i.e. listeners are able to distinguish declarative sentences from questions) [10]. However, due to some inaccuracies in the formal prosodic description of speech data, speech
units are sometimes used in a different context than the one in which they were pronounced by the speaker and to which they belong. This may be manifested in the synthetic speech e.g. by an unnaturally dynamic melody or by inappropriate stress perception. The presented paper focuses on the symbolic prosodic labels in our speech corpora and, using the powerful Legendre polynomials (Sect. 2), offers a two-phase algorithm for their correction. The initial experiments were carried out in [14] and showed that a description of an F0 contour based on Legendre polynomials is sufficient for classification-based approaches.
2 Legendre Polynomials
To describe the F0 contours, the authors used Legendre polynomials [9] – contrary, e.g., to the usage of Gaussian mixture models by the author of [7], or the HMM models used in [5,6] for the correction of wrongly labelled formal prosodic structures in speech corpora. These polynomials are frequently encountered in physics and other technical fields. Legendre polynomials are defined by Eq. 1,

$$L_n(x) = 2^n \sum_{k=0}^{n} x^k \binom{n}{k} \binom{\frac{n+k-1}{2}}{n}, \qquad (1)$$
and they form an orthogonal (i.e., non-correlated) basis suitable for modelling F0 contours [4,23]. An F0 contour is described by coefficients as a linear combination of these polynomials. Because of the orthogonality, the coefficients can be estimated using cross-correlation at a time lag of 0 (i.e., the mutual energy of the F0 contour and the Legendre polynomial). The first four polynomials L0(x), L1(x), L2(x) and L3(x) (see Fig. 1a) have a linguistic interpretation: L0(x) corresponds to the mean value of the pitch, L1(x) to a rise or fall depending on the positive or negative sign of the coefficient (the slope is determined by its absolute value), L2(x) to a peak or valley, and L3(x) to a wave shape of the F0 contour. For the purposes of the presented experiment, the authors used the mPraat toolbox for Matlab [1]; for each F0 contour, the frequency values were transferred to the semitone scale, the contour was interpolated in 1,000 equidistant points, and the first four Legendre coefficients were estimated (for example, see Fig. 1b; the coefficients are 10.7407 (mean value), -2.6880 (falling slope), -1.5522 (valley shape) and 0.1685 (only a slight wave curvature)).
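The same processing chain can be approximated in Python; the sketch below replaces the mPraat calls with NumPy, assumes a 1 Hz semitone reference, and uses a least-squares fit from NumPy's Legendre module instead of the cross-correlation estimate (on a dense grid over an orthogonal basis the two give essentially the same coefficients):

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coefficients(f0_hz, n_points=1000, n_coefs=4, ref_hz=1.0):
    """Describe an F0 contour by its first four Legendre coefficients
    (mean level, slope, peak/valley, wave), as in Sect. 2."""
    f0_st = 12.0 * np.log2(np.asarray(f0_hz) / ref_hz)        # semitone scale
    t_old = np.linspace(-1.0, 1.0, len(f0_st))
    t_new = np.linspace(-1.0, 1.0, n_points)                   # 1,000 equidistant points
    contour = np.interp(t_new, t_old, f0_st)                   # linear interpolation
    return legendre.legfit(t_new, contour, deg=n_coefs - 1)    # c0..c3

# Toy falling contour from 180 Hz to 120 Hz: c0 ~ mean level, c1 < 0 (falling slope).
f0 = np.linspace(180.0, 120.0, 50)
print(legendre_coefficients(f0))
```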
3 Symbolic Prosody Features in Speech Corpora
Fig. 1. Setup of the experiment.

The authors of [17,18] introduced a new formal prosodic model to be used in text-to-speech systems to control the appropriate usage of intonation schemes within the synthesized sentence; the original idea was based on the Czech classical phonetic view described in [15]. This grammar parses the given text sentence into a derivation tree, and each prosodic word (PW, i.e. a group of words with only one word stress) is assigned an abstract prosodic unit, a prosodeme, marked as PX. The former grammar was focused mainly on the differentiation of phrase-final and other PWs in the sentence, since phrase-final words are, in general, characterized by a distinct increase/decrease of F0 and have a certain communication function. However, as shown in [8], the phrase-initial PWs also differ from the following words, especially by an increase of F0 [24]. Recently, based on these observations, the grammar was extended to describe the phrase-initial PWs by a new prosodeme type (P0.1, see below). In our TTS ARTIC [11,20], we distinguish the following prosodeme types assigned to each PW (see also Fig. 2):

– P1 – prosodeme terminating satisfactorily (the last PWs of declarative sentences)
– P2 – prosodeme terminating unsatisfactorily (the last PWs of questions)
– P3 – prosodeme non-terminating (the last PWs in intra-sentence phrases)
– P0 – null prosodeme (assigned otherwise)
– P0.1 – special type of null prosodeme (assigned to the first PW in phrases)
Fig. 2. The illustration of the tree built using the extended prosodic grammar [8,18] for the Czech sentence "It will get colder and it will snow heavily, so he did not come."

The prosodeme types are used in speech synthesis to ensure the required communication function on the phrase level of synthesized sentences [10,22] – the usage of a correct prosodeme type is controlled by the target cost computation in the unit selection method. Unfortunately, even though professional speakers recorded the speech corpora for the purposes of TTS, the prosodic description of the recorded sentences (based on the formal prosody grammar applied to the texts of segmented sentences) sometimes does not correspond to the real F0 contours. The problems mainly appear within the null prosodeme, where "neutral" speech is expected, but the speaker could pronounce a word in an unexpected way regarding prosody. This inaccurate description (and thus the wrong usage of some speech units in the synthesis itself) may lead to an unnatural excessive increase or decrease of the F0 contour in a non-phrase-final prosodic word with
the null prosodeme which could be manifested by an inappropriate stress or an unnatural melody or, eventually, it may result in a misunderstanding due to not keeping the required communication function. In the presented paper, the experiments are carried out on two large speech corpora – AJ and MR [12,20]. The male synthetic voice, built from AJ corpus, is widely used in commercial products for its high naturalness. On the other hand, the female synthetic voice, built from MR corpus, is not very consistent in prosody (her prosody is very dynamic) – given the original prosodic description baseline, synthesized sentences quite often contain an unnatural intonation pattern (especially in the null prosodeme). The complete statistics of the corpora are listed in Table 1.

Table 1. Number of prosodic words labelled by a specific prosodeme type.

Corpus   No. of sentences   No. of PWs   P0       P1       P2    P3       P0.1
AJ       12,277             84,733       35,781   9,850    922   12,141   26,039
MR       12,308             83,486       41,728   11,017   905   7,953    21,883
4 Correction Process
The basic idea behind the correction process is simple. With inconsistent prosody, the speech created by the unit selection does not sound natural and it is unpleasant to listen to due to speech artefacts. If we were able to correct wrongly marked prosodic words, we might achieve more fluent and consistent prosody, which would lead to a better synthesis. The correction process has two
phases, and the choice of a suitable data description is a principal issue. Although prosodemes (Sect. 3) are the only symbolic prosody features, each prosodeme type corresponds to specific changes in the F0 contour – these can be modelled by the Legendre polynomials (Sect. 2), whose first four coefficients are used as the only representation of our data in the presented experiment. In the first phase, anomalies among the null prosodemes are detected (Sect. 4.1). In the second phase, the detected outliers are classified by a multiclass classifier that gives them new labels (Sect. 4.2). Both phases are described below in detail. After the correction, an evaluation by listening tests was performed (see Sect. 5).
4.1 Phase One: Anomaly Detection
Anomaly (or novelty) detection [2,13] is a well-known approach used to find items that do not have the same or similar properties as other items in a dataset. Our previous study [14] showed that, among other classification methods, the One-class Support Vector Machine (OCSVM) is the most suitable for this experiment. We are using the implementation of OCSVM from scikit-learn [16], which is based on libsvm [3], with a radial basis function as the kernel and γ = 0.1. The parameter ν, which influences an upper bound on the fraction of training errors, was set to 10% – this value is the authors' estimation of the proportion of possibly wrongly labelled PWs in the corpora. Since we are looking for anomalies only in our closed dataset, we can afford to train the OCSVM model on the whole dataset to get the best decision function possible. We trained two OCSVM models. The first one was trained using the 35,781 P0 prosodemes from the AJ corpus and the second one using the 41,728 P0 prosodemes from the MR corpus. After training the models, we tested how these models react to the different types of prosodemes and also to the training data. We detected anomalies in each group of prosodemes using the OCSVM model to obtain the number of outliers for each group. Since the model was trained with P0 prosodemes, where we supposed 10% of anomalies, we expected the number of outliers to be about 10% for P0 and significantly higher for the other groups. The results shown in Table 2 confirm our assumption – most of the P1 prosodemes were correctly detected as anomalous by the OCSVM model trained on P0. All the results are described in [14].
Table 2. Number of anomalies detected by OCSVM.

Corpus   P0              P1
AJ       3,578 (10.0%)   8,508 (86.4%)
MR       4,174 (10.0%)   10,317 (93.6%)
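A minimal sketch of this phase-one setup with scikit-learn; the feature matrix below is a random placeholder standing in for the four Legendre coefficients of the P0 prosodic words:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder for the real feature matrix: one row of four Legendre
# coefficients per P0 prosodic word (35,781 rows for the AJ corpus).
rng = np.random.default_rng(0)
X_p0 = rng.normal(size=(1000, 4))

# RBF kernel, gamma = 0.1, nu = 0.1 (the assumed 10% of mislabelled PWs).
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
ocsvm.fit(X_p0)

labels = ocsvm.predict(X_p0)      # +1 = consistent with P0, -1 = outlier
print((labels == -1).sum(), "of", len(X_p0), "P0 prosodic words flagged as anomalous")
```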
4.2 Phase Two: Outliers Classification
By detecting the anomalies in the P0 prosodemes, we obtained a group of outliers whose F0 does not have a "neutral" contour. These outliers can be either strongly
penalized to exclude them from the speech synthesis process, as described in Sect. 5.1 (see [14]), or classified into another prosodeme class – as mentioned in Sect. 3, apart from P0, we distinguish 4 other prosodeme types: P1, P2, P3 and P0.1. To perform the multi-class classification of the P0 outliers, we had to train an appropriate model for each corpus. We collected all available prosodeme data from one corpus to cover all the prosodeme types and then trained a Support Vector Classifier (SVC) from scikit-learn as our multi-class model. SVC uses a one-vs-all decision function, and since our data are not evenly distributed among all types of prosodemes, we set the class weight parameter to "balanced", which means the weight of each class is adjusted inversely proportionally to the class frequencies in the input data. As in the previous case, we were working on the closed dataset and therefore we could use all data to train the classification model. The classification and relabelling of the P0 outliers was done again for both corpora. We classified 3,578 outliers in the AJ corpus and 4,174 outliers in the MR corpus; the classification results are listed in Table 3.
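Phase two can be sketched in the same spirit; again the data below are random placeholders standing in for the Legendre features and prosodeme labels of one corpus:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: four Legendre coefficients and prosodeme labels for all
# prosodic words of one corpus, plus the P0 outliers found in phase one.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(200, 4))
y_all = rng.choice(["P0", "P1", "P2", "P3", "P0.1"], size=200)
X_outliers = rng.normal(size=(10, 4))

# Balanced class weights compensate for the uneven prosodeme distribution (Table 1).
svc = SVC(class_weight="balanced")
svc.fit(X_all, y_all)
print(svc.predict(X_outliers))    # proposed prosodeme type for each P0 outlier
```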
3,578
MR
4,174
P1
P2
P3
1,559 (43.6%) 189 (5.3%) 328 (9.2%) 328 (9.2%)
P0.1 1,174 (32.8%)
988 (23.7%) 385 (9.2%) 145 (3.5%) 817 (19.6%) 1,839 (44.1%)
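A minimal sketch of this classification step is shown below; the toy data and class proportions are ours, while the "balanced" class-weight setting follows the description above.

```python
import numpy as np
from sklearn.svm import SVC

# X: Legendre-coefficient vectors for all prosodeme types in one corpus,
# y: their prosodeme labels. Random toy data only.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = rng.choice(["P0", "P1", "P2", "P3", "P0.1"],
               size=2000, p=[0.7, 0.1, 0.05, 0.05, 0.1])

# "balanced" re-weights each class inversely to its frequency, which
# compensates for the dominance of P0 in the training data.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)

# Relabel the P0 outliers found by the OCSVM (here: toy vectors).
X_outliers = rng.normal(size=(5, 4))
print(clf.predict(X_outliers))
```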
It is obvious that most of the P0 outliers (76.3%) from the MR corpus were labelled as a different type of prosodeme. However, 23.7% of them were given the P0 label again. These outliers were picked by the OCSVM model as anomalies because their properties were somehow different from the other P0 data. Nevertheless, the properties of these outliers are still more similar to the P0 prosodeme than to another prosodeme type, hence the SVC labelled them as P0. The situation for the AJ corpus is analogous, with the difference that even more outliers were labelled back as P0. This is probably caused by the different prosody consistency of each corpus. The intonation of the AJ speaker was more consistent and precise compared to the MR speaker, and therefore the classifier assigned the P0 type back more often than in the case of the MR corpus. The evaluation of the prosodeme corrections is further described in Sect. 5.2.
5 Evaluation
To evaluate the process proposed in Sect. 4, we carried out two listening tests (see Sects. 5.1 and 5.2) in our new listening test framework. Both tests had the same structure; both were 3-scale preference listening tests. The listeners compared sentences generated by our baseline TTS system ARTIC (with the original corpora, TTS-base) and those generated by a modified system TTS-new
built on the fixed corpora (based on the classification described in Sects. 4.1 and 4.2, respectively). They were instructed to use earphones and to compare the overall quality of samples A and B in each pair by selecting one of these options:
– Sentence A sounds better.
– I cannot decide.
– Sentence B sounds better.
The answers were normalized for each listener and pair of samples in the listening test to p = 1 where the TTS-new output was preferred, to p = −1 where the TTS-base output was preferred and p = 0 otherwise. These values were used for the final computation of the listening test score s, defined by Eq. 2,

s = \frac{\sum_{p \in T} p}{|T|},    (2)

where T is the set of all answers from all listeners. A positive value of s indicates an improvement of the overall quality when using TTS-new.
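As an illustration, the score s of Eq. 2 can be computed from the already normalized answers as follows; the function name is ours.

```python
def listening_test_score(answers):
    """Compute the preference score s from normalized listener answers.

    answers: iterable of values in {+1, 0, -1}, where +1 means the
    TTS-new output was preferred, -1 the TTS-base output, 0 no preference.
    """
    answers = list(answers)
    return sum(answers) / len(answers)

# Toy example: 6 preferences for TTS-new, 2 for TTS-base, 2 undecided.
print(listening_test_score([1] * 6 + [-1] * 2 + [0] * 2))  # 0.4
```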
5.1 Evaluation of Phase One
First, we evaluated phase one, the anomaly detection using OCSVM described in Sect. 4.1, directly in the unit selection speech synthesis itself [14]. In this evaluation, the modified TTS-new represents a system which highly penalizes units originating from anomalous PWs (those detected by OCSVM) during the Viterbi search [21]. This "ban" should ensure that these "strange" (anomalous) units are not used in the synthesis, which may, hopefully, increase the naturalness of the synthetic speech. On the other hand, about 10% of all P0 units are dropped by this approach – this should, however, not be a big problem since the corpora are quite large and were carefully designed [12] to cover all the different units sufficiently. In any case, this approach results in a different sequence of units compared to that generated by TTS-base. To select the sentences for the listening test, we synthesized 6,000 sentences with TTS-base and TTS-new and randomly selected 20 sentences for each voice so that they fulfilled the criterion of containing 8 or more anomalous units (similarly to [19], but with the number of anomalous unit occurrences in the TTS-base sentences as the selection criterion in this experiment). Thus, the whole listening test contained 40 pairs of synthesized sentences; each pair included two variants of the same sentence – one generated by TTS-base and one generated by the modified system TTS-new. The results of the listening test, obtained from 16 listeners (5 of them being speech experts), are presented in Table 4. TTS-new was preferred for both voice corpora, and the results are statistically significant (as shown in [14]). The positive score values s indicate that penalizing outlier speech units (those originating from PW outliers detected by OCSVM using Legendre polynomial coefficients) leads to more natural synthetic speech.
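The actual target-cost implementation of ARTIC is not part of this excerpt; the following is only a simplified sketch of the "ban" idea, in which candidate units originating from anomalous prosodic words receive a prohibitive extra cost. The attribute and function names are hypothetical.

```python
ANOMALY_PENALTY = 1e6  # prohibitively large compared to regular costs

def penalized_target_cost(unit, target_spec, base_target_cost, anomalous_pw_ids):
    """Add a prohibitive penalty for candidate units from anomalous prosodic words.

    unit: candidate unit with a .pw_id attribute identifying its source
          prosodic word (hypothetical attribute name).
    base_target_cost: the original target-cost function of the system.
    anomalous_pw_ids: set of prosodic-word ids flagged by the OCSVM.
    """
    cost = base_target_cost(unit, target_spec)
    if unit.pw_id in anomalous_pw_ids:
        cost += ANOMALY_PENALTY  # effectively bans the unit from selection
    return cost
```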
Table 4. Results of the first listening test.

Corpus  TTS-base better  Same quality  TTS-new better  Score s
AJ      62 (19.4%)       76 (23.7%)    182 (56.9%)     0.375
MR      104 (32.5%)      79 (24.7%)    137 (42.8%)     0.103
Total   166 (25.9%)      155 (24.2%)   319 (49.9%)     0.239

5.2 Evaluation of Phase Two
The results presented in the previous section indicate an improvement in the quality of speech synthesis when the units originating from P0 words detected as outliers by OCSVM are penalized. However, in phase two the outliers were relabelled by a multi-class SVM classifier (described in Sect. 4.2), so they can be used in the synthesis with the newly assigned labels. In this case, TTS-new uses the same penalization of a prosodeme-type mismatch in the target cost computation as the baseline TTS-base; the only difference between the two systems is the data with prosodeme labels – TTS-new uses the relabelled speech corpora, TTS-base the original speech corpora presented in Sect. 3. Again, when designing sentences for the second listening test, we followed the methodology described in [19], with the selection criterion based on the number of relabelled unit occurrences in the TTS-new sentences. By this procedure, we randomly selected 10 sentences for each non-null prosodeme type for both voices (80 sentences in total) to find out how the relabelled units perform in new prosodic contexts. This listening test was completed by 16 listeners, 6 of them being speech synthesis experts. The results listed in Table 5 show that the relabelled prosodemes did not cause any serious problem in the synthesized sentences; the outputs of TTS-new were sometimes even rated much better by the listeners than the TTS-base outputs.
6 Conclusion and Future Work
In the presented paper, we examined the usage of the Legendre polynomials for the correction of a formal prosody grammar. The corpora we have been working with contained inconsistencies in the prosody description – some prosodic words were labelled as "neutral" (P0) in terms of prosody even though their F0 did not have a neutral contour. Therefore, we proposed a two-phase correction method to correct these wrongly labelled prosodemes. To represent our data, we took only the first four coefficients of the Legendre polynomials and then trained a One-Class Support Vector Machine (OCSVM) detector and a multi-class Support Vector Classifier (SVC). In the first phase, outliers among the P0 prosodemes were detected by the OCSVM and then, in the second phase, we classified them with the multi-class SVC to obtain new labels for the P0 outliers. Afterwards, we conducted
Table 5. Results of the second listening test.

Corpus             Prosodeme  TTS-base better  Same quality  TTS-new better  Score s
AJ                 P0.1       18 (11.3%)       111 (69.4%)   31 (19.4%)      0.081
                   P1         25 (15.6%)       103 (64.4%)   32 (20.0%)      0.044
                   P2         24 (15.0%)       46 (28.8%)    90 (56.3%)      0.413
                   P3         38 (23.8%)       65 (40.6%)    57 (35.6%)      0.119
MR                 P0.1       17 (10.6%)       124 (77.5%)   19 (11.9%)      0.013
                   P1         44 (27.5%)       68 (42.5%)    48 (30.0%)      0.025
                   P2         26 (16.3%)       72 (45.0%)    62 (38.8%)      0.225
                   P3         49 (30.6%)       47 (29.4%)    64 (40.0%)      0.094
AJ corpus - total             105 (16.4%)      325 (50.8%)   210 (32.8%)     0.164
MR corpus - total             136 (21.5%)      311 (48.6%)   193 (30.2%)     0.089
Total                         241 (18.8%)      636 (49.7%)   403 (31.5%)     0.127
two listening tests to evaluate the benefit of this approach. In the first test, we verified that the synthetic speech sounds better if the anomalous P0 prosodemes are not used. In the second test, we found out that if we relabel the anomalies to a different prosodeme type, we can still use them and the quality of speech does not decrease. Hence, we do not need to penalize the anomalies or throw them away, which would be a waste of data. Furthermore, in some cases the synthesized speech even gets better with these relabelled prosodemes. As future work, we would like to test this method on our other corpora (Czech, English, Russian, etc.) and we also want to compare the quality of speech synthesized without all the anomalies with that using the relabelled variants of them.

Acknowledgements. The work has been supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References 1. Boˇril, T., Skarnitzl, R.: Tools rPraat and mPraat. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 367–374. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5 42 2. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009) 3. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software http://www.csie.ntu. edu.tw/∼cjlin/libsvm
4. Grabe, E., Kochanski, G., Coleman, J.: Connecting intonation labels to mathematical descriptions of fundamental frequency. Lang. Speech 50(Pt 3), 281–310 (2007) 5. Hanzl´ıˇcek, Z.: Classification of prosodic phrases by using HMMs. In: Kr´ al, P., Matouˇsek, V. (eds.) TSD 2015. LNCS (LNAI), vol. 9302, pp. 497–505. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24033-6 56 6. Hanzl´ıˇcek, Z.: Correction of prosodic phrases in large speech corpora. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 408–417. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5 47 7. Hanzl´ıˇcek, Z., Gr˚ uber, M.: Initial experiments on automatic correction of prosodic annotation of large speech corpora. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2014. LNCS (LNAI), vol. 8655, pp. 481–488. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10816-2 58 8. J˚ uzov´ a, M., Tihelka, D., Vol´ın, J.: On the extension of the formal prosody model for TTS. In: TSD. Lecture Notes in Computer Science. Springer (2018) 9. Legendre, A.M.: Recherches sur l’attraction des sph´ero¨ıdes homog`enes. In: M´emoires de math´ematique et de physique, present´es ` a l’Acad´emie royale des sciences, par divers s¸cavans & lˆ us dans ses assembl´ees, Paris, pp. 411–435 (1785) 10. Matouˇsek, J., Leg´ at, M.: Is unit selection aware of audible artifacts? In: SSW 2013. Proceedings of the 8th Speech Synthesis Workshop, pp. 267–271. ISCA, Barcelona, Spain (2013) 11. Matouˇsek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006). https://doi.org/10. 1007/11846406 55 12. Matouˇsek, J., Tihelka, D., Romportl, J.: Building of a speech corpus optimised for unit selection TTS synthesis. In: LREC 2008, pp. 1296–1299. ELRA, Marrakech, Morocco (2008) 13. Matouˇsek, J., Tihelka, D.: Anomaly-based annotation errors detection in tts corpora. In: INTERSPEECH, pp. 314–318. ISCA, Dresden, Germany (2015) 14. Matura, M., J˚ uzov´ a, M.: Using anomaly detection for fine tuning of formal prosodic structures in speech synthesis. In: TSD. Lecture Notes in Computer Science, Springer (2018) ˇ 15. Palkov´ a, Z.: Rytmick´ a, v´ ystavba prozaick´eho textu. Studia CSAV; ˇc´ıs. 13/1974. Academia (1974) 16. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 17. Romportl, J.: Structural data-driven prosody model for TTS synthesis. In: Proceedings of the Speech Prosody 2006 Conference, pp. 549–552. TUDpress, Dresden (2006) 18. Romportl, J., Matouˇsek, J.: Formal prosodic structures and their application in NLP. In: Matouˇsek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 371–378. Springer, Heidelberg (2005). https://doi.org/10. 1007/11551874 48 19. Tihelka, D., Gr˚ uber, M., Hanzl´ıˇcek, Z.: Robust methodology for TTS enhancement evaluation. In: Habernal, I., Matouˇsek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 442–449. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3642-40585-3 56 20. Tihelka, D., Hanzl´ıˇcek, Z., J˚ uzov´ a, M., V´ıt, J., Matouˇsek, J., Gr˚ uber, M.: Current state of text-to-speech system ARTIC: A decade of research on the field of speech technologies. In: TSD. Lecture Notes in Computer Science (2018)
21. Tihelka, D., Kala, J., Matouˇsek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: INTERSPEECH, pp. 174–177. ISCA, Makuhari, Japan (2010) 22. Tihelka, D., Matouˇsek, J.: Unit selection and its relation to symbolic prosody: a new approach. In: INTERSPEECH, vol. 1, pp. 2042–2045. ISCA, Bonn (2006) 23. Vol´ın, J., Tykalov´ a, T., Boˇril, T.: Stability of prosodic characteristics across age and gender groups. In: INTERSPEECH, pp. 3902–3906. ISCA, Stockholm, Sweden (2017) 24. Vol´ın, J.: Extrakce z´ akladn´ı hlasov´e frekvence a intonaˇcn´ı gravitace v ˇceˇstinˇe. Naˇse ˇreˇc 92(5), 227–239 (2009)
On the Contribution of Articulatory Features to Speech Synthesis

Martin Matura(B), Markéta Jůzová, and Jindřich Matoušek

Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
{mate221,juzova,jmatouse}@kky.zcu.cz

Abstract. There are several features that are used for unit selection speech synthesis. Among those most used for computing a concatenation cost are energy, F0 and Mel-frequency cepstral coefficients (MFCC), which usually give a good description of a speech signal. In our work, we focus on the usage of articulatory features. We want to determine whether they are correlated with MFCC and, in that case, whether they can replace MFCC or bring new information into the process of speech synthesis. To obtain the articulatory data, we used the electromagnetic articulograph AG501 and then examined the correlation of two sequences of join costs, each described by different features.

Keywords: Articulatory features · Electromagnetic articulograph · Join cost · Correlation

1 Introduction
In unit selection speech synthesis, a good description of speech units is crucial for a high quality of the resulting synthesized speech. The process of choosing the best unit sequence is controlled by the Viterbi search [18], a searching algorithm based on finding the lowest-cost path through a graph with concatenation costs (join cost, JC) on edges and target costs (TC) on nodes. Fundamental frequency, energy and Mel-frequency cepstral coefficients (MFCC) are the most common features for the join cost computation and indicate how well the neighbouring units can be joined together [2,9] – in other words, they ensure a smooth transition between units regarding prosodic and acoustic features. On the other hand, the target cost ensures the selection of an appropriate unit for the required position, regarding also prosodic and phonetic contexts. Speech itself, when created by a human, is basically a result of appropriate movements of the human articulators (lips, tongue, palate, etc.) in the form of an airflow. The airflow is shaped by the articulators according to the sounds present in the produced speech. The articulatory data, obtained by an electromagnetic articulograph, represent the movement and changes in the position of the human articulators; hence, they are promising candidates for additional features describing a speech unit. Apart from MFCC, which describe a frequency spectrum that
is closely related to the speech but can be inaccurate due to masking effects, the articulatory data capture directly the changes between the articulators which create the speech, and that is why they could contribute to the selection of better-related speech units. There are not many studies reporting the usage of articulatory features in unit selection speech synthesis. Nevertheless, in the recent study [15], the quality of speech synthesis was tested with different types of features – articulatory, acoustic and articulatory-acoustic – and the study shows that articulatory features have the potential to become another reasonable set of features used for the join cost computation. Articulatory data have also been used for the speech recognition task [22], and there are many studies concerning acoustic-to-articulatory inversion mapping, e.g. [8,19,20]. Unfortunately, based on our experience, obtaining a set of articulatory features is a quite demanding task (Sect. 2). Therefore, in the first instance, we want to find out whether there is a dependence between articulatory features (AF) and MFCC and whether AF can bring new information into the process of speech synthesis of the Czech language. Once we prove the contribution of AF (new information compared to MFCC), the usage of articulatory features will be tested in our TTS system ARTIC [11,17], because we expect it may improve the selected speech unit sequence, as reported in [15]. For this purpose, an electromagnetic articulograph was used to record our own dataset, since there is no publicly available articulatory corpus for the Czech language, like MOCHA-TIMIT [21] or mngu0 [14] for English or MSPKA [1] for Italian. The scope of this paper covers the testing of the dependence of MFCC and AF. Since these features are hardly comparable (AF represent the real movements of human articulators during speech, MFCC are computed from the frequency spectrum), they are used for the join cost computation of the sequence of units during speech synthesis. The resulting sequences of the two different join costs are used for the correlation coefficient computation.

1.1 Join Cost in Unit Selection
As described in [9], the join cost in unit selection speech synthesis, which ensures a smooth transition of speech units during the Viterbi search [18], consists of three sub-components – the difference in energy (E), the difference in fundamental frequency F0 (together constituting the prosodic component of the join cost) and the Euclidean distance of 12 MFCC, the acoustic component of JC. The values of all features are calculated pitch-synchronously [6,7,10] and the total join cost is calculated as an average of the values of the sub-components. For two unit candidates c_{i-1,j} and c_{i,k} (for units u_{i-1} and u_i), the join cost is defined as follows (Eq. 1):

JC(c_{i-1,j}, c_{i,k}) = w_{F0} \cdot JC_{F0}(c_{i-1,j}, c_{i,k}) + w_{E} \cdot JC_{E}(c_{i-1,j}, c_{i,k}) + w_{MFCC} \cdot JC_{MFCC}(c_{i-1,j}, c_{i,k}),    (1)
where c_{i-1,j} is the j-th candidate for the unit u_{i-1} in the synthesized sentence and c_{i,k} is the k-th candidate for the unit u_i. For two unit candidates c_{i-1,j}, c_{i,k}, the MFCC join cost component JC_{MFCC} is defined by Eq. 2 as the Euclidean distance of the 12-dimensional MFCC vectors:

JC_{MFCC}(c_{i-1,j}, c_{i,k}) = \sqrt{\sum_{n=1}^{12} \left(c_{i-1,j}(n) - c_{i,k}(n)\right)^{2}}    (2)
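For illustration, a minimal sketch of Eqs. 1 and 2 is given below; the exact form of the F0 and energy sub-costs is not specified in the text, so simple absolute differences are assumed here, and the weights are placeholders.

```python
import numpy as np

def jc_mfcc(mfcc_prev, mfcc_next):
    """Euclidean distance between the 12-dimensional MFCC vectors of two candidates (Eq. 2)."""
    return float(np.linalg.norm(np.asarray(mfcc_prev) - np.asarray(mfcc_next)))

def join_cost(cand_prev, cand_next, w_f0=1.0, w_e=1.0, w_mfcc=1.0):
    """Weighted join cost of Eq. 1 for two unit candidates.

    Each candidate is assumed to be a dict with 'f0', 'energy' and 'mfcc'
    entries at the join boundary; the absolute-difference sub-costs and the
    weights are assumptions made for this sketch.
    """
    jc_f0 = abs(cand_prev["f0"] - cand_next["f0"])
    jc_e = abs(cand_prev["energy"] - cand_next["energy"])
    return (w_f0 * jc_f0
            + w_e * jc_e
            + w_mfcc * jc_mfcc(cand_prev["mfcc"], cand_next["mfcc"]))
```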
When a synthesized sentence is itself included in the speech corpus, the unit candidates originating from that recorded sentence must be selected, because the best possible quality of synthesized speech is obtained by simply playing back the original speech as long as the target specification also matches – this is the basic (and the most obvious) requirement for the selection algorithm. Hence, the costs have to be defined to equal 0 for neighbouring units. The energy and F0 values are continuous in continuous speech (except for unvoiced segments) and the Euclidean distance of the MFCC vectors of two neighbouring units is also zero. Naturally, the same principle holds true for the articulatory features (the courses of the sensor coordinates) – so the distance of the coordinates could, without any doubt, be used in the JC computation as a new component JC_{AF}.
2 Data Acquisition
The electromagnetic articulograph allows a digital recording and representation of articulatory movements over time during the process of speech creation. It uses induction coils above a speaker's head that produce an electromagnetic field. This field induces a current in tiny coils (sensors) in the mouth of the speaker, which allows us to determine the location of the sensors; this issue is described in more detail in [5]. It is important for the speaker to keep the head inside the spherical measuring area under the induction coils, otherwise the results can be distorted. To obtain the articulatory trajectories for our research, we use the 3D electromagnetic articulograph AG501 (EMA), which is more precise than the AG500 [16], with seven sensors attached to the speaker by a physiological adhesive and a sampling frequency of 250 Hz. Three sensors are used as reference points and four sensors measure the trajectories of the articulators. The reference sensors are glued to places on the speaker's head which do not move while he/she is speaking (upper incisors, temporal bone behind the ears). They are used to capture the head movements and afterwards for a subsequent post-processing of the articulation data, since it is necessary to perform head-correction calculations to eliminate the head movements from the articulatory trajectories. The remaining four sensors measure the articulatory trajectories and are attached to the lower incisors (LI), tongue tip (TT), tongue body (TB) and tongue dorsum (TD), as shown in Fig. 1. Articulation data are usually also obtained from sensors placed on the lower and the upper lip; unfortunately, we did not have enough sensors to measure the lip positions.
Fig. 1. Midsagittal view of a human mouth with the placement of the EMA sensors which capture the articulators' trajectories.
As also reported e.g. in [1,13], the process of recording with EMA is not a simple task for speakers. The main problem lies in the detachment of the measuring sensors from the articulators – over the recording time, the physiological adhesive starts to peel away from the soft tissue due to the constant movement and friction inside the mouth. Once a sensor falls off, it is practically impossible to place it back on the exact same spot as before and the recording has to be stopped. Because of this issue, it is important to properly select the sentences for the recording, since only hundreds of sentences have to sufficiently cover all required speech units. The authors decided to use a high-coverage multi-level Czech text corpus designed for a voice banking process of laryngectomized patients [4]; the text corpus building process is described in detail in [3]. The primary requirement for that set of sentences was to maximize the coverage of appropriate speech units, no matter how many sentences would finally be recorded, since there was a limited time for recording these patients. The building of synthetic voices for the patients (lasting for several years at the authors' department) has proved that the unit selection method can be used with only approximately four or five hundred recorded sentences (depending on the patient's voice quality). In any case, the main idea of the text corpus building perfectly matches the issue of articulatory data recording – nobody knows in advance how long the speaker will be able to record with all the sensors attached. The msak0 speaker in MOCHA-TIMIT uttered 460 sentences, but during the session some sensors had to be re-attached. Thus, to ensure the longest possible recording time, it is also very important to carefully prepare the speaker's articulators before attaching the sensors to them. The pilot (female) professional speaker, whose data were used for the presented experiment, was asked to brush her teeth and tongue first, then we dry-cleaned the tongue and glued the sensors to the desired positions shown in Fig. 1. Nevertheless, in spite of our careful preparation, we were not able to record more than 380 sentences (35 min of speech data) in one continuous recording session without one of
the sensors coming off. As reported in [14], over 1300 sentences (67 min of speech data) were recorded there in one recording session – therefore, we are now working on some improvements of the recording and pre-recording process to be able to obtain more speech data for our future experiments. After the recording, a database of speech units with articulatory features was created. We removed noise from the recordings, applied the head-correction post-processing to the articulatory data and then assigned the data to the corresponding speech units. As the articulatory features, we selected only the X and Y coordinates (Fig. 2) of all four sensors (LI, TT, TB, TD). Note that the rotations of the sensor coils were not considered and we decided to leave out the Z coordinates since they did not show much movement – side-to-side movement of the articulators is not very common during speech production. We also
Fig. 2. Trajectories of X and Y coordinates of 4 sensors – LI, TT, TB, TD.
performed an automatic segmentation of the recorded sentences [12] and generated the unit selection features. The prepared database was used in the experiment described in the following sections. Figure 2 shows 224 ms of the X and Y coordinate contours of the articulator sensors (56 values of the EMA sensors at the sampling frequency of 250 Hz). The presented speech segment corresponds to two phonemes ([i], [s]); the vertical line represents the boundary determined by the automatic pitch-synchronous segmentation of the recorded data [12].
3 Experiment
In recent years, articulatory features have started to be used both for acoustic-to-articulatory mapping (whose results are subsequently used in speech synthesis) [8,19,20] and for the unit selection itself as new features (to replace or extend MFCC) [15]. However, to the best of the authors' knowledge, there is no reported study concerning the correlation of articulatory features (AF) and MFCC – whether the AF really bring new information into the process of speech synthesis and thus whether it is really worth using them. Hence, we decided to test the contribution of AF and performed a correlation comparison to find differences or similarities in MFCC and AF behaviour. Due to the difference between these two features, it makes little sense to just compare their contours in the recorded sentences – the AF represent the real movements of human articulators, while MFCC are computed from the speech frequency spectrum by applying the mel triangular filters. Moreover, we want to compare their contribution in the speech synthesis itself, so we decided to compute the correlation coefficient of the sequences of join costs JC representing a synthesized sentence – first with the join cost computed using MFCC and then using AF. Since the prosodic components JC_{F0} and JC_{energy} of the join cost are not related to the third one (and we want to omit their influence from the total JC computation), we focused only on one join cost sub-component – JC_{MFCC} and JC_{AF}, respectively. For the two unit candidates c_{i-1,j} and c_{i,k}, JC_{MFCC} was defined in Sect. 1.1 by Eq. 2, and the concatenation cost component characterized by AF was calculated as the mean of the Euclidean distances of the corresponding X and Y coordinates (dimension n is 2) of the 4 articulatory features (dimension m is 4):

JC_{AF}(c_{i-1,j}, c_{i,k}) = \frac{1}{4} \sum_{m=1}^{4} \sqrt{\sum_{n=1}^{2} \left(c_{i-1,j}(m,n) - c_{i,k}(m,n)\right)^{2}}    (3)

To be able to compute the correlation of JC_{MFCC} and JC_{AF}, it was necessary to have sequences of units of the synthesized sentences described by both JC_{MFCC} and JC_{AF}. To do that, we first synthesized (in a scripting interface of our TTS system ARTIC [17]) a set of randomly selected text sentences using only the acoustic component for the join cost computation:

JC := JC_{MFCC}.    (4)
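A minimal sketch of the JC_{AF} component of Eq. 3 is shown below; it assumes the X and Y coordinates of the four sensors are available as 4×2 arrays at the join boundary, with made-up toy values.

```python
import numpy as np

def jc_af(coords_prev, coords_next):
    """Mean Euclidean distance over the four EMA sensors (Eq. 3).

    coords_prev, coords_next: arrays of shape (4, 2) holding the X and Y
    coordinates of the LI, TT, TB and TD sensors at the join boundary.
    """
    a = np.asarray(coords_prev, dtype=float)
    b = np.asarray(coords_next, dtype=float)
    # Euclidean distance per sensor (over X, Y), then mean over the 4 sensors.
    return float(np.mean(np.linalg.norm(a - b, axis=1)))

# Toy example with made-up sensor positions (in mm).
prev = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
nxt  = [[0.5, 1.0], [2.0, 3.5], [4.0, 5.0], [6.0, 7.0]]
print(jc_af(prev, nxt))  # 0.25
```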
Note that the handling of the target cost TC was the same as in the "raw" unit selection in TTS ARTIC, i.e. it is composed of prosodic word position features, phonetic context features and symbolic prosodic features [9]. Then we computed the JC_{AF} costs for the fixed unit sequences from the previous step and the mean values mean_{JC_{AF}} and mean_{JC_{MFCC}} for both sequences. For the correlation of the obtained sequences of JC_{MFCC} and JC_{AF}, Pearson's coefficient r, the most commonly used linear correlation coefficient, defined by Eq. 5, was used:

r = \frac{\sum_{i=1}^{m} (JC_{MFCC,i} - mean_{JC_{MFCC}}) (JC_{AF,i} - mean_{JC_{AF}})}{\sqrt{\sum_{i=1}^{m} (JC_{MFCC,i} - mean_{JC_{MFCC}})^{2}} \sqrt{\sum_{i=1}^{m} (JC_{AF,i} - mean_{JC_{AF}})^{2}}},    (5)

where m is the number of unit concatenations. The correlation coefficient r takes values between +1 and −1, where 1 represents a total positive linear correlation, 0 means no linear correlation, and −1 a total negative linear correlation.
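For illustration, both the plain Pearson correlation of Eq. 5 and the variant without the zero-cost pairs (used later in Sect. 4) can be computed with NumPy as sketched below; np.corrcoef implements the same formula.

```python
import numpy as np

def pearson_r(jc_mfcc_seq, jc_af_seq):
    """Pearson correlation coefficient of two join-cost sequences (Eq. 5)."""
    x = np.asarray(jc_mfcc_seq, dtype=float)
    y = np.asarray(jc_af_seq, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def pearson_r_without_zero_pairs(jc_mfcc_seq, jc_af_seq):
    """Same, but drop positions where both costs are zero (originally neighbouring units)."""
    x = np.asarray(jc_mfcc_seq, dtype=float)
    y = np.asarray(jc_af_seq, dtype=float)
    keep = ~((x == 0) & (y == 0))
    return float(np.corrcoef(x[keep], y[keep])[0, 1])
```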
4 Results
The correlation was tested on ten and on one hundred randomly selected sentences. The resulting correlation coefficients are listed in Table 1.

Table 1. Correlation coefficients of JC_{MFCC} and JC_{AF} sequences (Mean, σ, Minimum and Maximum refer to the correlation coefficient).

Number of sentences  Average length in phonemes  Mean    σ       Minimum  Maximum  Zeros (mean)
10                   42                          0.8524  0.0619  0.7384   0.9115   24
100                  41                          0.8109  0.0648  0.6403   0.9204   22
The high mean values of the correlation coefficient r and the small standard deviations in the table show quite a large dependency between the sequences of JC_{MFCC} and JC_{AF}. At first, this was quite surprising, because we expected articulatory features to carry different information than MFCC, but the high values suggested otherwise. However, the reason for this large dependence is hidden in the basic principle of unit selection – the zero cost values for neighbouring units (see Sect. 1.1). The unit selection algorithm tries to select the most suitable units to be concatenated, and the best units are those which were originally neighbours – such units have a join cost equal to zero no matter what features are used for the join cost computation. Those units can noticeably distort the result of the correlation coefficient computation – if two zeros are compared, the correlation is always equal to 1. We found out that the sequences of concatenation costs consist of more than 50% zeros (due to the concatenation of neighbouring units), as shown
in Table 1. It is obvious that this huge amount of zeros could noticeably increase the correlation of the two sequences, so we removed the corresponding zeros from both sequences and recalculated the correlation coefficients – the results are presented in Table 2. Now the values in the table are close to 0, which indicates that AF and MFCC are not correlated.

Table 2. Correlation coefficients of JC_{MFCC} and JC_{AF} sequences without zeros.

Number of sentences  Mean     σ       Minimum  Maximum
10                   0.1755   0.3158  −0.3047  0.5400
100                  −0.0068  0.2497  −0.4872  0.5399
The sequences of JC_{AF} and JC_{MFCC} for two selected sentences are also drawn in Fig. 3 – one sentence with the maximal and the other with the minimal value of the correlation coefficient r. It can be clearly seen from the figures that the sequences of costs are not much correlated – some higher JC_{MFCC} values correspond to lower JC_{AF} values and vice versa (we intentionally connected the non-zero values in the graphs by a line so that the very small correlation of the data is obvious to the reader). These illustrations, together with the results listed in Table 2, show that the MFCC and AF features are not correlated and indicate that the articulatory data are able to bring new information into the speech synthesis process; their usage might therefore improve the overall quality of the synthesized sentences.
Fig. 3. Comparison of JC_{AF} and JC_{MFCC} sequences. The x-axis represents the index of the unit transition in the synthesized sentence.
5 Conclusion
The presented paper tries to answer the question whether real articulatory data bring any new information to the process of speech synthesis when compared to the acoustic MFCC features used in speech synthesis for decades. We have managed to show a certain independence of these two different features by Pearson's correlation coefficient values close to zero for randomly selected sentences. Now, as we have confirmed the usefulness of the articulatory features, we are working on the utilization of AF in our Czech TTS system ARTIC, both as a replacement of MFCC and as an enlargement of the feature space, similarly to [15]. However, the limited amount of recorded data does not allow us to perform acceptable experiments yet, so we are improving the recording and pre-recording process and planning to record more voices with the EMA sensors. Future work also includes experiments with acoustic-to-articulatory mapping to gain more data.

Acknowledgments. This research was supported by the Czech Science Foundation (GA CR), project No. GA16-04420S, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References 1. Canevari, C., Badino, L., Fadiga, L.: A new Italian dataset of parallel acoustic and articulatory data. In: INTERSPEECH. ISCA (2015) 2. Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: ICASSP, vol. 1, pp. 373–376. IEEE (1996) 3. J˚ uzov´ a, M., Tihelka, D., Matouˇsek, J.: Designing high-coverage multi-level text corpus for non-professional-voice conservation. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 207–215. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7 24 4. J˚ uzov´ a, M., Tihelka, D., Matouˇsek, J., Hanzl´ıˇcek, Z.: Voice conservation and TTS system for people facing total laryngectomy. In: INTERSPEECH. ISCA (2017) 5. Kaburagi, T., Wakamiya, K., Honda, M.: Three-dimensional electromagnetic articulography: a measurement principle. J. Acoust. Soc. Am. 118(1), 428–443 (2005) 6. Leg´ at, M., Matouˇsek, J., Tihelka, D.: A robust multi-phase pitch-mark detection algorithm. INTERSPEECH 1, 1641–1644 (2007) 7. Leg´ at, M., Matouˇsek, J., Tihelka, D.: On the detection of pitch marks using a robust multi-phase algorithm. Speech Commun. 53(4), 552–566 (2011) 8. Liu, Z.C., Ling, Z.H., Dai, L.R.: Articulatory-to-acoustic conversion with cascaded prediction of spectral and excitation features using neural networks. In: INTERSPEECH, pp. 1502–1506. ISCA (2016) 9. Matouˇsek, J., Leg´ at, M.: Is unit selection aware of audible artifacts? In: SSW 2013, Proceedings of the 8th Speech Synthesis Workshop, pp. 267–271. ISCA, Barcelona (2013)
10. Matouˇsek, J., Tihelka, D.: Classification-based detection of glottal closure instants from speech signals. In: INTERSPEECH, pp. 3053–3057. ISCA (2017) 11. Matouˇsek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006). https://doi.org/10. 1007/11846406 55 12. Matouˇsek, J., Romportl, J.: Automatic pitch-synchronous phonetic segmentation. In: INTERSPEECH, pp. 1626–1629. ISCA (2008) 13. Richmond, K.: A multitask learning perspective on acoustic-articulatory inversion. In: INTERSPEECH, pp. 2465–2468. ISCA, August 2007 14. Richmond, K., Hoole, P., King, S.: Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In: INTERSPEECH. ISCA (2011) 15. Richmond, K., King, S.: Smooth talking: articulatory join costs for unit selection. In: ICASSP, pp. 5150–5154. IEEE (2016) 16. Stella, M., Stella, A., Sigona, F., Bernardini, P., Grimaldi, M., Fivela, B.G.: Electromagnetic articulography with AG500 and AG501. In: INTERSPEECH, pp. 1316–1320. ISCA (2013) 17. Tihelka, D., Hanzl´ıˇcek, Z., J˚ uzov´ a, M., V´ıt, J., Matouˇsek, J., Gr˚ uber, M.: Current state of text-to-speech system ARTIC: A decade of research on the field of speech technologies. In: TSD. Lecture Notes in Computer Science. Springer, Heidelberg (2018) 18. Tihelka, D., Kala, J., Matouˇsek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: INTERSPEECH, pp. 174–177. ISCA (2010) 19. Toda, T., Black, A., Tokuda, K.: Acoustic-to-articulatory inversion mapping with gaussian mixture model. In: INTERSPEECH. ISCA (2004) 20. Toutios, A., Margaritis, K.: Acoustic-to-articulatory inversion of speech: a review. In: Proceedings of the International 12th TAINN (2003) 21. Wrench, A.: The mocha-timit articulatory database (1999). database available at http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html 22. Wrench, A.A., Richmond, K.: Continuous speech recognition using articulatory data. In: INTERSPEECH, pp. 145–148. ISCA (2000)
QuARTCS: A Tool Enabling End-to-Any Speech Quality Assessment of WebRTC-Based Calls

Martin Meszaros1,2(B), Franziska Trojahn1,2(B), Michael Maruschke2, and Oliver Jokisch2

1 immmr GmbH, Winterfeldtstraße 21, 10781 Berlin, Germany
{martin.meszaros,franziska.trojahn}@immmr.com
www.immmr.com
2 Institute of Communications Engineering, Leipzig University of Telecommunications (HfTL), Gustav-Freytag-Straße 43-45, 04277 Leipzig, Germany
www.hft-leipzig.de

Abstract. Recently, the use of Web Real-Time Communication (WebRTC) technology in communication applications has been increasing significantly. The users of IP-based telephony require excellent audio quality. However, in WebRTC-based audio calls the audio assessment is challenging due to the specific functioning principles of WebRTC, such as security requirements, the diversity of the endpoints and varying client implementations. In this article, we illustrate the challenges of established methods of audio quality assessment with regard to WebRTC and discuss necessary modifications of the measurement technique. We present the Quality Analyzer for Real Time Communication Scenarios (QuARTCS) as a novel method to overcome the measurement shortcomings and demonstrate its basic functioning on preliminary call samples.

Keywords: WebRTC · Audio quality assessment · Opus codec · VoIP

1 Introduction
The popularity of Internet-based communication is steadily increasing. The demands with regard to quality, availability and type of service have adapted to the changes in daily lifestyle: multiple services have to be available on all devices, from any place and at any time. While voice-based telecommunication is no longer limited to telephones but also available on computers and tablets, it still needs to be easy to use for naive users and to provide interoperability with legacy solutions such as the Public Switched Telephone Network (PSTN). In particular, the prevalence of Voice over IP (VoIP) communication services based on WebRTC is rising significantly. Their success relies on good usability
and the highest possible quality. WebRTC enables Internet Protocol (IP) and web-browser-based real-time communication using audio, video and auxiliary data without additional plugins or software installation. By default, WebRTC utilizes the Opus codec, standardized by the Internet Engineering Task Force (IETF) in RFC 6716 [1]. The Opus codec offers Full High Definition (HD) audio coding by supporting a Fullband (FB) frequency range from 20 Hz to 20 kHz with low delays from 5 ms to 66.5 ms. However, the audio quality depends on several network-related parameters such as network bandwidth, packet loss, delay and jitter. Beyond the network-related parameters, WebRTC exhibits its own configuration of process variables, which may influence the call quality too. Consequently, the overall quality measurement, estimation and adjustment in the network are complex tasks. Therefore, the quality has to be monitored to guarantee a satisfying user experience, represented by e.g. the intelligibility of the call partner, the call continuity and the one-way delay. In this article, we compare several frequently used methods of audio quality assessment. Furthermore, we illustrate the challenges that arise for audio assessment in WebRTC-based communication and provide a novel solution approach for both developers and providers. As a result, application developers can identify the reasons for degraded quality by locating the network segments with the biggest influence instead of only detecting degradations in the overall audio quality. In Sect. 2, we summarize established methods of audio assessment within the described application environment. Section 3 is dedicated to the shortcomings in monitoring WebRTC-based calls. Subsequently, we present the QuARTCS method with its functioning principles for the acquisition of degraded audio signals from Secure Real-Time Transport Protocol (SRTP) streams – captured during an active WebRTC audio call at multiple measurement points – in Sect. 4, followed by preliminary results in Sect. 5 and some conclusions.
2 Methods of Speech Quality Assessment

2.1 Subjective Quality Assessment by Listeners
The ITU-Telecommunication Standardization Sector (ITU-T) recommendation P.800 describes several “methods for subjective determination of transmission quality” [3]. Absolute Category Rating (ACR) listening tests represent a commonly used method, in which the degraded audio signal is played to a group of probands, who rate the quality on a five-point opinion scale. The mean value of all individual ratings is called Mean Opinion Score (MOS)-ACR. Besides listening tests, several instrumental methods for the assessment of audio quality exist. Figure 1 illustrates common steps of two communicating VoIP endpoints (not depicted in the figure) as well as the general functioning principle of subjective and objective audio quality assessments. In contrast to the ACR listening test, where only the degraded audio signal is taken into account during the assessment, a reference-based objective assessment algorithm additionally requires the original reference audio sample.
Fig. 1. Principle of subjective and objective audio quality assessment (derived from Maruschke et al. [2]).
2.2 Objective, Instrumental Quality Assessment
The ITU-T standardized several objective assessment methods for audio quality which do not require a human rater, e.g. the well-known Perceptual Evaluation of Speech Quality (PESQ) algorithm [4], resulting in the measure Mean Opinion Score (MOS)-Listening Quality Objective - Narrowband (LQOn). However, this assessment method is limited to Narrowband (NB) speech with a frequency range from 300 Hz to 3.4 kHz (an extension for the assessment of Wideband (WB) speech with a frequency range from 50 Hz to 7 kHz exists with ITU-T recommendation P.862.2 [5]). Meanwhile, real-time audio codecs enable a frequency range up to FB (e.g. the Opus codec), which also led to advanced audio assessment algorithms such as Perceptual Objective Listening Quality Assessment (POLQA) [6]. The perceptual model of POLQA (defined in ITU-T P.863 version 2) supports Super-Wideband (SWB) speech with a frequency range from 50 Hz to 14 kHz, delivering a MOS-Listening Quality Objective - Super-Wideband (LQOsw) measure. However, studies show that POLQA can even be used for a FB assessment of music or voice signals under certain conditions [2,7]. Recently, an update of ITU-T P.863 was introduced with version 3, which supports speech with a frequency range from 20 Hz to 20 kHz [6]. Apart from that, single-ended assessment methods have been developed which do not require a reference sample and can therefore be utilized in a more flexible way, as they only require access to the receiving communication party. The ITU-T P.563 algorithm from 2004 is the first standardized method supporting a single-ended, objective assessment [8]. However, it allows speech quality assessments for NB telephony only. Beyond the chosen assessment method, VoIP calls pose a challenge, since the degraded audio samples have to be acquired after the network transmission.

2.3 Audio Injection and Recording Methods
To guarantee reproducibility and to minimize a possible influence of the characteristics of the transmitted speech material itself, it is advantageous to inject
prerecorded audio samples into the sending endpoint. Especially reference-based assessment methods require well-defined speech samples. According to the ITU-T recommendation P.863.1, an injection of reference samples in the sending endpoint can be done in three ways [9]:
– Acoustically, by an artificial mouth (from a head and torso simulator) connected to the client [10];
– Electrically, by connecting an audio cable from a playback device to a line input of the client;
– Digitally, by using Application Programming Interface (API) functions of the communication software (browser and web application) or methods provided by the operating system.
Additionally, the audio signal has to be recorded to acquire the degraded audio signal at the receiver's side after the network transmission – basically by utilizing one of the methods described for injection, with slight adaptations. However, performing the recording acoustically requires special equipment, and background noise has to be kept to a minimum to avoid additional distortion of the signal. Recording the degraded sample electrically requires an audio output at the receiving endpoint, for example a sound card with a 3.5 mm line output jack, and an external recorder has to be connected to the endpoint output. A drawback of this method lies in the additional Digital-to-Analog Conversion (DAC) at the receiving endpoint and Analog-to-Digital Conversion (ADC) at the recording device. As the connection between both devices is analog, the transmitted signal is prone to interference through radio waves or ground loops [11]. The digital approach of recording the audio is far less portable between different devices, since modifications of the VoIP endpoint might be necessary. However, the advantage of this method lies in the non-modified recording of the degraded sample, which eliminates the described potential signal distortions. For the digital recording of a VoIP call, one can use an alternative method: in general, the encoded voice is transmitted over the network within Real-Time Transport Protocol (RTP) packets. Thus, one can capture the network traffic with a packet sniffer like Wireshark [12]. To acquire the degraded audio signals, the RTP payload has to be extracted and subsequently decoded, as sketched below. This approach allows an audio recording that is independent of the receiving endpoint.
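As an illustration of the payload-extraction idea, the following sketch reads a capture file with scapy and collects the payloads of an unencrypted RTP stream on a known UDP port; the 12-byte fixed RTP header is parsed manually, CSRC lists, header extensions and sequence-number wrap-around are ignored, and the port number is an assumption. (The SRTP decryption performed by QuARTCS itself is not shown here.)

```python
import struct
from scapy.all import rdpcap, UDP  # pip install scapy

def extract_rtp_payloads(pcap_file, udp_port):
    """Collect RTP payloads (e.g. encoded audio frames) from a capture file.

    Only the 12-byte fixed RTP header is parsed for brevity. Returns a list
    of (sequence_number, ssrc, payload_bytes) tuples, sorted by sequence
    number (wrap-around not handled).
    """
    payloads = []
    for pkt in rdpcap(pcap_file):
        if UDP in pkt and pkt[UDP].dport == udp_port:
            data = bytes(pkt[UDP].payload)
            if len(data) < 12:
                continue
            # RTP fixed header: V/P/X/CC, M/PT, sequence number, timestamp, SSRC.
            _vpxcc, _mpt, seq, _ts, ssrc = struct.unpack("!BBHII", data[:12])
            payloads.append((seq, ssrc, data[12:]))
    payloads.sort(key=lambda item: item[0])
    return payloads

# Hypothetical usage: payloads = extract_rtp_payloads("call.pcap", udp_port=5004)
```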
3 Limitations of Call Assessments in WebRTC
WebRTC is standardized by two major standardization bodies, namely the World Wide Web Consortium (W3C), which is responsible for the JavaScript (JS) API, and the IETF for the corresponding protocols [14,15]. Merely a browser that follows the WebRTC protocol specifications and implements the JS API defined by the W3C [14] is necessary. In some cases, though, WebRTC native applications, so-called "non-browsers", are preferable over WebRTC browsers.
Fig. 2. WebRTC triangle architecture [13].
These WebRTC non-browsers do not require implementations of the JS API but must comply with the protocol specification [15]. A typical variant of the WebRTC architecture is depicted in Fig. 2. Two communication paths exist:
– The signaling path between each WebRTC client and the web/signaling server or servers. Each WebRTC client (in this example provided through web browsers) can also be represented by a non-browser.
– The media path between the communication parties.
The web and signaling servers provide the web application, which can be downloaded by the client, and also handle the signaling flow. The signaling protocol is not standardized and various protocols, including standardized and proprietary ones, can be used, but inter-working with the Session Initiation Protocol (SIP) over a signaling gateway must be possible. Therefore, the WebRTC media negotiation must include a representation of the same semantics as contained in the Session Description Protocol (SDP) offers/answers used in SIP-based VoIP communication [15,16]. The clients in a WebRTC call are named WebRTC endpoints and can either be WebRTC browsers or WebRTC non-browsers. Usually, the media path is established directly between two endpoints in terms of a Peer-to-Peer (P2P) connection. Under certain conditions, for example when symmetric Network Address Translation (NAT) is used, the traffic might be relayed through a Traversal Using Relays around NAT (TURN) server [17]. In all cases, the media data must be sent over SRTP for every channel that is established [18,19]. This means that encryption must be used for the media path and that a cipher suite including a key exchange mechanism is necessary. For WebRTC-based communication, a large variety of end devices (endpoints) can be used. Due to the heterogeneous nature of these end devices
in regard to hardware and software (e.g. operating systems, availability of audio jacks), a universal solution for capturing WebRTC audio signals does not currently exist. Hence, the use of a device-independent recording mechanism is strictly required. As described in Subsect. 2.3, digital recording by capturing the network traffic is a suitable method for a device-agnostic acquisition of audio signals. However, capturing the traffic to acquire the degraded audio signals is still not trivial due to the encryption of the WebRTC-originated media streams.
4 QuARTCS Concept and Tooling

4.1 Design Principles
We developed QuARTCS as a tool which allows the acquisition of degraded audio signals from SRTP streams captured during an active WebRTC audio call at multiple measurement points along the network transmission path, including the receiving endpoint. Consequently, the quality influences from one end to any point in the transmission path can be reflected by audio assessments (End-to-Any (E2A) assessment). The data acquisition includes the decryption of the SRTP packets of the captured stream, the payload extraction and the audio segmentation as preparation for an objective quality assessment, e.g. POLQA.

4.2 Functioning Details
Figure 3 illustrates the functioning principle of QuARTCS. First, a reference sample is injected into the WebRTC application running within the WebRTC client on Endpoint A. At one of the endpoints (A or B), the encryption key, cipher suite and SDP messages have to be obtained (referred to as Endpoint/reference information in Fig. 3). The reference information is logged at Endpoint A. Before a call is established, the traffic capturing has to be started. The capturing can be conducted at any network node in the network path between Endpoint A and B or directly at Endpoint B. This can be accomplished with traffic capturing tools such as Wireshark or tcpdump running alongside the WebRTC application on the endpoint [12,20] (the devices need enough processing power to handle the call as well as the capturing simultaneously to prevent negative effects like packet loss). Within the network path, the traffic can be captured by using a switch with mirroring port functionality, i.e., the actual traffic can be recorded with a third device connected to that mirroring port (cf. Meszaros and Maruschke [21]). During the call, the Reference sample can be looped by the sending endpoint to provide several test samples during one call. Consequently, it is encoded and transmitted over the network by Endpoint A, whilst at the same time the traffic gets captured at the chosen capturing point(s). After the call finishes, the logged reference information and the captured traffic, as well as the Reference sample (if applicable) injected into Endpoint A, are delivered to QuARTCS.
Fig. 3. General functioning principle of QuARTCS.
The Traffic filtering function of QuARTCS then filters the SRTP stream in the direction from Endpoint A to Endpoint B according to information acquired from the SDP message by the Information parsing function. A possible filter condition is the Synchronization Source (SSRC) identifier of the stream [22]. Additionally, the Traffic filtering has to incorporate a jitter buffer, resembling the jitter buffer functionality of the receiving endpoint. In the next step, the filtered SRTP stream is passed to the SRTP decryption function, which uses the key and cipher suite provided by the Endpoint/reference information via the Information parsing function and generates an unencrypted RTP stream. After the decryption, the payload – corresponding to the encoded audio – can be extracted from the RTP stream. Afterwards, the encoded audio is decoded using the Audio decoding function (the decoding function has to incorporate an appropriate decoder for the specific audio codec that was used for the communication session). If one Reference sample is looped throughout the communication session by the sending Endpoint A, the result of the Audio decoding function will be a concatenation of the degraded samples. As a result, this concatenation has to be split into multiple Degraded sample files, each with the same length as the Reference sample, which is accomplished by the Audio manipulation function, as sketched below.
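A minimal sketch of the Audio manipulation step is given below; it assumes the decoded concatenated signal and the reference length are available as NumPy data at the same sampling rate, and it ignores the time alignment that a real implementation would need.

```python
import numpy as np

def split_degraded(concatenated, reference_len, n_repetitions):
    """Cut the decoded, concatenated signal into reference-length degraded samples.

    concatenated: 1-D array with the decoded audio of all looped repetitions.
    reference_len: number of samples of the injected reference sample.
    n_repetitions: how many times the reference was looped during the call.
    """
    segments = []
    for i in range(n_repetitions):
        start = i * reference_len
        segment = concatenated[start:start + reference_len]
        # Zero-pad the last segment if the call was cut slightly short.
        if len(segment) < reference_len:
            segment = np.pad(segment, (0, reference_len - len(segment)))
        segments.append(segment)
    return segments
```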
Finally, the Degraded samples are passed to the quality assessment model. In our example, the full-reference quality assessment model POLQA is utilized to estimate MOS-LQOsw values by comparing the Degraded samples with the Reference sample. Alternatively, a single-ended assessment method such as P.563 can be used instead of POLQA if no reference is available.

4.3 Exemplary Speech Assessment
To verify the functioning of QuARTCS, we conducted a preliminary test with the setup depicted in Fig. 4. Endpoint A (the caller's PC) and all intermediary network devices were interconnected via an Ethernet connection supporting a maximum bit rate of 1 Gbit/s. Endpoint B (the callee's smartphone in different positions) was connected to Access Point 1 via a 2.4 GHz IEEE 802.11n wireless connection. The network traffic was captured simultaneously at Switch 2 via a mirroring port as well as directly at Endpoint B with tcpdump. Thereafter, a WebRTC call was established. During the call, a FB reference speech sample from ITU-T recommendation P.501 [23] was injected digitally into Endpoint A and repeated nine times by using API functions of the web browser. The repetition of the reference sample results in nine degraded samples at each capturing point, which can consequently be compared with the reference sample. After finishing the call, the traffic files from the two capturing points, as well as the Endpoint/reference information (cf. Subsect. 4.2) acquired from Endpoint A, were provided to the PC with QuARTCS and POLQA. The nine degraded samples acquired from each of the two capturing points were evaluated with POLQA version 2.4 in SWB mode by comparing them with the injected speech sample as reference.
Fig. 4. Exemplary test design for verifying the functioning principle of QuARTCS.
5 Results and Discussion
Each of the nine samples, acquired with Switch 2 as capturing point, achieved a MOS-LQOsw of 4.75 – the maximum in POLQA version 2.4. Considering
the samples obtained from Endpoint B as the capturing point, two out of nine achieved a lower rating than the maximum possible: sample 4 was rated with a MOS-LQOsw of 3.88, while sample 8 scored 4.56. By analyzing the captured traffic itself, it can be observed that no packet loss occurred in the network segment between Endpoint A and Switch 2. However, in the network segment between Switch 2 and Endpoint B, several packets were lost during the transmission of sample 4. While sample 8 was transmitted, even slightly more packets were lost. The fact that POLQA nevertheless rated this sample higher than sample 4 can be justified on the grounds that most of the packets were lost during a period of silence that was part of the injected sample. The preliminary tests showed that QuARTCS allows an E2A assessment at multiple measurement points in the network transmission path simultaneously, including the receiving endpoint. This concept enables the identification of the network segments which cause the most significant degradations of the audio signal. As the tooling is accomplished by decrypting, extracting and analyzing the payload of the SRTP traffic, QuARTCS allows a quality assessment which is independent of the endpoint characteristics and the WebRTC client implementation. The function blocks of QuARTCS are strictly modular and can easily be adapted to various audio codecs, provided that a standalone decoder is available. The digital acquisition of the degraded audio samples prevents additional degradation due to the measurement method itself. Additionally, QuARTCS is able to pre-process the degraded audio samples (e.g. providing time alignment) to fulfill the requirements of a specific audio assessment method and is not limited to the usage of a certain assessment method. Established methods such as PESQ, POLQA and ITU-T P.563 can be utilized [4,6,8]. Nevertheless, a challenge lies in the determination of the key required for the decryption of the SRTP packets, depending on the key exchange algorithm within the WebRTC application. For instance, if Session Description Protocol Security Descriptions for Media Streams (SDES) is used for key exchange, the key can be obtained from the SDP messages [24]. However, if Datagram Transport Layer Security (DTLS) is utilized, the acquisition of the key might not be possible without a modification of the WebRTC application [25]. Additionally, the calculation of the one-way delay is not yet possible due to the encryption.
6
Conclusions
In this contribution, different assessment methods for voice call quality were compared, and the limitations of a quality assessment in WebRTC-based audio calls were described. Subsequently, we presented QuARTCS as a novel concept and tooling to enable the assessment of WebRTC calls. We described the general working principles of QuARTCS and demonstrated the basic functioning with a preliminary test. Finally, we illustrated the advantages of our approach but also its limitations.
Future studies will address these limitations, namely the calculation of the one-way delay despite the encryption, as well as the determination of the encryption key if DTLS is used for key exchange.
References 1. Valin, J., Vos, K., Terriberry, T.: Definition of the Opus Audio Codec. RFC 6716 (Proposed Standard). RFC. RFC Editor, Fremont, CA, USA, September 2012. https://doi.org/10.17487/RFC6716 2. Maruschke, M., Jokisch, O., Meszaros, M., Trojahn, F., Hoffmann, M.: Quality assessment of two fullband audio codecs supporting real-time communication. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 571–579. Springer, Cham (2016). https://doi.org/10.1007/978-3-31943958-7 69 3. ITU-T: Methods for Objective and Subjective Assessment of Quality-Methods for Subjective Determination of Transmission Quality. REC P.800, August 1996. http://www.itu.int/rec/T-REC-P.800-199608-I/en 4. ITU-T: Methods for Objective and Subjective Assessment of Quality Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for Endto-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. REC P.862, February 2001. http://www.itu.int/rec/T-REC-P.862200102-I/en 5. ITU-T: Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs. REC P.862.2, November 2007. https://www.itu.int/rec/T-REC-P.862.2-200711-I/en 6. ITU-T: Perceptual Objective Listening Quality Assessment (POLQA): An Objective Method for End-to-End Speech Quality Assessment of Wide-Band and Superwide-Band Telephone Networks and Speech Codecs. REC P.863. http:// www.itu.int/rec/T-REC-P.863/en 7. ITU-T Study Group 12: A Subjective ACR LOT Testing Fullband Speech Coding and Prediction by P.863. Contribution SG12-C.22, 19 January 2017. https://www. itu.int/md/T17-SG12-C-0022/en 8. ITU-T: Single-Ended Method for Objective Speech Quality Assessment in NarrowBand Telephony Applications. REC P.563, May 2004. https://www.itu.int/rec/TREC-P.563/en 9. ITU-T: Application Guide for Recommendation ITU-T P.863. REC P.863.1, September 2014. https://www.itu.int/rec/T-REC-P.863.1/en 10. ITU-T: Application Guide for Objective Quality Measurement Based on Recommendations P.862, P.862.1 and P.862.2. REC P.862.3, November 2007. https:// www.itu.int/rec/T-REC-P.862.3/en 11. Digital audio transmission for use in studio, stage or field applications. US4922536 A. Hoque, T. I., 1 May 1990. http://www.google.com/patents/US4922536 12. Wireshark-Community: Wireshark · Go Deep, 30 November 2017. https://www. wireshark.org/. Accessed 13 Dec 2017 13. Maruschke, M., Jokisch, O., Meszaros, M., Iaroshenko, V.: Review of the Opus codec in a WebRTC scenario for audio and speech communication. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 348–355. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7 43
14. Jennings, C., Narayanan, A., Burnett, D., Bergkvist, A.: WebRTC 1.0: Realtime Communication Between Browsers. W3C Editor’s Draft, 30 November 2017. http://w3c.github.io/webrtc-pc/ 15. Alvestrand, H.T.: Overview: Real Time Protocols for Browser-based Applications. Internet-Draft, Fremont CA, USA, 12 November 2017. https://tools.ietf.org/html/ draft-ietf-rtcweb-overview-19 16. Rosenberg, J., Schulzrinne, H.: An Offer/Answer Model with Session Description Protocol (SDP). RFC 3264 (Proposed Standard). RFC. Updated by RFC 6157. RFC Editor, Fremont, CA, USA, June 2002. https://doi.org/10.17487/RFC3264 17. Takeda, Y.: Symmetric NAT Traversal using STUN. Internet-Draft, Fremont CA, USA, June 2003. https://tools.ietf.org/id/draft-takeda-symmetric-nat-traversal00.txt 18. Baugher, M., McGrew, D., Naslund, M., Carrara, E., Norrman, K.: The Secure Real-time Transport Protocol (SRTP). RFC 3711 (Proposed Standard). RFC. Updated by RFCs 5506, 6904. RFC Editor, Fremont, CA, USA, March 2004. https://doi.org/10.17487/RFC3711 19. Perkins, C., Westerlund, M., Ott, J.: Web Real-Time Communication (Web-RTC): Media Transport and Use of RTP. Internet-Draft, Fremont, CA, USA, 18 September 2016. https://tools.ietf.org/html/draft-ietf-rtcweb-rtp-usage-26 20. The Tcpdump Team: Tcpdump/Libpcap Public Repository, 3 September 2017. http://www.tcpdump.org. Accessed 13 Dec 2017 21. Meszaros, M., Maruschke, M.: Verhaltensanalyse von Einplatinencomputern Beim Transcoding von Echtzeit-Audiodaten. In: Elektronische Sprachsignalverarbeitung 2016. Tagungsband Der 27. Konferenz, vol. 81, pp. 237–245 (2016) 22. Lennox, J., Ott, J., Schierl, T.: Source-Specific Media Attributes in the Session Description Protocol (SDP). RFC 5576 (Proposed Standard). RFC. RFC Editor, Fremont, CA, USA, June 2009. https://doi.org/10.17487/RFC5576 23. ITU-T: Test Signals for Use in Telephonometry. REC P.501, March 2017. https:// www.itu.int/rec/T-REC-P.501-201703-I/en 24. Andreasen, F., Baugher, M., Wing, D.: Session Description Protocol (SDP) Security Descriptions for Media Streams. RFC 4568 (Proposed Standard). RFC. RFC Editor, Fremont, CA, USA, July 2006. https://doi.org/10.17487/RFC4568 25. Rescorla, E., Modadugu, N.: Datagram Transport Layer Security Version 1.2. RFC 6347 (Proposed Standard). RFC. Updated by RFCs 7507, 7905. RFC Editor, Fremont, CA, USA, January 2012. https://doi.org/10.17487/RFC6347
Automatic Phonetic Segmentation and Pronunciation Detection with Various Approaches of Acoustic Modeling Petr Mizera and Petr Pollak(B) Faculty of Electrical Engineering, Czech Technical University in Prague, K13131, Technicka 2, 166 27 Praha 6, Czech Republic {mizera,pollak}@fel.cvut.cz www.fel.cvut.cz www.noel.feld.cvut.cz/speechlab
Abstract. The paper describes HMM-based phonetic segmentation realized with the KALDI toolkit, with a focus on the accuracy of various approaches to acoustic modeling: GMM-HMM vs. DNN-HMM, monophone vs. triphone, speaker-independent vs. speaker-dependent. The analysis was performed on the TIMIT database and it proved the contribution of advanced acoustic modeling to the choice of a proper pronunciation variant. For this purpose, a lexicon covering the pronunciation variability among TIMIT speakers was created on the basis of the phonetic transcriptions available in the TIMIT corpus. When the proper sequence of phones is recognized by the DNN-HMM system, more precise boundary placement can then be obtained using basic monophone acoustic models. Keywords: Automatic phonetic segmentation · Pronunciation variability · GMM-HMM · DNN-HMM · KALDI · TIMIT
1
Introduction
Automatic phonetic segmentation is a procedure which determines the boundary locations of particular phones in a given utterance; its usage is necessary in situations when phone boundaries must be found for very large corpora. It is typically used for the creation of subword units for the purpose of concatenative speech synthesis [8,13], for the determination of phone boundaries in large speech corpora for the training of neural-network-based speech recognition systems, or in other applications motivated by a study of pronunciation variability based on the analysis of phonetic segmentation results. Detailed analysis of particular phone realizations can also contribute to the clinical diagnostics of serious diseases which influence speech production, or to an analysis of pronunciation variability in spontaneous or informal speech [9].
The basic solution applied to the determination of phone boundaries is based on the forced alignment of trained HMM models to a given utterance with an available acoustic realization and known content, optimally at the phonetic level. This procedure is routinely used as a significant step during the training of acoustic models of speech recognizers. It can be realized by various toolkits which implement HMM-based speech recognition, e.g. HTK [17], Sphinx [1], RWTH [14], or KALDI [12]. KALDI in particular is nowadays one of the most popular toolkits used world-wide by the speech research community. In this paper, we present an analysis of phonetic segmentation accuracy using the KALDI toolkit. We use the acoustic models available in the standard KALDI TIMIT recipe; however, we work with a more common setup in which the phonetic content is not known. Many previously published approaches based on the TIMIT corpus worked with the available phone boundaries, and many of them used the known phonetic content of each utterance as the input of forced alignment. Finally, we analyzed the accuracy of boundary determination as well as the precision of the choice of the proper pronunciation variant when the transcription is available at the word level and higher pronunciation variability is assumed in the realized utterances.
2
Method
As was mentioned above, the KALDI toolkit is frequently used for speech recognition by the research community and is consequently under continuous development. Currently, it covers many contemporary advanced techniques used within the particular modules of ASR, including advanced techniques of acoustic modeling, mainly DNN-based systems. However, the usage of KALDI for speech segmentation is not so frequent [7]. The availability of advanced acoustic modeling techniques in KALDI was the main motivation for this study, which analyzes how they benefit the precision of phonetic segmentation. 2.1
Phonetic Segmentation
Concerning the boundary determination, we used the rather standard approach of forced alignment. Its implementation in the KALDI toolkit allowed us to study various approaches to acoustic modeling used in the typical solutions (“recipes”) available within the KALDI distribution. Generally, we selected AM models which were suitable for generating targets for DNN-HMM training. We experimented with the frequently used GMM-HMM models [12], i.e. the basic and simplest AM based on monophones (marked in the following text by the acronym mono), a speaker-independent triphone AM using basic short-time cepstral features (acronym tri1), a speaker-independent triphone model with LDA features (acronym tri2), and a speaker-dependent triphone AM obtained by fMLLR and speaker-adaptive training (acronym tri3). Finally, the most advanced AM used in this work was a DNN-HMM model (acronym dnn) with a neural network topology consisting of an input layer with 440 units followed by 6 hidden layers with 2048 neurons per layer. The process of building the DNN-HMM
system started with the initialization of the hidden layers by Restricted Boltzmann Machines and was concluded by frame cross-entropy training [4]. More advanced AM models based on time-delay neural networks with the lattice-free version of maximum mutual information, or on long short-term memory networks [11], were not investigated. They typically help to improve the WER in speech recognition, but they are not so well suited for the determination of phone boundaries using forced alignment1. Speech features were computed in accordance with the setup used in the KALDI recipes. As basic cepstral features (used in AMs mono and tri1), we used 13 mel-frequency cepstral coefficients including the zeroth cepstral coefficient, computed for short-time frames with a length of 25 ms shifted over the signal with a step of 10 ms. Cepstral mean normalization was applied to this 13-element vector of static short-time features, and the features were completed by delta (dynamic) and delta-delta (acceleration) features to the final length of 39. LDA features (used in AM tri2) were computed from the context obtained by splicing 5 short-time feature vectors to both sides, followed by LDA and MLLT realizing decorrelation and the reduction of the dimension to 40. For AM tri3, this was followed by feature-space maximum likelihood linear regression (fMLLR) per speaker (also called speaker-adaptive training, SAT). Finally, these 40-dimensional fMLLR features with mean and variance normalization, extended with both-side context, were used as the input of the dnn AM. 2.2
Impact of Pronunciation Lexicon
The accuracy of the forced-alignment technique used for phonetic segmentation relies on the quality of its inputs. Of course, this means the quality of the acoustic data; however, it also depends strongly on the accuracy of the input phonetic content. The phonetic content of utterances, usually transcribed at the orthographic level, can be obtained by grapheme-to-phoneme conversion or from a pronunciation lexicon, which can also cover pronunciation variability [2] by including more pronunciation variants. This approach must definitely be used when phone boundaries should be determined for spontaneous and informal speech, for a higher diversity of language dialects, as well as in other situations when the level of pronunciation variability is rather high [18]. Such a lexicon can be obtained manually (for some very specific situations) or automatically (by extending regular pronunciations with particular phone substitutions or reductions on the basis of defined rules [9,10]). In the presented work, we analyzed the accuracy of phone boundary determination in the case when the lexicon contained more pronunciation variants. For this purpose, we created a lexicon containing all pronunciations which appeared within the phonetic transcription of the TIMIT corpus (further called timit-variants). It was obtained from the available transcriptions at the word and phone level, i.e. as a new word pronunciation we took the sequence of all phones which lay within the word boundaries. Finally, the significant majority
1 Discussed in the KALDI community at https://groups.google.com/forum/#!topic/kaldi-help/cSAm5iXGhZo.
of words from TIMIT had more than one pronunciation, so we could also analyze the ability of the used AMs to recognize the correct pronunciation variant for particular word realizations. In total, we obtained 19184 pronunciations for 6256 words; moreover, in some cases the number of pronunciation variants was very high (22 words have more than 20 pronunciations), as shown in more detail in Table 1. Using the TIMIT corpus, this lexicon should simulate a realistic situation of phonetic segmentation of informal speech, when each word can have more pronunciations due to the pronunciation variability of the informal speaking style.
Table 1. Lexicon timit-variants – statistics.
No. of pronunciation variants:  1    2     3–5   6–10  11–20  21–50  65
No. of words:                   631  3372  1516  637   78     21     1
When a pronunciation lexicon contains such a high number of pronunciation variants (20 and more), the correct detection of the proper pronunciation variant is a very important task, and phonetic segmentation in this setup can also serve to detect the proper pronunciation variants within an analyzed utterance. It can then play an important role in research focused on pronunciation variability, and it was also analyzed in this work.
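To make the construction of the timit-variants lexicon concrete, the following Python sketch collects, for every word, all phone sequences lying within its boundaries from TIMIT word-level (.wrd) and phone-level (.phn) transcriptions. It is only an approximation of the procedure described above; the normalization details and file locations are assumptions, not taken from the paper.

```python
# Illustrative sketch of collecting a timit-variants style lexicon from TIMIT
# word-level (.wrd) and phone-level (.phn) transcription files, whose lines
# have the form "start end label" (sample indices and a label).
from collections import defaultdict

def read_segments(path):
    """Read 'start end label' lines of a TIMIT .wrd or .phn file."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split()
            segments.append((int(start), int(end), label))
    return segments

def collect_variants(utterances, lexicon=None):
    """utterances: iterable of (wrd_path, phn_path) pairs."""
    lexicon = lexicon if lexicon is not None else defaultdict(set)
    for wrd_path, phn_path in utterances:
        words = read_segments(wrd_path)
        phones = read_segments(phn_path)
        for w_start, w_end, word in words:
            # take the sequence of all phones lying within the word boundaries
            pron = tuple(p for p_start, p_end, p in phones
                         if p_start >= w_start and p_end <= w_end)
            if pron:
                lexicon[word].add(pron)
    return lexicon
```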
3
Experiments
The experimental part of this research was focused on the analysis of phonetic segmentation accuracy from the following three aspects: the optimum choice of proper acoustic model, the impact of extended pronunciation lexicon, and finally, the accuracy of pronunciation variant detection when more variants are available in the lexicon. 3.1
Used Tools and Speech Databases
All experiments were realized on the basis of the TIMIT corpus [3], which is often used as a standard for the evaluation of phoneme classification, phoneme recognition, or phonetic segmentation for English. As mentioned above, the designed acoustic model systems were built using the KALDI toolkit.
Table 2. TIMIT data sets used in the presented evaluations.
Data set            Speakers  Sentences  Hours  Num. words  Num. boundaries
TRAIN               462       3696       3.14   30132       –
CORE test set       24        192        0.16   1570        7215
COMPLETE test set   168       1344       0.81   11025       50754
We started with the standard s5 recipe for TIMIT available in the KALDI distribution and optimized it with regard to improving the accuracy of the automatic phonetic segmentation task. The published recipe has been designed mainly for the phoneme recognition task and works with the reference train and CORE test sets. For the phonetic segmentation task, we generated the TIMIT COMPLETE test set with 168 speakers and 1344 test sentences. Only the phonetically compact sentences (marked as SX sentences) and the phonetically diverse ones (marked as SI sentences) were used for our experiments. The TIMIT phoneme set was reduced from 61 to 48 final phonemes, which were used for acoustic modeling. A further reduction to 39 phones was finally used for boundary scoring, as is standard for English in the KALDI recipes as well as in the ASR systems of many other authors [6]. The HMM topology consisted of 3-emitting-state models for non-silence phonemes and 5-emitting-state models for silence, and the direct phoneme transcription, which also included silence marks, was used for training the AMs. Therefore silence appeared in the training graphs and silence boundaries were scored; optional silence was not used in our experiments. Finally, we used 50754 boundaries from the COMPLETE test set and 7215 boundaries from the CORE test set for our evaluations. The summary of the used data sets is presented in Table 2. 3.2
Evaluation Criteria
The evaluation of phonetic segmentation accuracy was done using criteria describing both the accuracy at the level of phone recognition correctness and the accuracy of phone boundary placement (similarly to other authors, e.g. [5,7]). First, the phone recognition correctness is evaluated in the standard way using the Phone Error Rate computed on the basis of the Levenshtein distance as
$$PER = \frac{S + D + I}{N} \cdot 100 \quad (1)$$
where N is the number of phones in the reference and S, D, and I are the numbers of substitutions, deletions, and insertions in the aligned data. It is also suitable to evaluate the Phone Correctness computed as
$$PCorr = \frac{N - S - D}{N} \cdot 100 \quad (2)$$
because the evaluation of the accuracy of particular boundary placement makes sense just for correctly recognized phones. For further evaluations, all deleted phones are removed from the reference transcript, inserted phones from the aligned transcript, and substituted phones are removed from both of them. The cleared transcripts are then used for the evaluation of boundary placement accuracy. When we have two pairs of reference and transcribed boundaries for each phone realization, i.e. $beg_{ph,ref}[i]$ and $end_{ph,ref}[i]$ vs. $beg_{ph}[i]$ and $end_{ph}[i]$, the following two criteria, Phone Beginning Error (PBE) and Phone End Error (PEE), can be defined as
$$PBE_{ph}[i] = \left| \, beg_{ph}[i] - beg_{ph,ref}[i] \, \right| , \quad (3)$$
$$PEE_{ph}[i] = \left| \, end_{ph}[i] - end_{ph,ref}[i] \, \right| . \quad (4)$$
The accuracy of phone boundary placement can be approximated using the rate of phone boundary errors below a chosen threshold, which can be defined as
$$PBE_{ph,thr} = \frac{\sum_{i=1}^{N_{ph}} \left( PBE_{ph}[i] < thr \right)}{N_{ph}} \quad (5)$$
where ph is the phone/class identification, $N_{ph}$ is the number of phone/class realizations, and thr is the value of the chosen error threshold. The same procedure is applied for the computation of $PEE_{ph,thr}$. The threshold values used for the realized evaluations within this work were 5, 10, 20, or 30 ms, respectively. All of these criteria can be computed with basic statistics for all particular phones; however, more frequent is their evaluation over defined phone classes, which are generally language independent. We used the phone classes for English according to [5], i.e. VOW - vowels, GLI - semivowels and glides, VFR - voiced fricatives, UFR - unvoiced fricatives, NAS - nasals, STP - stops, UST - unvoiced stops, and SIL - silence. Finally, we define PronER (Pronunciation Error Rate) to evaluate the pronunciation detection accuracy
$$PronER = \frac{S}{N} \cdot 100 \quad (6)$$
where N is the total number of words in the reference set and S is the number of incorrectly recognized (substituted) pronunciation variants.
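The criteria above can be implemented directly; the following Python sketch computes PER (Eq. 1), the thresholded boundary error rate (Eq. 5) and PronER (Eq. 6). The Levenshtein alignment is a plain dynamic-programming version and may differ in details (e.g. tie-breaking) from the KALDI-based scoring actually used in the paper.

```python
# Sketch of the evaluation criteria defined above.

def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) between two phone lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)                      # deletions only
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)                      # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = ref[i - 1] != hyp[j - 1]
            c, s, d, ins = dp[i - 1][j - 1]
            sub = (c + diff, s + diff, d, ins)
            c, s, d, ins = dp[i - 1][j]
            dele = (c + 1, s, d + 1, ins)
            c, s, d, ins = dp[i][j - 1]
            inse = (c + 1, s, d, ins + 1)
            dp[i][j] = min(sub, dele, inse)
    _, S, D, I = dp[n][m]
    return S, D, I

def per(ref, hyp):
    """Eq. (1): Phone Error Rate in percent."""
    S, D, I = edit_counts(ref, hyp)
    return (S + D + I) / len(ref) * 100

def pbe_thr(pbe_values, thr):
    """Eq. (5): share of boundaries whose beginning error is below thr."""
    return sum(e < thr for e in pbe_values) / len(pbe_values)

def pron_er(ref_prons, hyp_prons):
    """Eq. (6): rate of incorrectly selected pronunciation variants."""
    wrong = sum(r != h for r, h in zip(ref_prons, hyp_prons))
    return wrong / len(ref_prons) * 100
```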
3.3
Results
3.3.1 Direct Phonetic Segmentation As the TIMIT database contains transcriptions at the phone level, it first enabled us to evaluate the accuracy of phonetic segmentation with maximally precise inputs of the HMM-based forced alignment. In fact, this is the optimum input of forced alignment with 100% correct phonetic content, when no phone needs to be recognized and the PER is equal to 0%. The obtained results are given in Table 3. Similarly to several other works (e.g. [7] or [16]), the best results were obtained for the simplest monophone AM, for both the core and complete test sets. The slightly lower accuracy of the triphone- and DNN-based AMs might be caused by the fact that their input features are taken from a larger context, which leads to higher uncertainty in the determination of a boundary position. Furthermore, the speaker-dependent AMs are probably estimated with lower accuracy due to the limited amount of data per speaker in the TIMIT corpus. Concerning the monophone AM, we looked for its optimized setup. As in other published works [7], it was confirmed that a smaller number of Gaussian mixtures per state gave better results. The best ones were achieved for 2 mixtures per state, see Table 4. The number in the acronyms mono144, mono288, etc. in Tables 3 and 4 represents the number of Gaussian components in the whole HMM, e.g. 288 means 288 components for 2 mixtures per state, 3 emitting states per monophone, and 48 phones in the given HMM (2 × 3 × 48).
Fig. 1. Phone Beginning Error PBE for particular phone classes: blue - monophone system, red - DNN-based system. (Color figure online)
Table 3. Results of direct phonetic segmentation, PER = 0, PCorr = 100.
        CORE SET                      COMPLETE SET
        5 ms   10 ms  20 ms  30 ms    5 ms   10 ms  20 ms  30 ms
mono    29.16  52.79  83.08  93.00    29.00  52.71  82.79  92.63
tri1    27.80  51.21  81.69  92.82    27.84  50.89  81.40  92.12
tri2    27.40  49.55  79.72  91.45    27.10  48.96  79.27  90.91
tri3    27.42  49.34  79.18  91.24    27.18  48.74  78.41  90.36
dnn     27.73  48.87  78.84  90.77    27.11  48.49  78.32  90.09
Table 4. Optimization of the monophone AM for direct phonetic segmentation (PER = 0, PCorr = 100).
           CORE SET                      COMPLETE SET
           5 ms   10 ms  20 ms  30 ms    5 ms   10 ms  20 ms  30 ms
mono144    31.05  54.57  82.51  92.17    31.37  54.67  81.90  91.73
mono288    31.68  55.80  84.70  93.79    32.02  56.39  84.55  93.11
mono432    30.45  54.73  84.74  93.74    31.03  55.32  84.46  93.06
mono720    29.76  53.50  83.53  93.35    29.95  53.70  83.48  92.99
mono1008   29.16  52.79  83.08  93.00    29.00  52.71  82.79  92.63
mono1440   28.18  51.50  81.80  92.82    28.13  51.31  81.80  92.30
Finally, the distribution of the PBE values for particular phone classes is presented in Fig. 1. The bars describe the distribution of PBE between the 0.25 and 0.75 percentiles. Significantly worse results are observed for the DNN system; however, the significant increase of the error is observed mainly for silence, while the deterioration within the phone classes is not so critical.
3.3.2 Phonetic Segmentation with Pronunciation Variability The second analysis describes the phonetic segmentation when the exact phone sequence is not available and the phonetic content is obtained from a pronunciation lexicon. This is the most frequent approach to obtaining the phonetic content of an utterance; however, the core issue is how well the variability of pronunciation is covered in the lexicon and how the proper choice of the word pronunciation variant influences the accuracy of phonetic segmentation.
Table 5. Phonetic segmentation with the canonic lexicon.
                          PER    PCorr  5 ms   10 ms  20 ms  30 ms
CORE        mono          32.58  71.43  23.94  43.54  72.39  85.82
            dnn           31.88  71.45  23.67  40.14  65.28  80.60
            mono288-dnn   31.88  71.45  25.78  45.37  72.01  84.54
COMPLETE    mono          31.15  72.28  23.92  43.23  72.34  85.78
            dnn           30.52  72.28  23.43  40.32  65.83  80.59
            mono288-dnn   30.52  72.28  26.39  46.38  72.71  84.93
Table 6. Phonetic segmentation with the TIMIT-variant lexicon.
                          PER    PCorr  5 ms   10 ms  20 ms  30 ms
CORE        mono          12.24  89.69  28.77  51.61  82.26  92.58
            dnn            9.58  92.03  27.64  48.55  78.03  90.05
            mono288-dnn    9.58  92.03  31.28  54.94  83.81  93.09
COMPLETE    mono          12.06  89.62  28.82  52.11  82.25  92.30
            dnn           10.00  92.06  27.16  48.28  77.73  89.55
            mono288-dnn   10.00  92.06  31.91  55.98  84.17  92.93
We realized the experiments with 3 pronunciation lexica: the first lexicon contained just the canonic pronunciations, the second one contained all pronunciation variants realized by the speakers in the TIMIT corpus, and the third one was based on merging the previous two lexica. The obtained results are shown in Tables 5, 6 and 7, and a significant decrease of PER was observed when the lexicon contained pronunciation variants. Further, the usage of a more advanced AM (the DNN-based one) contributed to a further decrease of the achieved PER below 10%. Consequently, this means an increase of PCorr, i.e. more than 92% of all phones were correctly identified; however, the accuracy of boundary determination slightly decreased when the DNN-based system was used. On the other hand, when the recognized phone sequence is realigned with the optimized monophone system with 288 Gaussian components (acronym mono288-dnn), both the best PER and the best boundary placement accuracy were achieved [15].
Table 7. Phonetic segmentation with the canonic lexicon extended by TIMIT variants.
                          PER    PCorr  5 ms   10 ms  20 ms  30 ms
CORE        mono          12.43  89.48  28.79  51.69  82.25  92.64
            dnn            9.76  91.88  27.65  48.51  77.99  90.00
            mono288-dnn    9.76  91.88  31.33  55.00  83.84  93.12
COMPLETE    mono          12.40  89.28  28.83  52.08  82.25  92.31
            dnn            9.28  92.17  27.12  48.22  77.63  89.44
            mono288-dnn    9.28  92.17  31.92  55.97  84.14  92.93
3.3.3 Pronunciation Recognition In the end, we analyzed the correctness of the pronunciation variant selection mentioned above. In fact, it was already partly quantified by the decrease of PER described in the previous section; however, for many words we had a rather high number of pronunciation variants, and the ability to select the correct pronunciation variant could be a very important feature of such a system. From the results described in Table 8, we can observe a significant decrease of PronER (Pronunciation Error Rate) when more advanced acoustic modeling and a lexicon covering pronunciation variants are used. The best results were obtained with the DNN-based system: while 76.34% was obtained for the basic monophone system on the CORE test set, 31.89% was achieved for the DNN-based system. The contribution of GMM-HMM systems with triphone-based models was proven too. The same trend was also observed for the COMPLETE set.
Table 8. Pronunciation variant recognition.
                     canonic          timit            canonic+variants
                     PER    PronER    PER    PronER    PER    PronER
CORE        mono     32.58  76.34     12.24  39.48     12.43  40.18
            tri1     32.80  76.28     11.49  37.82     11.74  38.46
            tri2     32.55  76.28     11.16  35.97     11.31  36.54
            tri3     32.46  76.28     10.24  33.48     10.42  34.06
            dnn      31.88  76.34      9.58  31.44      9.76  31.89
COMPLETE    mono     31.15  74.22     12.06  40.39     12.40  41.44
            tri1     31.79  74.21     11.89  37.06     11.45  37.87
            tri2     31.45  74.22     11.17  35.77     11.00  36.60
            tri3     31.30  74.21     10.75  33.82     10.23  34.56
            dnn      30.52  74.22     10.00  31.46      9.28  32.19
4
Conclusions
The implementation of HMM-based phonetic segmentation realized with the KALDI toolkit was described in this paper, together with an analysis of the contribution of various acoustic modeling approaches to the final accuracy of phone boundary determination. The evaluations were performed on the TIMIT database and they proved the contribution of advanced acoustic modeling to the choice of the proper pronunciation variant. We achieved more than 92% correctness of phone recognition within forced alignment with the DNN-HMM system, together with an improvement of phone boundary placement realized in a second step by the optimized monophone GMM-based system; 83.84% of phone beginning boundaries were determined with an error smaller than 20 ms, and for an error smaller than 30 ms it was 93.12%. These results were obtained without any further boundary correction, as it is not currently required by our application, and they are comparable to results obtained without any boundary refinement published by other authors. For the purpose of pronunciation variability modeling, the lexicon covering the pronunciation variants of particular words among TIMIT speakers was created on the basis of the phonetic transcriptions available in this corpus. Acknowledgments. The research described in this paper was supported by internal CTU grant SGS17/183/OHK3/3T/13 “Special Applications of Signal Processing”.
References 1. CMUSphinx: Open source speech recognition toolkit. http://cmusphinx.github.io 2. Brunet, R.G., Murthy, H.A.: Pronunciation variation across different dialects for English: a syllable-centric approach. In: 2012 National Conference on Communications (NCC) (2012) 3. Garofolo, J.S., et al.: TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Linguistic Data Consortium, Philadelphia (1993) 4. Ghoshal, A., Povey, D.: Sequence-discriminative training of deep neural networks. In: Proceedings of the INTERSPEECH, Lyon, France (2013) 5. Kahn, A., Steiner, I.: Qualitative evaluation and error analysis of phonetic segmentation. In: 28. Konferenz Elektronische Sprachsignalverarbeitung, Saarbr¨ ucken, Germany, pp. 138–144 (2017) 6. Lee, K.F., Hon, H.W.: Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Audio Speech Lang. Process. 37(11), 1641–1648 (1989) 7. Matouˇsek, J., Kl´ıma, M.: Automatic phonetic segmentation using the KALDI toolkit. In: Ekˇstein, K., Matouˇsek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 138–146. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-642062 16 8. Matouˇsek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10. 1007/978-3-540-39398-6 41
9. Mizera, P., Pollak, P., Kolman, A., Ernestus, M.: Impact of irregular pronunciation on phonetic segmentation of Nijmegen corpus of casual Czech. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2014. LNCS (LNAI), vol. 8655, pp. 499–506. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10816-2 60 10. Nouza, J., Silovsk´ y, J.: Adapting lexical and language models for transcription of highly spontaneous spoken Czech. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2010. LNCS (LNAI), vol. 6231, pp. 377–384. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15760-8 48 11. Peddinti, V., Wang, Y., Povey, D., Khudanpur, S.: Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 25(3), 373–377 (2018) 12. Povey, D., et al.: The Kaldi speech recognition toolkit. In: Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, ASRU 2011 (2011) 13. Rendel, A., Sorin, A., Hoory, R., Breen, A.: Toward automatic phonetic segmentation for TTS. In: Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, pp. 4533–4536 (2012) 14. Rybach, D., et al.: The RWTH Aachen university open source speech recognition system. In: Proceedings of Interspeech 2009 (2009) 15. Stolcke, A., Ryant, N., Mitra, V., Yuan, J., Wang, W., Liberman, M.: Highly accurate phonetic segmentation using boundary correction models and system fusion. In: Proceedings of ICASSP, Florence, Italy (2014) 16. Toledano, D.T., G´ omez, L.A.H., Grande, L.V.: Automatic phoneme segmentation. IEEE Trans. Speech Audio Process. 11(6), 617–625 (2003) 17. Young, S., et al.: The HTK Book, Version 3.4.1. Cambridge (2009) 18. Yuan, J., Ryant, N., Liberman, M., Stolcke, A., Mitra, V., Wang, W.: Automatic phonetic segmentation using boundary models. In: Proceedings of INTERSPEECH, Lyon, France, pp. 2306–2310 (2013)
Improving Neural Models of Language with Input-Output Tensor Contexts Eduardo Mizraji1(&), Andrés Pomi1, and Juan Lin1,2 1
Group of Cognitive Systems Modeling, Biophysics Section, Facultad de Ciencias, Universidad de la República, Iguá 4225, 11400 Montevideo, Uruguay [email protected], [email protected], [email protected] 2 Department of Physics, Washington College, Chestertown, MD 21620, USA
Abstract. Tensor contexts enlarge the performances and computational powers of many neural models of language by generating a double filtering of incoming data. Applied to the linguistic domain, its implementation enables a very efficient disambiguation of polysemous and homonymous words. For the neurocomputational modeling of language, the simultaneous tensor contextualization of inputs and outputs inserts into the models strategic passwords that route words towards key natural targets, thus allowing for the creation of meaningful phrases. In this work, we present the formal properties of these models and describe possible ways to use contexts to represent plausible neural organizations of sequences of words. We include an illustration of how these contexts generate topographic or thematic organization of data. Finally, we show that double contextualization opens promising ways to explore the neural coding of episodes, one of the most challenging problems of neural computation. Keywords: Matrix memories · Tensor contexts · Semantic spaces · Episodic memory · Word strings
Gradually, it saw itself (like us) imprisoned in this sonorous web of Before, After, Yesterday, While, Now, Right, Left, Me, You, Those, Others. From “The Golem” by J.L. Borges
1 Introduction The procedures developed by the human brain to organize sequences of semantic elements that create meaningful phrases are yet an unsolved problem. Such a sequence can be metaphorically congruent to the search for the exit of an intricate labyrinth, with myriad galleries connecting thousands of semantic modules. In this labyrinth, the output of a module is specifically guided toward its next module, a process that generates a completely non-random sequence of words. This controlled guidance can be due to the existence of specific “keys” that select and open the next appropriate semantic target. © Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 430–440, 2018. https://doi.org/10.1007/978-3-319-99579-3_45
Taking into account the extremely large number of possibilities offered by the semantic network, the possibility of building rapid meaningful phrases in natural language strongly suggests that these output keys explore all their potential targets in parallel. An interesting approach would be to consider the creation of a meaningful phrase as analogous to the production of a sequence of motor acts oriented toward a goal [1– 5]. This analogy would assume that before the construction of a phrase there exists an objective that induces a layout over which the words are organized. In this case the goal is a communicational task, and a complete discourse can be structured by a set of subtargets that organize their parts. In this work we shall try to model the emergence of different kinds of language organization, by representing semantic modules with matrix associative memories. The many remarkable properties of these matrix memories are described in [6–10]. As “mesoscopic models” they connect algorithms operating on complex symbolic data to the neuro-dynamic level [11]. In this formalism, to find a path in the labyrinth of semantic modules would mean that outputs of matrix associative memories become inputs of particular memories that produce the words in the general layout of the phrase that is being created. Our contribution aims to fill this framework by showing that the modulation of inputs and outputs of matrix memories by tensor contexts provides a procedure to explain how coherent sequences of words can be created. In addition, this formalism implies the possibility of building thematic clusters in semantic spaces.
2 Basic Models In what follows we describe some properties of matrix associative memories and how tensor contexts enlarge their computational abilities. 2.1
Matrix Associative Memories
A matrix memory associates an m-dimensional column input vector $f_i$ with an n-dimensional output vector $g_i$. Kohonen [10] shows that a memory can be characterized by the set
$$Mem = \{(g_1, f_1), (g_2, f_2), \ldots, (g_Q, f_Q)\}. \quad (1)$$
This “learning set” represents the data to be stored in a matrix memory M. To find the appropriate structure of this matrix, define two partitioned matrices $G = [\,g_1 \; g_2 \cdots g_Q\,]$, $F = [\,f_1 \; f_2 \cdots f_Q\,]$, and represent the associations between the Q pairs of vector patterns by the matrix equation $G = MF$. Let $In = \{1, 2, \ldots, Q\}$ be the set of indexes of the stored pairs. Under this condition, the best solution in the sense of least squares is given in terms of the pseudoinverse $F^{+}$:
$$M = G F^{+}. \quad (2)$$
In the extremely simple case of an orthonormal set of inputs $\{f_i\}$, $i = 1$ to $Q$, Eq. (2) admits the closed expression
$$M = \sum_{i=1}^{Q} g_i f_i^{T}. \quad (3)$$
For this matrix memory the recall operates as follows:
$$M f_k = \sum_{i=1}^{Q} g_i \langle f_i, f_k \rangle, \quad (4)$$
with the scalar product being $\langle f_i, f_k \rangle = \delta_{ik}$ ($\delta_{ik}$ is the Kronecker delta); hence if the index $k \in In$, the recall is perfect, $M f_k = g_k$. 2.2
Input Tensor Contexts
Imagine we need to model a neural network capable of disambiguating homonymic or polysemic words. Networks with hidden layers trained with backpropagation are the classical devices for dealing with this kind of problem [12]. However, in such an approach we generally lose the possibility of a transparent mathematical theory allowing us to predict what happens during training as well as the final network structure. This opacity was the main motivation to develop a “transparent connectionist” alternative [13]. This alternative uses a kind of vector symbolic architecture based on tensor contextualization [11, 14, 15]. Let $f_i$ be one homonymic word, associated with two vectors $g_{i1}$ and $g_{i2}$ for two completely non-correlated concepts. For instance, the input can represent the word “bank” and one output would be “money” and the other would be “sand”. To retain the matrix format of the associative memory, we integrate the input with two vector contexts $p_{i1}, p_{i2} \in \mathbb{R}^{h}$ using the Kronecker product $\otimes$, a tensor procedure adapted to the operations of matrix algebra [16]. In our example, we could consider that the first context concerns finances and the second geography. The segment of a memory in our example can be expressed as:
$$M_i = g_{i1} (p_{i1} \otimes f_i)^{T} + g_{i2} (p_{i2} \otimes f_i)^{T}. \quad (5)$$
Consequently, when the memory receives an input and the corresponding context, the selection of the output happens via two scalar products:
$$M_i (p_{i2} \otimes f_i) = g_{i1} \langle p_{i1}, p_{i2} \rangle \langle f_i, f_i \rangle + g_{i2} \langle p_{i2}, p_{i2} \rangle \langle f_i, f_i \rangle. \quad (6)$$
In a situation where both the inputs and the contexts are orthonormal, we have a resolution of the ambiguity,
$$M_i (p_{i2} \otimes f_i) = g_{i2}. \quad (7)$$
This format can be generalized [14, 17, 18] to a global memory module composed of a variety of specialized sub-modules, each having the required complexity for the contextualization of its inputs:
$$M = \sum_{i} M_i. \quad (8)$$
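The disambiguation mechanism of Eqs. (5)–(7) can be reproduced numerically in a few lines; the sketch below uses NumPy with arbitrary orthonormal codes for the words and contexts of the “bank” example. These encodings are illustrative choices, not anything prescribed by the model.

```python
# Minimal numerical sketch of the context-dependent memory of Eqs. (5)-(7).
import numpy as np

def unit(dim, k):
    v = np.zeros(dim)
    v[k] = 1.0
    return v

f_bank = unit(4, 0)                        # homonymic input word "bank"
g_money, g_sand = unit(3, 0), unit(3, 1)   # two unrelated outputs
p_fin, p_geo = unit(2, 0), unit(2, 1)      # finance / geography contexts

# Eq. (5): M_i = g_i1 (p_i1 (x) f_i)^T + g_i2 (p_i2 (x) f_i)^T
M = (np.outer(g_money, np.kron(p_fin, f_bank)) +
     np.outer(g_sand,  np.kron(p_geo, f_bank)))

# Eq. (7): with orthonormal contexts the ambiguity is resolved
out = M @ np.kron(p_geo, f_bank)
print(np.allclose(out, g_sand))            # True
```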
2.3
Input-Output Contexts
We can extend the previous approach by modulating both inputs and outputs with vector contexts. This approach leads to memory matrices with the following general structure:
$$H = \sum_{i,j,k} (p'_{ik} \otimes g_{ij})(p_{ik} \otimes f_{ij})^{T}. \quad (9)$$
From the properties of Kronecker products, the H matrix admits some interesting alternative representations. We illustrate two of them:
$$H = \sum_{i,j,k} \left( p'_{ik} p_{ik}^{T} \right) \otimes \left( g_{ij} f_{ij}^{T} \right), \quad (10)$$
$$H = \sum_{i,j,k} \left( p'_{ik} \otimes I_{\dim(g_{ij})} \right) \left( g_{ij} f_{ij}^{T} \right) \left( p_{ik} \otimes I_{\dim(f_{ij})} \right)^{T}. \quad (11)$$
Note that inputs $p_{uv} \otimes f_{ab}$ with stored patterns display outputs given by
$$H (p_{uv} \otimes f_{ab}) = p'_{uv} \otimes g_{ab}. \quad (12)$$
These outputs are prepared to enter as inputs to a similar memory H’ with this particular pair [context - pattern] stored in its database. Memories with this structure accept many representational and computational potentialities to process the operations displayed by natural languages [19, 20]. In the next Sections we shall describe some of these operations.
3 Deterministic Semantic Strings In his “Principles of Psychology” (Vol. II, Chap. XXVI) James [21] writes that voluntary acts are based on consolidated memory traces created by previous involuntary acts. Similarly, the voluntary creation of phrases has as prerequisite the existence of word associations in previously fixed memories–developed after experiential contact with word usage. As we mentioned before language production could be seen as the generation of meaningful phrases, and may be similar to the assembly of a sequence of motor actions
aimed at reaching a goal [3, 4, 22]. The purpose of spoken or written phrases is to transmit information by means of expressions that can be understood. Neural modeling challenges us to reach this goal by triggering an appropriate chain of meaningful words. Let us suppose that a phrase could be represented by a string:
$$F(a, n) = \langle a_{a1}, a_{a2}, \ldots, a_{an} \rangle, \quad a_{ai} \in Sem\{a_1, a_2, \ldots, a_x\}, \quad (13)$$
with Sem being the very large set of words in a normal lexicon. The phrase can repeat words, and consequently it is possible to have $a_{ai} = a_{aj}$. Now, how do we ensure that $a_{a1}$ precedes $a_{a2}$? Moreover, how does the meaning of the phrase guide the correct order of successive words while information is transmitted? A possible answer to the first question would be to assume that the transition probabilities between words are responsible for the correct sequence, with a given word followed by its most probable successor. Within this framework, language production is mainly represented by a stochastic process with transition probabilities dependent on segments of previously used words [23–26]. The second question seems to imply the existence of an anticipatory layout for the phrase. Here, we explore the following proposal. Imagine a small string of three words $\langle a_{a1}, a_{a2}, a_{a3} \rangle$ representing a miniature phrase. Let us immerse these elements in contexts, generating a new string
$$\langle\, p_{targ} \otimes a_{a1} \otimes p_{1},\; p_{1} \otimes a_{a2} \otimes p_{2},\; p_{2} \otimes a_{a3} \otimes p_{end} \,\rangle. \quad (14)$$
The neural vector $p_{targ}$ is both the context that triggers the sequence and, concurrently, the target code. Contexts $p_1$ and $p_2$ are keys indicating the correct next element of the string, and context $p_{end}$ marks the end of the phrase. In this way, a good sequence of words is selected by the contextual string
$$\langle p_{targ}, p_{1}, p_{2}, p_{end} \rangle. \quad (15)$$
A recursive tensor input-output memory with the structure
$$S = (p_{end} \otimes a_{a3})(p_{2} \otimes a_{a2})^{T} + (p_{2} \otimes a_{a2})(p_{1} \otimes a_{a1})^{T} + (p_{1} \otimes a_{a1})\, p_{targ}^{T} \quad (16)$$
can accomplish the procedure just described. In the general case, the final output can be a “pure” string of words, $\langle a_{a1}, a_{a2}, \ldots, a_{an} \rangle$. The contexts, used in an internal, hidden computation, are channeled by a filter Way Out Matrix (WOM) having the structure
$$WOM = \left( \sum_{k} p_{k}^{T} \right) \otimes I_{\dim(a)}. \quad (17)$$
Fig. 1. This diagram illustrates how a context target enters a recursive semantic network S triggering a sequence of contextualized outputs. These outputs are filtered by a WOM matrix that extracts the contexts and produces a pure word string.
The sum includes all the relevant contexts, and $I_{\dim(a)}$ is an identity matrix with the same dimension as the word vectors. Note that
$$WOM\, (p_{c} \otimes a_{h}) = a_{h}. \quad (18)$$
In Fig. 1 we illustrate this recursive model for a string of arbitrary length. The neurobiology of lexical strings production is far from being understood. We can consider the voluntary construction of utterances by our model in light of William James’ thought. Our model requires the previous existence of permanent memories of words and contextual markers, and a transitory working memory to install the appropriate string format. Finally, we mention that the target ‘feeds and builds’ contexts to generate meaningful strings in the same way that the target of a mechanical movement of our arm guides the intermediate steps needed to reach it.
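A small numerical sketch of the recursive generator defined by Eqs. (16)–(18) is given below. The word and context codes are arbitrary unit vectors, and, to keep all inputs of S in the same product space in a runnable demo, the trigger is represented as p_targ ⊗ a_0 with a dummy start word a_0 — an implementation choice made for this example, not part of the model as stated.

```python
# Sketch of the recursive string generator of Eqs. (16)-(18) in NumPy.
import numpy as np

def unit(dim, k):
    v = np.zeros(dim); v[k] = 1.0
    return v

dim_w, dim_c = 4, 4
w0, w1, w2, w3 = (unit(dim_w, i) for i in range(4))          # w0: dummy start word
p_targ, p1, p2, p_end = (unit(dim_c, i) for i in range(4))   # contexts

# Eq. (16)-like recursive memory S; the trigger is embedded as p_targ (x) w0
# so that all input patterns have the same dimension.
S = (np.outer(np.kron(p_end, w3), np.kron(p2, w2)) +
     np.outer(np.kron(p2, w2),   np.kron(p1, w1)) +
     np.outer(np.kron(p1, w1),   np.kron(p_targ, w0)))

# Eq. (17): way-out matrix that strips any of the relevant contexts
WOM = np.kron((p1 + p2 + p_end).reshape(1, -1), np.eye(dim_w))

state, phrase = np.kron(p_targ, w0), []
for _ in range(10):                          # safety bound on the recursion
    state = S @ state                        # next contextualized output
    if not state.any():                      # nothing stored for this input
        break
    phrase.append(WOM @ state)               # Eq. (18): extract the pure word
print([int(np.argmax(w)) for w in phrase])   # -> [1, 2, 3]: w1, w2, w3 in order
```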
4 Clustering by Contexts The memory H given in Eq. (10), with sets of different input-output associations sharing the same pair of input-output contexts, can be factorized into clusters of associations induced by the contexts,
$$H = \sum_{i} \left( p'_{i} p_{i}^{T} \right) \otimes \left[ \sum_{j} g_{ij} f_{ij}^{T} \right]. \quad (19)$$
This partition suggests how scattered data may be organized in large neural networks. Contexts may create a topical coherence in a recall. Let us mention that an interesting formal parallelism between matrix memories and the Latent Semantic Analysis (LSA) has been described in [19]. In this direction, the structure of matrices (10) and (19) suggests the possibility of looking for the thematic clustering of textdocument matrices using, instead of a classical LSA based on SVD, a procedure that labels topics via the search of Kronecker factors.
436
E. Mizraji et al.
If we use as contexts unit vectors es (vectors with a 1 in position s and 0’s otherwise), the matrix H can be expressed as: X H¼ e0i eTi Mij ; ð20Þ i
with Mij ¼
X
gij f Tij
ð21Þ
being a classical Anderson-Kohonen associative memory matrix. By an adequate selection of dimensions for the context unit vectors, it is possible to generate a topographic pattern with different associative memories M placed as tiles into the “host” matrix H (Pomi, Mizraji and Lin, paper submitted). We illustrate this point with a simple example. Given the two unit column vectors e1 ¼ ½1 0T ; e2 ¼ ½0 1T and four associative memory matrices, MðmÞ 2 Rpq ; m ¼ 1; . . .; 4 H takes the form H ¼ e1 eT1 Mð1Þ þ e1 eT2 Mð2Þ þ e2 eT1 Mð3Þ þ e2 eT2 Mð4Þ :
ð22Þ
After computing the Kronecker products we find
Mð1Þ H¼ Mð3Þ
Mð2Þ ; H 2 R2p2q : Mð4Þ
ð23Þ
Thus, the contexts create a computational layer composed by various memory modules located in specific topographies, each one able to receive and redirect information selectively channeled by the contexts. Kohonen [27] developed one of the most important and deep procedures to model the generation of topographic neural patterns. The approach we are describing here assumes cognitive supervised learning. One could imagine associative memories to be the result of active interactions between a trainable brain and an external instructor–an active human teacher or environmental experiences. Hence, emergent clusters of associative memories may explain how, after extensive vocabulary learning, complex semantic webs can be established. We want to mention that the results of Huth et al. [28] experimentally illustrate the existence of a remarkable topographic organization in the semantic web of the human brain.
Improving Neural Models of Language with Input-Output Tensor Contexts
437
5 Episodes Since the foundational characterization of episodic memories by Tulving (updated in [29]), the search for their neural bases became an important research objective [30–34]. Adapting ideas of these investigators, we shall assume that episodic memories result from the interaction of different classes of memories, fundamentally, a semantic memory and a context memory that stores episode markers. We illustrate the interaction between these memory modules in Fig. 2.
Fig. 2. This scheme adapts to our model one of the conceptions about episode storage and retrieval. LH: Left hemisphere, RH: Right hemisphere, SM: Semantic Memory, CM: Contexts Memory.
We are going to assume that the encoding happens mainly in a region capable of sustaining a semantic memory (e.g.: the left prefrontal cortex) and the recall involves a region that stores contextual markers (e.g.: the right prefrontal cortex). The model we want to comment is formally similar to the model that generates semantic strings. However, there is a crucial difference: in episodes we do not necessarily have a target. A contingent series of events is stored in the memory due to a variety of causes, among others, emotional impact, autobiographical importance, bizarre consequences, etc. In these episodic sequences, contexts provide a kind of positional information–an expression of the embryologist Lewis Wolpert–that places words in the precise positions needed to recreate the episode. Let us define an episode by a time sequence of contexts that intermingle with words selected from the semantic memory. The sequence of contexts can be generated by a cyclic memory structured as: C ¼ pout pTn þ pn pTn1 þ þ p1 pTin :
ð24Þ
Context vector pin marks the beginning of the sequence, and context pout marks the end. Within a recursive network, the reinjection of successive outputs of memory C creates the time pattern hpout ; pn ; . . .; pi ; pin i:
ð25Þ
438
E. Mizraji et al.
Intermingling these contexts with words ai extracted from the semantic memory, builds the episodic sequence hðpout an pn Þ; ðpn an1 pn1 Þ; . . .; ðp3 a2 p2 Þ ; ðp2 a1 pin Þi:
ð26Þ
We are going to model this situation by assuming that intermingling occurs because the semantic memory is structured with associative memories that can be approximated by matrices like E¼
X
T p0ik aij pik aij ;
ð27Þ
i;j;k
with the particularity that context markers are very sparse vectors (e.g.: unit vectors). The total set of stored episodes can be based on a semantic basis of N words, N being very large. A given memory cannot store all this variety due to dimensional limitations. But memories like (25) can surpass the dimensional limitations imposed by neuroanatomy and enlarge the variety of episodes via a multi-modular semantic organization. The final step of the episodic recall can be a pure verbal string emerging from a WOM filter. We end this Section by mentioning that there is a close relationship between remembered episodes, and episodes created by the imagination. A fictional story does not travel to the autobiographical past, but creates episodes that we can recall even if such episodes are placed in the far past or future. This shows an interesting point concerning the possible coincidence between the neural systems responsible for the recall of personal biographical episodes and the imaginary generation of fictional facts (see [35, 36] for extensive references about this point), including the conception of innovative literary, philosophical, scientific, or technological scenarios.
6 Perspectives In this work we have assumed that a semantic unit, integrated with many contexts, could participate in a large variety of different linguistic tasks. The described models are written in terms of matrix algebra and Kronecker tensor products, which makes them operationally transparent and easily amenable to computer implementation, even though the dimensions involved in these linguistic tasks can be extremely large. In any case, the highly flexible production of organized, non-random sequences of words in a natural language is a marvelous and yet obscure process. The topical organization of a biological semantic web, with patches including elaborate pieces of language could plausibly be a basis for the hierarchical elaboration of complex thoughts. These thoughts are translated into linguistic codes and communicated. In a way, “deep learning” technological procedures involving a system of hierarchical computing levels, are already implemented by the human brain. We need to understand these codes, which in many cases, can be accompanied by linguistic productions. A simplified example of this kind of hierarchical processing is given in [20]. Finally, the recreation, or invention of episodes represents one of the most significant signatures of
the human mind and is placed, by researchers like Tulving [29], at the highest levels of cognition. With tensor input-output contexts we have been able to formulate an elementary approach to the modeling of these open and crucial problems. Acknowledgments. AP and EM acknowledge partial financial support by PEDECIBA and CSIC-UdelaR.
References 1. Luria, A.R.: The Working Brain. Basic Books, New York City (1973) 2. Kimura, D.: Neuromotor mechanisms in the evolution of human communication. In: Steklis, H.D., Raleigh, M.J. (eds.) Neurobiology of Social Communication in Primates, pp. 197–219. Academic Press, New York (1979) 3. Calvin, W.H.: A stone’s throw and its launch window: timing precision and its implications for language and hominid brains. J. Theor. Biol. 104, 121–135 (1983) 4. Calvin, W.H.: The unitary hypothesis: a common neural circuitry for novel manipulations, language, plan-ahead, and throwing? In: Gibson, K.R., Ingold, T. (eds.) Tools, Language, and Cognition in Human Evolution, pp. 230–250. Cambridge University Press, Cambridge (1993) 5. Ojemann, G.A.: Brain organization for language from the perspective of electrical stimulation mapping. Behav. Brain Sci. 6, 189–206 (1983) 6. Anderson, J.A.: A simple neural network generating an interactive memory. Math. Biosci. 14, 197–220 (1972) 7. Anderson, J.A.: An introduction to neural networks. MIT Press, Cambridge (1995) 8. Cooper, L.N.: A possible organization of animal memory and learning. In: Lundquist, B., Lundquist, S. (eds.) Proceedings of the Nobel Symposium on Collective Properties of Physical Systems, pp. 252–264. Academic Press, New York (1973) 9. Kohonen, T.: Correlation matrix memories. IEEE Trans. Comput. C-21, 353–359 (1972) 10. Kohonen, T.: Associative Memory: A System Theoretical Approach. Springer, Heidelberg (1977). https://doi.org/10.1007/978-3-642-96384-1. Chap. 3 11. Beim Graben, P., Potthast, R.: Inverse problems in dynamic cognitive modeling. Chaos Interdiscip. J. Nonlinear Sci. 19, 015103 (2009) 12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by backpropagating errors. Nature 323, 533–536 (1986) 13. Carmantini, G.S., Beim Graben, P., Desroches, M., Rodrigues, S.: A modular architecture for transparent computation in Recurrent Neural Networks. Neural Netw. 85, 85–107 (2017) 14. Mizraji, E.: Context-dependent associations in linear distributed memories. Bull. Math. Biol. 51, 195–205 (1989) 15. Smolensky, P.: Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif. Intell. 46, 159–216 (1990) 16. Graham, A.: Kronecker Products and Matrix Calculus With Applications. Ellis Horwood, Chichester (1981) 17. Pomi, A., Mizraji, E.: Semantic graphs and associative memories. Phys. Rev. E 70, 066136 (2004) 18. Pomi, A.: A possible neural representation of mathematical group structures. Bull. Math. Biol. 78, 1847–1865 (2016) 19. Mizraji, E., Pomi, A., Valle-Lisboa, J.C.: Dynamic searching in the brain. Cogn. Neurodyn. 3, 401–414 (2009)
20. Mizraji, E., Lin, J.: Modeling spatial-temporal operations with context-dependent associative memories. Cognit. Neurodyn. 9, 523–534 (2015) 21. James, W.: Principles of Psychology. The Great Books of the Western World, vol. 53. The University of Chicago (1890) 22. Nishitani, N., Schürmann, M., Amunts, K., Har, R.: Broca’s region: from action to language. Physiology 20, 60–69 (2005) 23. Jurafsky, D., Bell, A., Gregory, M., Raymond, W.D.: Probabilistic relations between words: evidence from reduction in lexical production. Typol. Stud. Lang. 45, 229–254 (2001) 24. Jurafsky, D.: Probabilistic modeling in psycholinguistics: linguistic comprehension and production. In: Bod, R., Hay, J., Jannedy, S. (eds.) Probabilistic Linguistics, p. 21. MIT Press, Cambridge (2003). Chap. 3 25. Nowak, M.A., Komarova, N.L., Niyogi, P.: Computational and evolutionary aspects of language. Nature 417, 611–617 (2002) 26. Chater, N., Manning, C.D.: Probabilistic models of language processing and acquisition. Trends Cognit. Sci. 10, 335–344 (2006) 27. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1997). https://doi.org/10.1007/ 978-3-642-97966-8 28. Huth, A.G., Nishimoto, S., Vu, A.T., Gallant, J.L.: A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 76, 1210–1224 (2012) 29. Tulving, E.: Episodic memory. Annu. Rev. Psychol. 53, 1–25 (2002) 30. Baddeley, A.: Working memory: looking back and looking forward. Nat. Rev. Neurosci. 4, 829–839 (2003) 31. Jonides, J.R., et al.: The mind and brain of short-term memory. Ann. Rev. Psychol. 59, 193– 224 (2008) 32. Repovs, G., Baddeley, A.: The multi-component model of working memory: explorations in experimental cognitive psychology. Neuroscience 139, 5–21 (2006) 33. Eichenbaum, H.: Prefrontal–hippocampal interactions in episodic memory. Nature Rev. Neurosci. 18, 547–558 (2017) 34. Schapiro, A.C., Turk-Browne, N.B., Botvinick, M.M., Norman, K.A.: Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philos. Trans. R. Soc. Lond B 372, 20160049 (2017) 35. Schacter, D.L., et al.: The future of memory: remembering, imagining, and the brain. Neuron 76, 677–694 (2012) 36. Schacter, D.L., Benoit, R.G., Szpunar, K.K.: Episodic future thinking: mechanisms and functions. Curr. Opin. Behav. Sci. 17, 41–50 (2017)
Sociolinguistic Variability of Predicate Groups in Colloquial Russian Speech

Anfisa Naumova
Saint Petersburg State University, Universitetskaya nab. 11, St. Petersburg 199034, Russia
[email protected]
Abstract. The paper is devoted to the study of linear and structural orders in the syntactic constructions of colloquial Russian speech. The quantitative and structural characteristics of predicate groups in utterances of oral speech are examined with the aim of revealing their typical structures and analysing them further in the sociolinguistic aspect. The paper describes the typical structures of predicate groups, presents their quantitative analysis and considers their correlation with speakers' social characteristics. The study was based on the material of the speech corpus 'One Day of Speech', the largest resource for studying spoken language, which is being developed at St. Petersburg State University. 11 macroepisodes of everyday communication were analyzed for 10 respondents, 5 men and 5 women, who are representatives of 6 different professional groups. Manual syntactic and automatic morphological annotation of predicate groups was carried out and their analysis was conducted. The data obtained were verified using statistical methods, yielding mathematically reliable conclusions such as: (1) the size of predicate groups does not depend on the sex of the speaker; (2) the average size of predicate groups in the speech of young people is greater than in that of the middle-aged; (3) the size of predicate groups changes primarily due to the left distance; (4) the size of the most highly ranked POS-tagged syntactic structures is only 1–2 elements; (5) the number of verbal predicate groups in female speech is 8% greater than that in male speech.

Keywords: Spoken Russian language · Syntax · Predicate groups · Everyday speech · Speech corpus
1 Problem Statement

The everyday speech of a person is influenced by a variety of factors that may refer not only to linguistics, but also to physiology, sociolinguistics, psycholinguistics, pragmatics, cognitive science, semiotics, and anthropology. The sociolinguistic aspect covers such sociological indicators as age, gender, profession, level of speech competence [1] for native speakers and level of language proficiency for foreigners, relations between speakers, and others. All this determines the need for an interdisciplinary approach in the study of everyday spoken speech [2–4] and, in particular, for the identification of the most significant sociological indicators. This problem has become one of the tasks of the given
research. What factors have the most significant effect on the way we speak? Does our speech depend only on ourselves, or is it also determined by factors that we cannot change? How exactly can our social status affect our speech? This article attempts to answer some of these questions.
2 Speech Material

As research material, the corpus of oral texts “One Day of Speech” (the ORD corpus) [5–8] was used. This project is being developed at the philological faculty of St. Petersburg State University in order to analyze everyday speech at various levels in an interdisciplinary aspect. The principle of audio recording follows from the name of the corpus: the informant records all of his/her speech communication during the day on a recorder. The material obtained in this way is much more natural than recordings made in a laboratory. The participants of the experiment also fill out social and psychological questionnaires, which allows a certain set of metadata to be entered into the corpus, including the informants' psycholinguistic and sociolinguistic characteristics. A lot of linguistic research has already been done on the material of this corpus (e.g., [9–12]). Today the ORD corpus has more than 1,250 h of audio recordings received from 128 informants and more than 1,000 of their communicants, representatives of various social groups. The research material consists of 2,800 macroepisodes of speech communication [13]. The ORD corpus gives the researcher an opportunity to study spontaneous everyday speech not only from a purely linguistic point of view, but also from the position of psycho- and sociolinguistics. For this study, a sample subcorpus was specially selected from the ORD corpus, in which gender and age social groups were balanced and various professional groups were represented. For this purpose, 11 macroepisodes referring to typical everyday settings were selected. Thus, the material of the study consists only of the urban speech of Saint Petersburg's citizens, and each gender, social and professional group includes 1–2 informants. All speech material was manually annotated at the syntactic level, and 830 predicate groups were identified. The annotation was made manually by one expert. The guidelines were formed from the rules for coding utterances proposed by P.V. Rebrova [14]. These rules were modified to meet the objectives of this study. Thus, a classification of categories was obtained, with the following notation: D – discursive words; F – phraseological units; Inf – infinitive; N – negation; Q – question words; S – subject (noun/pronoun); V – verb, conjugated form; Y – agreement; Z – other particles; A – attribute (adjective); B – adverbial modifier; M – addressing; H – negative particle “not”; O1 – direct object; O2 – indirect object; O3 – object with a preposition.

• poetomu *S/doma krasnogo netu // (woman, 28 years old, educator)
  therefore *laugh/there isn't red (marker) at home //
  CONJ2 B O1 PRED
The distribution of predicate groups by gender, age and professions is shown in Table 1. The proposed professional categories may intersect with each other, as one person may be engaged simultaneously in more than one group (e.g., a lecturer in philosophy would be assigned to both humanities and education).

Table 1. Distribution of predicate groups by gender, age and occupations.

Gender groups:
– Speech of male informants (332 predicate groups)
– Speech of female informants (498 predicate groups)

Age groups:
– Speech of informants up to 30 years inclusive (263 predicate groups)
– Speech of informants from 31 to 45 years (339 predicate groups)
– Speech of informants over 45 years (228 predicate groups)

Professional groups:
– Speech of IT specialists (153 predicate groups)
– Speech of employees in the education sphere (244 predicate groups)
– Speech of office workers (198 predicate groups)
– Speech of representatives of creative professions (88 predicate groups)
– Speech of representatives of the humanities (98 predicate groups)
– Speech of representatives of law enforcement (49 predicate groups)
Many scientists admit that in oral speech there are no unambiguous criteria for delimiting a sentence. For this reason, researchers propose other categories for the designation of “oral sentences”: statements [15], clauses [16], elementary discursive units [17] and others. In this paper, we decided to investigate oral speech from the point of view of predicate groups. The choice of the predicate group as the unit of study is explained by the fact that in this study we are oriented toward the syntactic aspect of speech. The concept of a predicate group is closest to the clause; however, not only the verb but also other parts of speech used in the role of the predicate can act here as the “sentence core”. So, in this article the predicate group is understood as a predicate and its environment, which is divided into syntactic elements. These can include, in particular, units formally independent of the predicate (for example, particles or interjections) which fall into the linear chain of words of the predicate group. A syntactic element is a graphical word, as well as such cases of fusion of graphic words that cannot be separated by the insertion of another word, phraseological fusions [18] and composite conjunctions. Various speech disfluencies, breaks, fillers, and non-verbal hesitations were not taken into account here as having only an indirect relation to the syntactic structure of utterances.
3 Research Methodology The data were analyzed by standard statistical methods. Each predicate group was manually annotated, after which its size, the left and right distances of each group were measured. With the help of statistical methods, the most frequent syntactic constructions for the sample as a whole and for each social group were identified. Analysis of the syntactic structure of predicate groups was carried out by means of the morphological parser TreeTagger. All predicate groups were automatically annotated, after which a manual correction of the results was carried out, and the frequency lists were compiled. Thus, 12 ranked lists of POS-tagged syntactic structures of predicate groups in Russian oral speech were created for: (1) all speakers; (2) men; (3) women; (4) informants up to 30 years; (5) informants from 31 to 45 years; (6) informants older than 46 years; (7) representatives of the humanities; (8) IT professionals; (9) employees in the education sphere; (10) office workers; (11) representatives of law enforcement and (12) representatives of creative professions. On the basis of the data obtained, a number of conclusions were made regarding possible trends in the speech of particular social groups. These findings were verified using statistical methods (such as the Student’s test and the Fisher’s test), which made it possible to identify unreliable conclusions from a mathematical point of view and to identify those that can be considered as statistically significant.
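The kind of significance check described above can be illustrated with a minimal sketch. The per-group arrays of predicate-group sizes below are placeholders, not the actual corpus counts, and the two-sample t-test is only one of the checks mentioned.

```python
# Hypothetical check: do two social groups differ in mean predicate-group size?
from scipy import stats

sizes_young = [5, 4, 6, 3, 5, 7, 4, 5, 6, 4]    # placeholder sizes, younger group
sizes_middle = [3, 4, 3, 5, 4, 3, 4, 3, 4, 5]   # placeholder sizes, middle group

t_stat, p_value = stats.ttest_ind(sizes_young, sizes_middle, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at p <= 0.01: {p_value <= 0.01}")
```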
4 Quantitative Analysis Using quantitative and statistical analysis, it was found that the average size of the predicate group is 4.28 elements. The average value of the left distance is 2.49 elements, the right – 0.8 elements. Thus, the average predicate group of Russian spontaneous oral speech can be quantitatively represented as 2:x:1 (where 2 is the number of elements before the predicate, x is a predicate and 1 is the number of elements after the predicate). The data obtained make it possible to calculate typical quantitative predicate groups for each social group presented in Table 2. The results obtained were compared within each social category: gender, age and professional sphere of informants. The study showed that the average predicate group size for men is 4.29 elements, the left distance is 2.44 elements, the right distance is 0.85 elements. Similar data for female speech are: 4.28 elements, 2.52 elements and 0.76 elements, respectively. Thus, a significant dependence of the quantitative characteristics of predicate groups on the sex of the speaker is not traced. When analyzing differences in the size of predicate groups for speakers from different age groups, it was found that in the middle age group the average predicate group size was the smallest (3.9 elements), and in the younger group – the largest (4.67 elements). At the same time, the difference in the size of predicate groups in all age groups is created exclusively due to the left distance with almost the same right (0.8, 0.81 and 0.76 elements for the younger, older and middle groups, respectively).
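A small sketch of how the left:x:right scheme could be derived from annotated material is given below; the (left, right) pairs are illustrative placeholders rather than real annotations, and the computation simply mirrors the definitions used above.

```python
# Each predicate group is reduced to (left distance, right distance) in elements.
groups = [(3, 1), (2, 0), (2, 1), (4, 1), (1, 0)]   # placeholder annotations

left = sum(l for l, _ in groups) / len(groups)
right = sum(r for _, r in groups) / len(groups)
size = left + right + 1                              # +1 for the predicate itself

print(f"avg size = {size:.2f}, scheme = {round(left)}:x:{round(right)}")
```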
Table 2. Schemes of typical predicate groups for each social group.

Social group | Scheme of a typical predicate group
Men | 2:x:1
Women | 3:x:1
Younger age group (up to 30 years) | 3:x:1
Middle age group (31–45 years) | 2:x:1
Senior age group (from 46 years) | 3:x:1
Humanitarians | 3:x:1
Employees in the education sphere | 3:x:1
IT-specialists | 2:x:1
Office workers | 2:x:1
Representatives of law enforcement | 3:x:0
Representatives of creative professions | 2:x:1
To determine the reliability of this conclusion, Student's test was used. When comparing the samples for the younger and middle age groups, the Student's t-criterion turned out to be in the significance zone at the critical level p ≤ 0.01, which means that the conclusions made on the basis of a comparison of these two age groups are reliable. The greatest differences were revealed in the speech of representatives of various professional groups. The maximum average size of the predicate group was, as expected, found among employees in the education sphere (4.8 elements) and humanitarians (4.7 elements), in comparison with informants from other professions (from 3.69 to 4.03 elements). Although the amount of material does not allow us to speak with certainty about any regularities that the revealed differences may indicate, it is possible at this stage to assume that professional affiliation has the greatest influence on the quantitative characteristics of the speaker's speech units. The rank distribution of the frequency of predicate groups also allowed us to see certain patterns. The most frequent predicate group in male speech consists of 5 elements, the second most frequent of 4, and the third of 3, while female speech shows the reverse picture. Whether this is a trend or an accident should be decided on a greater amount of material. Comparison of the frequency of predicate groups in the speech of informants of different age groups showed an interesting result. The most frequent predicate group in the younger age group consists of 5 elements, while the most frequent predicate groups in both the middle and the older age groups consist of 3 elements. The rank distribution of the size of predicate groups by frequency is generally similar for informants from 31 years old, which makes it possible to identify young people as tending to use broader predicate groups. Examples show that their size is most often achieved due to discursive words:

• to yest' kak by dazhe yesli ya podnimayu ruku (woman, 20 years old, humanitarian)
  that is, as it were, even if I raise my hand
• u nikh tseny tam voobshche // (man, 24 years old, office worker) they have prices there absolutely (low) // However, Student’s test showed that this output is statistically significant only for the pair youngest age group VS older age group. The greatest differences in the size of the predicate groups were found in the speech of different professional groups of speakers. The obtained results, however, should be checked on a larger sample of the material, since in calculating their reliability by the Student’s test, empirical values turn out to be in the zone of significance only when comparing those professional groups that are represented by the largest amount of material. Despite the fact that professional groups showed the most significant differences, it is difficult to conclude about their specific features, as the material for some groups is not enough to obtain reliable data. However, based on the data obtained, it is possible to propose a hypothesis that the professional factor affects the quantitative characteristics of speech units more strongly.
5 Structural Analysis

Not only quantitative but also qualitative characteristics of predicate groups are of interest when studying everyday syntax. First of all, it seems worthwhile to analyze the structure of predicate groups, in particular their POS-tagged syntactic structure. As a result of automatic annotation and further manual correction, a set of 830 POS-tagged structures was obtained. Predicate groups fall into two main types: (1) verbal predicate groups and (2) non-verbal predicate groups. In this study, we consider a predicate group as verbal if it has a verb as its core. The non-verbal predicate groups are those whose core is another part of speech (category of state, compound nominal predicate, etc.). Among the analyzed predicate groups, 658 (79%) were verbal and 172 (21%) were non-verbal, that is, their ratio for all speakers is 4:1. The deviations from this ratio in different social groups are of interest. The share of verbal predicate groups in female speech is slightly higher than in male speech (82% versus 74%), and the same indicators in different age groups were almost equal (81% in the older group, 79% in the younger and middle groups). The greatest difference in the share of verbal and non-verbal predicate groups is found among representatives of different professional groups (from 61% verbal predicate groups for representatives of law enforcement to 89% for office workers). The reliability of the findings was checked by statistical methods; for this, Student's test and Fisher's test were used. Checking the conclusion about the difference in the distribution of the types of predicate groups among age groups has shown that it cannot be considered reliable. Checking the conclusion about the difference in the distribution of the types of predicate groups between gender groups showed that it can be considered reliable: the empirical value of t by Student's test for these groups is 3.1 at a critical value of 2.56 (for p ≤ 0.01), and the empirical value of u by Fisher's test is 2.78 at a critical value of 2.31 (for p ≤ 0.01); therefore, both empirical values are in the significance zone, since they are higher than the critical values.
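A minimal sketch of this kind of proportion check is given below. It assumes that the "Fisher's test" used here is the angular (arcsine) transformation criterion φ*, which is consistent with the quoted critical value of 2.31 for p ≤ 0.01; the counts fed in are taken from the shares reported above and are only approximate.

```python
import math

def fisher_phi_star(p1, n1, p2, n2):
    """Angular transformation criterion for comparing two proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2) * math.sqrt(n1 * n2 / (n1 + n2))

# Share of verbal predicate groups: female speech vs. male speech.
phi_emp = fisher_phi_star(0.82, 498, 0.74, 332)
print(f"phi* = {phi_emp:.2f}, significant at p <= 0.01: {phi_emp > 2.31}")
```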
As for age and professional groups, Student's and Fisher's criteria allow us to compare only two samples, so these groups should be compared in pairs. The empirical value of t by Student's test is 0.6 and 0.3 for the younger–middle and middle–senior pairs respectively, at a critical value of 2.58 (for p ≤ 0.01), and the empirical value of u by Fisher's test is 0.425 and 0.292 for the same pairs respectively, at a critical value of 2.31 (for p ≤ 0.01); therefore all empirical values are outside the zone of significance. When checking the conclusion about the difference in the distribution of different types of predicate groups among professional social groups, it turned out that it can be considered reliable, but only for the pair representatives of law enforcement VS office workers. In addition, a ranked list of 610 found POS-tagged syntactic structures was compiled (its part is shown in Table 3).

Table 3. Frequency of POS-tagged syntactic structures of predicate groups.

# | Structure | Quantity | Percentage | Rank | Predicate group scheme | Example | Translation
1 | V | 45 | 5.42 | 1 | x | skhodite // | go //
2 | S-PRO V | 17 | 2.05 | 2 | 1:x | ya ponyala // | I understood //
3 | PART V | 16 | 1.93 | 3 | 1:x | ne znayu // | (I) don't know //
4 | CONJ S-PRO V | 15 | 1.81 | 4 | 2:x | chto ty noyesh'? | what do you want?
5 | PRAEDIC | 13 | 1.57 | 5 | x | prikol'no // | cool //
6 | S V | 12 | 1.45 | 6 | 1:x | eksport nakroyetsya // | export will break down //
7 | CONJ V | 10 | 1.2 | 7 | 1:x | yesli budet | if there will be
8 | ADVPRO V | 10 | 1.2 | 7 | 1:x | seychas () posmotryu / | (I)'ll look now /
9 | PART PART V | 9 | 1.08 | 8 | 2:x | nu ne znayu / | well (I) don't know /
Thus, the most frequent syntactic structures of predicate groups were identified. By type, there are 8 verbal and 1 non-verbal structures among them. By content, the top-ranked structure consists of one verb. The 3rd, 7th and 9th ranks belong to structures consisting of a verb with an auxiliary part of speech. The 2nd and 4th ranks belong to structures in which an object is added to the verb. It also turned out that the size of the top-ranked syntactic structures is only 1–2 elements, and their right distance is invariably zero. Within the structures of predicate groups, it is also interesting to make a comparative analysis of different social groups. This was done in the following way: a rating of the 10 most frequent syntactic structures was compiled for all speakers and for each social category, and then these ratings were compared.
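As a rough illustration of how such per-group ratings could be compiled, the sketch below counts POS-tagged structures per social group with a simple counter; the record list and group names are placeholders, not corpus data.

```python
from collections import Counter

# (POS-tagged structure, social group of the speaker) -- placeholder records
records = [("V", "office workers"), ("S-PRO V", "office workers"),
           ("V", "IT specialists"), ("PART V", "humanitarians"),
           ("V", "office workers")]

def top_structures(records, group, k=10):
    counts = Counter(struct for struct, g in records if g == group)
    return counts.most_common(k)

print(top_structures(records, "office workers"))
```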
The comparison of the 10 most frequent POS-tagged syntactic structures of predicate groups in different age groups also allows us to make some observations. These findings were also checked for reliability using Fisher's test. It turned out that only one conclusion can be considered statistically significant: the structure “PRAEDIC” is rather typical for the middle-aged group (rank II), but rarely occurs in the speech of youth (rank XVII). The significance of the other observations was not confirmed statistically. Comparison of professional groups also reveals some differences. According to Fisher's test, some conclusions were unreliable and some were in the zone of uncertainty. However, there were five statistically significant observations:
– The structure “V” is the most frequent in the professional groups of IT specialists and office workers, but is not typical for representatives of law enforcement.
– The structure “S-PRO V” has rank II in the speech of office workers, but is not typical for representatives of law enforcement.
– The structure “PART V” has high ranks in the speech of humanitarians and office workers (I and III respectively), while it is not typical for employees in the education sphere and representatives of creative professions.
– The structure “CONJ S-PRO V” is the most frequent in the professional group of employees in the education sphere, and at the same time it is not typical for humanitarians and representatives of creative professions.
– The structure “PRAEDIC” is the most frequent in the speech of representatives of creative professions, but it rarely occurs in the speech of employees in the education sphere, IT specialists and office workers.
Such a number of values in the zone of significance for professional social groups allows us to speak about the greatest degree of influence of professional affiliation on the structural organization of predicate groups in everyday speech.
6 Conclusions

The analysis of the data made it possible to come to a number of conclusions that were verified using statistical methods. According to them, the following observations can be considered reliable.
1. The quantitative characteristics of predicate groups in Russian oral speech, apparently, do not depend on the sex of the speaker.
2. The average size of predicate groups in the speech of young people is greater than in that of the middle-aged.
3. The size of predicate groups is changed primarily due to the left distance, while the right distance for all informants, regardless of sex, age and occupation, ranges from 0 to 1 element. This result confirms the hypothesis that left-branching verbal groups prevail in spoken Russian [19].
4. The size of the most highly ranked POS-tagged syntactic structures is only 1–2 elements, and their right distance equals 0.
5. The number of verbal predicate groups in female speech is 8% greater than that in male speech.
Besides, there are a number of other reliable observations, mostly about professional groups, concerning more specific matters. It should be mentioned, though, that these conclusions are based upon a rather small volume of sociolinguistic speech material. At this stage, not all the observations described above seem to be sufficient to identify the diagnostic features of the social groups under study, but they show well the potential of the methods used, by which such diagnostic features can be identified.

Acknowledgements. The research is supported by the Russian Foundation for Basic Research, project # 17-29-09175 “Diagnostic Features of Sociolinguistic Variation in Everyday Spoken Russian (based on the Material of Sound Corpus)”.
References 1. Bogdanova, N.: Uroven’ rechevoy kompetentsii kak real’naya sotsial’naya kharak-teristika govoryashchego, opredelyayushchaya yego rech’ [The level of speech competence as a real social characteristic of the speaker, which determines his speech]. In: Asinovskiy, A., Bogdanova, N. (eds.) XXXVIII Mezhdunarodnaya Filologicheskaya Konferentsiya [XXXVIII International Philological Conference] 2009, vol. 22, pp. 29–40. SaintPetersburg (2010). (in Russian) 2. Kanu, A.: Reflections in communications. An Interdisciplinary Approach. University Press of America, Lanham (2009) 3. Kreiman, J., Sidtis, S.: Foundations of voice studies. An Interdisciplinary Approach to Voice Production and Perception. Wiley, New York (2011) 4. Potapova, R., Potapov, V., Lebedeva, N., Agibalova, T.: Mezhdistsiplinar-nost’ v issledovanii rechevoy poliinformativnosti [Interdisciplinarity in the study of speech polyinformativity]. Yazyki slavyanskoy kul’tu-ry [World of Slavic Culture] 3, 82–95 (2016). (in Russian) 5. Asinovsky, A., Bogdanova, N., Rusakova, M., Ryko, A., Stepanova, S., Sherstinova, T.: The ORD speech corpus of Russian everyday communication “One Speaker’s Day”: creation principles and annotation. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2009. LNCS, vol. 5729, pp. 250–257. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-64204208-9_36 6. Bogdanova-Beglarian, N., et al.: Sociolinguistic extension of the ORD corpus of Russian everyday speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 659–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-31943958-7_80 7. Bogdanova-Beglarian, N., Sherstinova, T., Blinova, O., Martynenko, G.: An exploratory study on sociolinguistic variation of Russian everyday speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 100–107. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_11 8. Sherstinova, T.: Macro episodes of Russian everyday oral communication: towards pragmatic annotation of the ORD speech corpus. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 268–276. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7_33
9. Zobnina, E.: Perspektivy ispol’zovaniya zvukovogo korpusa “Odin rechevoy den’” v prepodavanii russkogo yazyka kak inostrannogo [Prospects for the use of the sound building “One Speech Day” in teaching Russian as a foreign language]. Mir russkogo slova [The world of the Russian word] 4, 99–103 (2009). (in Russian) 10. Bayeva, E.: O sposobax sociolingvisticheskoj balansirovki ustnogo korpusa [na primere “Odnogo rechevogo dn’a”) [On Means of Sociolinguistic Balancing of a Spoken Corpus (Based on the ORD corpus)]. Vestnik Permskogo universiteta. Rossijskaja i zarubezhnaja filologia [Perm University Herald. Russian and Foreign Philology] 4(28), 48–57 (2014). (in Russian) 11. Ermolova, O.: “Odin Rechevoy Den’” govoryashchego s tochki zreniya pragmatiki [“One speech day” of the speaker from the point of view of pragmatics]. Vestnik Permskogo universiteta. Rossijskaja i zarubezhnaja filologija [Perm University Herald. Russian and Foreign Philology] 3(27), 21–30 (2014). (in Russian) 12. Bogdanova-Beglaryan, N., et al.: Russkiy yazyk povsednevnogo obshcheniya: osobennosti funktsionirovaniya v raznykh sotsial’nykh gruppakh. Kollektivnaja monografija [Russian language of everyday communication: features of functioning in different social groups]. Layka [Laika], Saint-Petersburg (2016). (in Russian) 13. Bogdanova-Beglarian, N., Sherstinova, T., Blinova, O., Martynenko, G.: Linguistic features and sociolinguistic variability in everyday spoken Russian. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 503–511. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_50 14. Rebrova, P.V.: Strukturnyye i lineynyye poryadki v spontannoy rechi (na materiale korpusa « Odin rechevoy den’ »): dis…. mag. lingv [Structural and linear orders in spontaneous speech (on the material of the case “One Speech Day”): the dissertation of the master of linguistics] (typescript). Saint-Petersburg (2014). (in Russian) 15. Bakhtin, M.: Estetika slovesnogo tvorchestva [Aesthetics of verbal creativity]. Iskusstvo [Art], Moscow (1986). (in Russian) 16. Testelets, Y.: Vvedeniye v obshchiy sintaksis [Introduction to the general syntax]. RGGU, Moscow (2001). (in Russian) 17. Kibrik, A., Podlesskaya, V.: Rasskazy o snovidenijakh. Korpusnoe issledovanie ustnogo russkogo diskursa [Stories about dreams. Corpus study of Russian oral discourse]. Yazyki slavyanskikh kul’tur [Languages of Slavic cultures], Moscow (2009). (in Russian) 18. Vinogradov, V.: Izbrannyye trudy. Leksikologiya i leksikografiya [Selected works. Lexicology and lexicography]. Nauka [Art], Moscow (1977). (in Russian) 19. Bogdanova-Beglarian, N., Martynenko, G., Sherstinova, T.: The “One Day of Speech” corpus: phonetic and syntactic studies of everyday spoken Russian. In: Ronzhin, A., et al. (eds.) SPECOM 2015, LNAI, vol. 9319, pp. 429–437. Springer, Switzerland (2015). https:// doi.org/10.1007/978-3-319-23132-7_53
Building Real-Time Speech Recognition Without CMVN

Thai Son Nguyen, Matthias Sperber, Sebastian Stüker, and Alex Waibel
Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]
Abstract. Estimating cepstral mean and variance normalization (CMVN) in run-on and real-time settings poses several challenges. Using a moving average for variance and mean estimation requires a comparatively long history of data from a speaker which is not appropriate for short utterances or conversations. Using a pre-estimated global CMVN for speakers instead reduces the recognition performance due to potential mismatch between training and testing data. This paper investigates how to build a real-time run-on speech recognition system using acoustic features without applying CMVN. We propose a feature extraction architecture which can transform unnormalized log mel features to normalized bottleneck features without using historical data. We empirically show that mean and variance normalization is not critical for training neural networks on speech data. Using the proposed feature extraction, we achieved 4.1% word error rate reduction compared to global CMVN on the Skype conversations test set. We also reveal many cases when features without zero-mean can be learnt well by neural networks which stands in contrast to prior work.
Keywords: Real-time speech recognition · Feature normalization · Neural network

1 Introduction
Cepstral mean and variance normalization (CMVN) [22] and other normalization techniques (e.g., cepstral mean normalization (CMN) [7]) are widely adopted in many neural network speech recognition systems due to several advantages. First, these techniques, as shown in [22], make the recognizer more robust by canceling out environmental changes. Second, they help reduce the environment mismatch (e.g. background noises or microphones) between training and testing conditions. Last, the acoustic features after normalization have zero mean, which is found critical for neural network training [13]. In offline situations, CMVN is usually applied at the utterance level or, more ideally, at the speaker level when many utterances of the same speaker are available. However, these approaches are not appropriate for real-time situations,
because they require a certain amount of history to be available for the current speaker, and cannot handle unexpected speaker changes. Instead, mean and variance can be continuously computed over a moving window of some hundred frames (e.g., 3 s [1,17]). However, moving windows require the availability of historical data of at least a window-size, so that a delay must be introduced to handle the beginning of a new utterance. A third approach, computing mean and variance globally (e.g., [19,26]) for all training and test data, avoids the delay but reduces the recognition performance due to potential data mismatch. CMVN can also be recursively updated in real-time as in [17], but this approach does not handle multiple speakers. Peddinti et al. [14] proposed to use mel-frequency cepstral coefficients (MFCC) without normalization for real-time speech recognition, as currently implemented in the Kaldi toolkit [15]. In their approach, i-vectors [3] which supply the information about the mean offset of the speaker’s data are provided to every input so that the network itself can do feature normalization. However, i-vectors still require a certain amount of data of about 6 s per speaker. In this paper, we investigate and employ feature extraction approaches which exhibit comparable performance to CMVN but do not require speaker historical data and are therefore better suited for real-time situations. Our contributions are summarized as follows: – We contrast different CMVN methods and point out their respective advantages and limitations in a real-time feature extraction setting. We conclude that global CMVN is most desirable regarding real-time properties, although utterance- or speaker-based CMVN yield best recognition accuracy. – We propose to use a two-step transformation method that is empirically shown to transform unnormalized log mel-filterbank (FBANK) features into suitable acoustic model inputs, without requiring historical data of the current speaker. Using this transformation, we show that the acoustic models trained on the new feature domain significantly outperform global CMVN. – We identify a potential mismatch between training and testing data when acoustic models are trained on unnormalized data and propose to use data augmentation as a solution. We empirically show that retraining the feature extraction and the systems on a volume perturbation dataset can avoid the mismatch of audio volume and increase the recognition performance by up to 3.1%. – We also observe and discuss cases when features without zero-mean can be learnt well by neural networks, which stands in contrast to prior work. Without waiting for acoustic features being normalized correctly, a run-on speech recognition using our proposed feature extraction can process utterances of arbitrary length (or shortness). It can also handle situations where multiple speakers are sharing a single microphone.
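The sketch below contrasts two of the normalization schemes discussed above: global CMVN with precomputed corpus statistics versus a causal moving-window CMVN that only looks at past frames. The 300-frame window corresponds to roughly 3 s at a 10 ms frame shift; the feature matrix is a random placeholder.

```python
import numpy as np

def global_cmvn(feats, mean, std):
    # mean/std estimated once over the whole training corpus
    return (feats - mean) / (std + 1e-8)

def moving_window_cmvn(feats, window=300):
    out = np.empty_like(feats)
    for t in range(len(feats)):
        ctx = feats[max(0, t - window + 1):t + 1]          # past frames only
        out[t] = (feats[t] - ctx.mean(axis=0)) / (ctx.std(axis=0) + 1e-8)
    return out

feats = np.random.randn(500, 40) * 2.0 + 5.0               # dummy 40-dim FBANK-like frames
normed = moving_window_cmvn(feats)
```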
2 Improving Real-Time Feature Extraction
We propose performing two steps to learn robust feature extraction for real-time speech recognition systems. First, the traditional mel-filter features are transformed into the LDA domain and then fed into a bottleneck network to obtain final features which are value-normalized and easier to exploit by neural network models. Second, in order to increase the system's robustness, data augmentation is used for retraining both the feature extraction and the network model.
Fig. 1. Real-time feature extraction.
2.1 Using LDA Transformed Features
Without normalization, the most popular acoustic features such as MFCC or FBANK are problematic inputs for neural networks to learn. MFCC features usually span a wide range in every dimension, e.g., [−93, 363] on typical data, while FBANK features only have positive values, e.g. in the range [0, 11.66]. We attempt to find a transformed domain such that the transformation can be performed in real-time. Linear Discriminant Analysis (LDA) [4] is usually used for dimensionality reduction, but here we propose to use it only for feature transformation. Using LDA, we compute a d × d linear transformation matrix which
projects d-dimensional FBANK into a new domain with the same dimensionality. In this LDA domain, the features maintain the class-discriminatory information and can be mapped with their class-separability magnitudes according to the associated eigenvectors and eigenvalues. When used for dimensionality reduction, LDA is applied by keeping only the k (much smaller than d) features with the largest magnitudes. We, however, use all d-dimensional features in our models because we observed better system performance.
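A sketch of this full-rank LDA estimation (not the authors' implementation) is given below: the within- and between-class scatter matrices are built from labeled frames, and all d generalized eigenvectors are kept, so the transform changes the feature space without reducing its dimensionality. The dummy frames and labels are placeholders.

```python
import numpy as np
from scipy.linalg import eigh

def fit_lda_transform(X, y):
    d = X.shape[1]
    mean_total = X.mean(axis=0)
    Sw = np.zeros((d, d))                      # within-class scatter
    Sb = np.zeros((d, d))                      # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_total).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v; keep all d eigenvectors,
    # ordered by decreasing class separability.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1]]     # d x d transformation matrix

X = np.random.randn(2000, 40)                  # dummy FBANK frames
y = np.random.randint(0, 10, size=2000)        # dummy phone-state labels
X_lda = X @ fit_lda_transform(X, y)
```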
2.2 Using Normalized Bottleneck Features
As will be experimentally shown, optimizing single network models on unnormalized data can be hard. To deal with this situation, our idea is to train a first network model for extracting length-normalized features. Later we can use a second network to perform the real classification task. Figure 1 illustrates our proposed feature extraction architecture. The input of the network can be unnormalized FBANK or LDA-transformed features. We employ several rectifier [25] layers on top of the input layer, followed by a narrow (bottleneck) layer of 42 sigmoidal units. The two last layers, which will be discarded after training, include one rectifier layer and the final softmax. Since the training of this feature extraction optimizes phoneme classification, the extracted features at the bottleneck layer are supposed to be significant for class discrimination. When using a sigmoidal activation function, we obtain bottleneck features that are normalized to a small range, which can be handled more easily by the second network. We experimented with two sigmoidal functions: the logistic function, which has a range of [0, 1], and the hyperbolic tangent, which produces features in the range [−1, 1]. Different from [8,24], the proposed feature extraction is able to handle both normalized and unnormalized inputs. It does not suffer from vanishing gradients and does not need pre-training, which significantly reduces the training time. Applying this feature extraction in real-time can be considered as adding more hidden units to the classification network, which linearly increases the computation time (i.e. 25% in our experiments).
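A rough PyTorch sketch of such an extractor is shown below. Only the 42-unit sigmoidal bottleneck, the softmax over 8,000 context-dependent targets and the discarded top layers follow the description above; the number and width of the rectifier layers are assumptions for illustration.

```python
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    def __init__(self, input_dim=462, hidden=1600, bottleneck=42, n_targets=8000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.Sigmoid(),   # features normalized to [0, 1]
        )
        # Discarded after training; only used to drive phoneme classification.
        self.head = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, n_targets),                  # softmax applied inside the loss
        )

    def forward(self, x):                                  # training path
        return self.head(self.encoder(x))

    def extract(self, x):                                  # features for the second network
        return self.encoder(x)
```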
2.3 Increasing Robustness by Data Augmentation
As will be explored in this paper, the neural network systems trained on unnormalized features potentially need to deal with environment mismatch between training and testing. In speech recognition, mismatches such as different speech variations, background noises or microphones, can lead to a significant drop of recognition performance. In this paper, we analyze the robustness of our proposed feature extraction against the mismatch of audio volume conditions and improve it with the help of data augmentation. Data augmentation applied to speech recognition has been explored in many studies. In [10,16], corrupting clean training speech with noise improved the speech recognizer against noisy speech. Using vocal tract length perturbation [11] has shown gains on TIMIT. In [12,14], training with speed and volume perturbations datasets increased the system performance on several LVCSR tasks.
In this paper, we only consider data augmentation by performing volume perturbation.
3 Experimental Setup

3.1 Training and Test Data
In our experiments we used a large training set of 460 h. This dataset is the result of combining the TED-LIUM [18], Quaero [21] and Broadcast News [9] corpora. Our three evaluation sets include the TED-LIUM test set, tst2013 from the IWSLT evaluation campaign [2] and the English set from the MSLT corpus [5], which contains conversations over Skype. The volume perturbations were done as suggested by [14], where each recording was scaled with a random variable using sox. We set the random variable within the range [0.2, 2] for all recordings in the training data set. The perturbed recordings were then added to the original training set to form the augmented dataset. To investigate the robustness against volume mismatch, we used the ranges [0.2, 0.6] and [1.6, 2.0] for all recordings of the tst2013 set to create a perturbed test set.
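A minimal sketch of this perturbation step, assuming sox is installed and using placeholder file paths, is shown below; each recording gets one scaled copy with a gain drawn from [0.2, 2].

```python
import random
import subprocess

def perturb_volume(in_wav, out_wav, lo=0.2, hi=2.0):
    factor = random.uniform(lo, hi)
    # "sox -v FACTOR in.wav out.wav" scales the input amplitude by FACTOR
    subprocess.run(["sox", "-v", str(factor), in_wav, out_wav], check=True)
    return factor

perturb_volume("train/utt0001.wav", "train_vp/utt0001.wav")
```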
3.2 System Description
All the network models used roughly the same number of input features (i.e., 440 FBANK and 462 LDA or bottleneck features) and were trained using the cross-entropy loss function to predict 8,000 context-dependent phonemes. Rectifier networks were constructed of 6 hidden layers with 1,600 units per layer. For sigmoidal networks, we used 5 hidden layers of 2,000 units and performed pre-training with denoising auto-encoders [23]. For our convolutional neural network (CNN), we used the best architecture from [20], which includes two convolutional layers of 256 hidden units with filter size 9 and a max pooling size of 3, followed by 4 fully connected layers with 1,024 units. However, we did not use delta and delta-delta features, for consistent comparisons between models. The tests were performed with the Janus Recognition Toolkit (JRTK) [6] with a 4-gram language model and a vocabulary of more than 150,000 words.
4 Results

4.1 Using Normalized and Unnormalized Features
In Table 1, we compare the systems using different CMVN methods against various systems trained on unnormalized FBANK features. On our training data, CMVN system performance depends on the amount of available speaker historical data. Normalization at the speaker level yielded the best performance, followed by utterance-level normalization and normalization with windows 300 frames in length. The results on the perturbed test set show the interesting fact that these normalizations produce features robust to changes in audio volume.
Global CMVN is less optimal than the other normalizations (7.1% rel. increase in WER compared to speaker level). However, a real-time system may have to adopt this method in order to achieve acceptable latency. For the normalized features, the gap between sigmoidal and rectifier [25] networks appears small. However, when using the features without normalization, which have only positive values in a large range [0, 11.66], optimizing sigmoidal networks for good convergence becomes difficult. We had to reduce the initial learning rate by a factor of ten compared to normalized features. The training then converged at a poor local minimum and caused worse classification performance. The situation changed with the rectifier network. We were able to keep the same learning rate and the training converged with the same pattern. However, it suffers from a 7.3% rel. increase in WER compared to global CMVN. Switching to a CNN gave further improvements; however, its result is still not as good as that of the CMVN systems. These results demonstrate the difficulties when training single network models on unnormalized FBANK features. The increase in WER of the systems using unnormalized features and globally normalized features on the perturbed test set indicates that they may be sensitive to volume mismatch between training and test data.

Table 1. Word error rates of various systems using 40 log mel-filter bank features with and without CMVN.
CMVN | Network Type | tst2013 | tst2013-vp
Speaker | sigmoid | 15.5 | 15.5
Utterance | sigmoid | 15.8 | 15.8
Window | sigmoid | 16.2 | 16.4
Global | sigmoid | 16.6 | 17.3
Global | rectifier | 16.5 | 17.1
none | sigmoid | 22.3 | 23.2
none | rectifier | 17.7 | 18.0
none | rectifier (CNN) | 17.1 | 17.6

4.2 Using LDA Transformed Features
Table 2 compares the efficiency of different LDA transformations applied to unnormalized features. The conventional approach, which reduces the dimensionality of the 440 features of 11 consecutive frames down to 42 and then again stacks 11 frames, does not show clear improvements. When transforming 40 FBANK features without reduction and stacking 11 adjacent frames of LDA features as the network input, the systems improved. Further improvement was achieved when transforming 440 features of 11 consecutive frames via LDA and using
them as the network input. Interestingly, the transformed features, which are in the range [−14.95, 14.50] and do not have zero mean, are better than FBANK with global CMVN. When applying global mean and variance normalization again on these LDA features, the performance even got worse, showing that the normalization is unnecessary for this training data. The large degradation of performance on the perturbed test set (5.4% rel. in WER) indicates the need for a method of making LDA features robust against possible environment mismatch.

Table 2. The systems with LDA features.

LDA Feature | CMVN | DNN | tst2013 | tst2013-vp
Reduction | none | rectifier | 17.5 | 17.8
Full-40 | none | rectifier | 16.8 | 17.4
Full-440 | none | rectifier | 16.2 | 17.0
Full-440 | Global | rectifier | 16.5 | 17.2
Full-440 | none | sigmoid | 16.8 | 17.7

4.3 Using Normalized Bottleneck Features
The proposed bottleneck feature extraction shows its advantages when applied to both unnormalized FBANK and LDA features and produces improved features. The same networks trained on the bottleneck features showed relative reductions of 7.4% and 4.9%, as shown in Table 3. The extracted bottleneck features are in a normalized range of [0, 1] or [−1, 1], so a sigmoid network can be trained well, showing again that we do not need to apply mean normalization. When evaluating against the mismatched test set, we found that the extracted features are more stable to speech variations, indicating that the normalized bottleneck network may be automatically forced to learn robust features.

Table 3. The system with normalized bottleneck (BN).

Feature | BN Type | DNN | tst2013 | tst2013-vp
FBANK | sigmoid | rectifier | 16.4 | 16.6
FBANK | sigmoid | sigmoid | 16.5 | 16.8
LDA | sigmoid | rectifier | 15.5 | 15.8
LDA | tanh | rectifier | 15.5 | 15.8
4.4 Using Data Augmentation
When retraining the feature extraction and the systems on the augmented dataset, we obtained improvements on both test sets, as presented in Table 4. Now there are only small gaps between the two test sets, indicating the robustness of the models and the effectiveness of the proposed data augmentation. Retraining improves the recognition performance in general (i.e. 3.1% rel. for the bottleneck system using FBANK). We could only achieve small gains for the systems using LDA features. This can be the case because we only retrained the feature extractions and the systems without re-estimating the LDA transformation.

Table 4. The systems trained with data augmentation.
Feature | tst2013 | tst2013-vp
FBANK | 17.3 (2.3%) | 17.4 (3.3%)
LDA | 16.1 (0.6%) | 16.3 (4.7%)
BN-FBANK | 15.9 (3.1%) | 15.9 (4.2%)
BN-LDA | 15.4 (0.7%) | 15.6 (1.3%)

4.5 Comparison on Different Test Sets
Table 5 compares the results of our systems on two different test sets. The TED-LIUM set contains 11 TED talks, while the MSLT set is a collection of 3,000 utterances of recorded Skype conversations. There is no speaker information for the MSLT set, and more than half of the utterances are shorter than 3 s. In these different online domains, our proposed feature extraction can reduce the WER by 12.1% relative compared to global CMVN. Compared to another system of the same complexity which uses the bottleneck architecture from [8], we also achieve a significant improvement.

Table 5. Results on the TED-LIUM and MSLT test sets.
Feature | CMVN | TED-LIUM | MSLT2016
FBANK | Global | 9.8 | 33.9
BN [8] | Global | 9.2 | 30.9
FBANK | none | 10.3 | 35.0
BN-LDA | none | 8.7 | 29.8
5 Conclusions
We have presented a novel and effective feature extraction for real-time and run-on speech recognition. Our proposed two-step transformation is able to transform unnormalized log mel-filterbank features into useful value-normalized features. These features can be used directly for neural networks or Gaussian mixture models without further normalization. Applying this feature extraction approach removes the need for explicit normalization such as CMVN. Other real-time speech applications (such as speaker recognition) can also benefit from our method.
References 1. Alam, M.J., Ouellet, P., Kenny, P., O’Shaughnessy, D.: Comparative evaluation of feature normalization techniques for speaker verification. In: Travieso-Gonz´ alez, C.M., Alonso-Hern´ andez, J.B. (eds.) NOLISP 2011. LNCS (LNAI), vol. 7015, pp. 246–253. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-250200 32 2. Cettolo, M., Niehues, J., St¨ uker, S., Bentivogli, L., Frederico, M.: Report on the 10th IWSLT evaluation campaign. In: The International Workshop on Spoken Language Translation (IWSLT) 2013 (2013) 3. Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011) 4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience (2000) 5. Federmann, C., Lewis, W.D.: Microsoft speech language translation (MSLT) corpus: the IWSLT 2016 release for English, French and German. In: The International Workshop on Spoken Language Translation (IWSLT) 2016 (2016) 6. Finke, M., Geutner, P., Hild, H., Kemp, T., Ries, K.R., Westphal, M.: The karlsruhe VERBMOBIL speech recognition engine. In: Proceedings of ICASSP (1997) 7. Furui, S.: Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29(2), 254–272 (1981) 8. Gehring, J., Miao, Y., Metze, F., Waibel, A.: Extracting deep bottleneck features using stacked auto-encoders. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3377–3381. IEEE (2013) 9. Graff, D.: The 1996 broadcast news speech and language-model corpus. In: Proceedings of the DARPA Workshop on Spoken Language Technology (1997) 10. Hannun, A., et al.: Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014) 11. Jaitly, N., Hinton, G.E.: Vocal tract length perturbation (VTLP) improves speech recognition. In: Proceedings of ICML Workshop on Deep Learning for Audio, Speech and Language (2013) 12. Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for speech recognition. In: INTERSPEECH, pp. 3586–3589 (2015) 13. LeCun, Y.A., Bottou, L., Orr, G.B., M¨ uller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G.B., M¨ uller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 9–48. Springer, Heidelberg (2012). https://doi.org/ 10.1007/978-3-642-35289-8 3
14. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: INTERSPEECH, pp. 3214– 3218 (2015) 15. Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011 16. Prisyach, T., Mendelev, V., Ubskiy, D.: Data augmentation for training of noise robust acoustic models. In: Ignatov, D.I., et al. (eds.) AIST 2016. CCIS, vol. 661, pp. 17–25. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52920-2 2 17. Pujol, P., Macho, D., Nadeu, C.: On real-time mean-and-variance normalization of speech recognition features. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings, vol. 1, p. I. IEEE (2006) 18. Rousseau, A., Del´eglise, P., Est`eve, Y.: Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In: Proceedings of LREC (2014) 19. Sainath, T.N., Kingsbury, B., Mohamed, A.R., Ramabhadran, B.: Learning filter banks within a deep neural network framework. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 297–302. IEEE (2013) 20. Sainath, T.N., Mohamed, A.R., Kingsbury, B., Ramabhadran, B.: Deep convolutional neural networks for LVCSR. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614–8618. IEEE (2013) 21. St¨ uker, S., Kilgour, K., Kraft, F.: Quaero 2010 speech-to-text evaluation systems. In: Nagel, W., Kr¨ oner, D., Resch, M. (eds.) High Performance Computing in Science and Engineering ’11. Springer, Heidelberg (2012). https://doi.org/10.1007/ 978-3-642-23869-7 44 22. Viikki, O., Laurila, K.: Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Commun. 25(1), 133–147 (1998) 23. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: The 25th International Conference on Machine Learning, pp. 1096–1103. ACM (2008) 24. Yu, D., Seltzer, M.L.: Improved bottleneck features using pretrained deep neural networks. In: Interspeech, vol. 237, p. 240 (2011) 25. Zeiler, M.D., et al.: On rectified linear units for speech processing. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3517–3521. IEEE (2013) 26. Zeyer, A., Schl¨ uter, R., Ney, H.: Towards online-recognition with deep bidirectional LSTM acoustic models. In: Interspeech 2016, pp. 3424–3428 (2016)
Choice of Signal Short-Term Energy Parameter for Assessing Speech Intelligibility in the Process of Speech Rehabilitation

Dariya Novokhrestova, Evgeny Kostyuchenko, and Roman Meshcheryakov
Tomsk State University of Control Systems and Radioelectronics, Tomsk, Russia
[email protected], [email protected]
http://www.tusur.ru
Abstract. The article describes an approach to assessing the intelligibility of speech in the process of speech rehabilitation by finding a measure of the similarity between the standard and the distorted pronunciation of phonemes. The approach is based on the calculation of the correlation coefficient between the transformed signal envelopes. The envelope of the signal is constructed on the basis of the short-term energy of the signal. The selection of the short-term energy parameter (window size) is also described. The parameter selection is based on comparing the differences between the correlation coefficients for pairs with normal pronunciation and pairs with distorted pronunciation, calculated for different window sizes. Window sizes are selected for each problem phoneme.

Keywords: Correlation · Cancer of the oral cavity and oropharynx · Speech quality criteria

1 Introduction
The urgency of developing rehabilitation techniques after surgical treatment of oncological diseases of the organs of the speech-forming apparatus is confirmed by the growing number of cases detected every year. In 2016, about 25,000 new cases were identified, and the total number of patients with cancer of this location is currently more than 100,000 [1,2]. Currently, rehabilitation is carried out according to GOST R 50840-95 [3], whose shortcoming is the subjective assessment of speech intelligibility. Within the development of new methods of speech rehabilitation, one of the important tasks is the development of an automated system for assessing patients' speech intelligibility that would avoid subjective evaluation. Such an evaluation may be obtained by comparing the patient's reference pronunciation of syllables with the evaluated one, and the dynamics of rehabilitation based on comparing
the estimates of different sessions is of interest. The reference is the speech of the same patient before surgery. In the previous stages of the study, an approach for the formation of such an estimate based on the correlation coefficient was described [4]. In [5], an approach was proposed for automatic time normalization (in view of the impossibility of comparing two signals of different lengths) on the basis of a dynamic transformation of the time scale (dynamic time warping - DTW), and an attempt to apply smoothing to the energy of the signal was made. This article describes an approach to the evaluation of speech intelligibility through the calculation of the correlation coefficient using the algorithm of the dynamic time warping applied to the envelopes of signals and the choice of the envelope parameter. As the envelope, the short-term energy values found are used. The correlation value is analysed depending on the window size when calculating the short-term energy.
2 Description of the Algorithm

2.1 Description of the Input Data
At this stage of the study, the intelligibility assessment was carried out only for the problematic phonemes defined earlier [6]. For this, 3 sets of records were made by a healthy speaker: 2 sets of records with normal pronunciation of the phonemes in syllables (normal phonemes) and 1 set of recordings with pronunciation of the problematic phonemes without using the tongue (modified phonemes). Each set contains 90 records of different syllables: 6 problem phonemes [k], [k'], [s], [s'], [t], [t'], with 15 syllables for each (5 syllables with the phoneme at the beginning of the syllable, 5 syllables with the phoneme in the middle of the syllable, 5 syllables with the phoneme at the end of the syllable); a complete list of phonemes is also presented in [6].
2.2 Short-Term Energy
The applicability of short-term energy in speech analysis is described in [7]. Short-term energy characterizes the signal energy within a window of size N and is defined as

E_T = \sum_{n=T}^{T+N-1} (s(n) \cdot w(n))^2    (1)
where s(n) is the amplitude of the signal at sample n and w(n) is the window function. This algorithm uses a rectangular window, that is, w(n) = 1. The overlap between consecutive windows is N − 1.
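For illustration, a minimal Python sketch of Eq. (1) is given below; it assumes the signal is already loaded as an array of amplitude values, and the function name is ours, not the authors'.

```python
import numpy as np

def short_term_energy(signal, window_size):
    """Short-term energy E_T of Eq. (1): rectangular window w(n) = 1,
    shifted by one sample so that consecutive windows overlap by N - 1."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) - window_size + 1
    if n_frames <= 0:
        raise ValueError("signal is shorter than the analysis window")
    return np.array([np.sum(signal[t:t + window_size] ** 2) for t in range(n_frames)])
```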
2.3 Description of the Sequence of Actions
The algorithm can be described as a sequence of steps.
1. Segmentation of the phoneme in the syllable. In each recording, the problematic phoneme was singled out; the segmentation was carried out manually by listening and by analysing the oscillogram and the spectrogram of the signal. The phonemes were cut into separate sound files, and further work was carried out with these files.
2. Transformation of the phoneme recordings into sequences of signal amplitude values.
3. Finding the values of the signal short-term energy with window size N for the three realizations of the phoneme in the syllable (phonemes from the same syllable taken from the different sets of recordings).
4. Application of the DTW algorithm to each pair of the obtained arrays of short-term energy values. The DTW algorithm itself is described in [8].
5. Finding the correlation coefficients between the transformed short-term energy sequences: R_{nj} (the correlation coefficient for the normal pronunciations of the phoneme from the j-th syllable), and R1_{ij} and R2_{ij} (the correlation coefficients for the normal pronunciation - distorted pronunciation pairs for the phoneme from the j-th syllable). Steps 2-5 are repeated for each phoneme with the same location in the syllable (for example, the phoneme [s] at the beginning of the syllable), so that 5 values of R_{nj}, R1_{ij} and R2_{ij} are found for each location of the phoneme in the syllables (a sketch of steps 3-7 is given after this list).
6. Finding the average values R_n (2), R1_i (3) and R2_i (4), that is, the average value for the normal-normal pairs and the average values for the distorted-normal pairs:

R_n = \frac{1}{5} \sum_{j=1}^{5} R_{nj}     (2)

R1_i = \frac{1}{5} \sum_{j=1}^{5} R1_{ij}     (3)

R2_i = \frac{1}{5} \sum_{j=1}^{5} R2_{ij}     (4)

7. Finding d, the ratio of the arithmetic mean of the average correlation coefficients for the pairs with changed pronunciation to the average correlation coefficient for normal pronunciation:

d = \frac{R1_i + R2_i}{2 R_n}     (5)

Steps 2-7 are repeated for window sizes N from 10 to 300 for the phonemes [k], [k’], [s] and [s’] and from 10 to 250 for the phonemes [t] and [t’], in increments of 10. The window size is usually taken as 10-20 ms, depending on the data analysed. At a sample rate of 16000 Hz this corresponds to 160-320 samples, but since the duration of the phonemes [t] and [t’] is 0.04-0.06 s for normal pronunciation, a window longer than 15 ms strongly distorts the signal data for them.
8. Construction of the approximating quadratic function f(N) by the method of least squares [9] to analyse the dependence of the values of d on the window size N.
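A rough sketch of steps 3-7 for a single phoneme location is shown below. It re-uses the short_term_energy helper sketched in Sect. 2.2 and employs the fastdtw package only as a convenient stand-in for the DTW algorithm of [8]; the input lists of recordings and all names are assumptions of ours.

```python
import numpy as np
from fastdtw import fastdtw  # stand-in for the DTW algorithm of [8]

def aligned_correlation(env_a, env_b):
    """DTW-align two short-term energy envelopes (step 4) and return the
    Pearson correlation coefficient of the warped sequences (step 5)."""
    _, path = fastdtw(env_a, env_b, dist=lambda a, b: abs(a - b))
    a = np.array([env_a[i] for i, _ in path])
    b = np.array([env_b[j] for _, j in path])
    return np.corrcoef(a, b)[0, 1]

def ratio_d(normal_set_1, normal_set_2, distorted_set, window_size):
    """Steps 3-7 for one phoneme location: each argument is a list of five
    phoneme recordings (amplitude arrays), one per syllable."""
    r_n, r_1, r_2 = [], [], []
    for s1, s2, sd in zip(normal_set_1, normal_set_2, distorted_set):
        e1 = short_term_energy(s1, window_size)
        e2 = short_term_energy(s2, window_size)
        ed = short_term_energy(sd, window_size)
        r_n.append(aligned_correlation(e1, e2))  # normal-normal pair, R_nj
        r_1.append(aligned_correlation(e1, ed))  # normal-distorted pairs, R1_ij
        r_2.append(aligned_correlation(e2, ed))  # R2_ij
    # Eqs. (2)-(5): averages of the pair correlations and the ratio d
    return (np.mean(r_1) + np.mean(r_2)) / (2 * np.mean(r_n))
```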
3 Results and Discussion
The average values of the correlation coefficients for each of the pairs (R_n, R1_i, R2_i), as well as the ratio d of the arithmetic mean of the correlation coefficients for the distorted-normal pairs to the mean correlation coefficient between normal phonemes, were plotted against the window size of the short-term energy for each location of the problematic phonemes in the syllable. Also, for this ratio, an approximating quadratic function was determined by the least squares method [9] (a sketch of this fit is given below). Let us consider each of the phonemes in turn. Since similar results are obtained for the hard and soft realizations of the phonemes, these realizations are considered and analysed together. Those window sizes N for which the value of d is less than the value of d for N = 1 will be considered suitable window sizes for constructing the signal envelope. In the graphs below, the average correlation coefficients are plotted along the main axis (left scale), and the values of d and the approximating function f(N) are plotted along the auxiliary axis (right scale).
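A possible realisation of the least-squares fit of step 8 with NumPy is sketched below; the grid of window sizes and the corresponding d values are assumed to have been computed beforehand (e.g. with the ratio_d sketch above).

```python
import numpy as np

def fit_quadratic(window_sizes, d_values):
    """Least-squares quadratic approximation f(N) of the d(N) dependence (step 8)."""
    coeffs = np.polyfit(window_sizes, d_values, deg=2)
    f = np.poly1d(coeffs)                              # approximating function f(N)
    n_best = window_sizes[np.argmin(f(window_sizes))]  # grid point minimising f
    return f, n_best
```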
3.1 Phonemes [k] and [k’]
Figure 1 shows the resulting values for different locations of the phonemes [k] and [k’] in the syllables. For the phoneme [k] at the beginning of the syllable, a significant decrease in the values of d with increasing window size N is observed. However, with a window size greater than 150, one of the average correlation coefficients for a distorted pronunciation - normal pronunciation pair becomes negative, which indicates an inverse relationship that is difficult to interpret within the scope of the problem of assessing the quality of pronunciation of syllables. For the phoneme [k’] at the beginning of the syllable, according to the approximating function, the minimum values of d are attained at the minimum and maximum values of the window size N. However, looking at the values of d themselves, they are less than d for N = 1 only at N from 260 upwards. The minimum value of the approximating function at the beginning of the syllable is at N = 300 both for the phoneme [k] and for the phoneme [k’]. When the phonemes [k] and [k’] are located in the middle of the syllable, similar patterns are observed: with an increase in the size of the short-term energy window, the value of d decreases. All values of d are less than 1, but for N from 10 to 60 for the phoneme [k] and for N from 10 to 20 for the phoneme [k’], the values of d are greater than the initial value of d for N = 1. For the phoneme [k’] with a window size N greater than 150, one of the correlation coefficients for the distorted-normal pairs becomes negative. The minimum value of the approximating function in the middle of the syllable is at N = 300 both for the phoneme [k] and for the phoneme [k’]. When the phonemes [k] and [k’] are located at the end of the syllable, similar patterns are also observed. From the form of the approximating function it can be said that, as the window size increases, the value of d decreases. The values of d are less than the original value for N from 190 for the phoneme [k] and
for N from 200 for the phoneme [k’]. The minimum value of the approximating function at the end of the syllable is at N = 300 for both [k] and [k’]. Based on the obtained results, it can be concluded that for the phonemes [k] and [k’], when constructing an envelope based on short-term energy, a window size N of about 20 ms, i.e. about 300 samples, can be selected.
Fig. 1. Results for phonemes [k] and [k’] - (a), (b), (c) and (d), (e), (f) respectively (a, d - phoneme at the beginning of the syllable; b, e - phoneme in the middle of the syllable; c, f - phoneme at the end of the syllable). x - window size, y - result quality estimation (0-1).
3.2 Phonemes [s] and [s’]
Figure 2 shows the resulting values for different locations of the phonemes [s] and [s’] in the syllables. Even though these phonemes were also combined for the analysis, similar values were obtained only for the location of the phoneme at the end of the syllable. If the phoneme [s] is located at the beginning of the syllable, judging by the form of the approximating function and the obtained values of d, the minimum values are attained at N from 120 to 150, and the minimum of the function is observed at N = 135. Also, with N greater than 270, there is a sharp decrease in the correlation coefficient for the pair of normal phoneme pronunciations. If the phoneme [s’] is positioned at the beginning of the syllable, the value of d
increases with increasing window size. For this arrangement of the phoneme, constructing the envelope on the basis of short-term energy will not lead to an increase in the difference between the normal and distorted pronunciation of the phoneme. The minimum of the approximating function f(N) on the investigated segment is observed at N = 1. Constructing the envelope based on short-term energy for the phoneme [s] in the middle of the syllable also does not lead to an improvement in the results, since the value of d also increases with increasing window size. The minimum of the approximating function is observed at N = 7. For the phoneme [s’] in the middle of the syllable, on the contrary, the value of d decreases as the window size increases, and all the obtained values of d are less than d for N = 1. The minimum of the approximating function on the investigated segment is observed at N = 300. For the phonemes [s] and [s’] at the end of the syllable, no obvious decrease in the values of d is observed as the short-term energy window is increased. According to the approximating function, the minimum values of d for the phoneme [s] and for the phoneme [s’] are at N = 165 and N = 134, respectively.
Fig. 2. Results for phonemes [s] and [s’] - (a), (b), (c) and (d), (e), (f) respectively (a, d - phoneme at the beginning of the syllable; b, e - phoneme in the middle of the syllable; c, f - phoneme at the end of the syllable).
3.3 Phonemes [t] and [t’]
Figure 3 shows the resulting values for different locations of the phonemes [t] and [t’] in the syllables. When the phonemes are located at the beginning of the syllable, in spite of the different shapes of the approximating functions, the values of d are less than the value of d for N = 1 for all window sizes. For the phoneme [t], the minimum of the approximating function, as well as the minimum values of d, are observed at N = 151. For the phoneme [t’], although the minimum of the approximating function is at N = 1, the minimum values of d are observed at N from 30 to 50. When the phoneme [t] is located in the middle of the syllable, the values of d decrease with increasing window size; however, after the minimum of the function at N = 65 is reached, the values begin to increase. Still, all the obtained values are less than the value of d for N = 1. The minimum values of d are observed at N equal to 10, 20, 50 and 60. For the phoneme [t’] in the middle of the syllable, all d values are greater than the original value, despite the form of the approximating function; the minimum of the function is observed at N = 300. When the phonemes [t] and [t’] are located at the end of the syllable, the minimum values are observed at N = 1, and the value of d increases as the window size increases.
Fig. 3. Results for phonemes [t] and [t’] - (a), (b), (c) and (d), (e), (f) respectively (a, d - phoneme at the beginning of the syllable, b, e - phoneme in the middle of the syllable, c, f - phoneme at the end of the syllable).
4 Conclusion
After analysing the obtained data, one can conclude that it is impossible to select a single parameter for all problem phonemes for constructing an envelope based on short-term energy. For the phonemes [k] and [k’], the maximum difference between the pair with normal pronunciation and the pairs with changed pronunciation (i.e. the minimum value of d) is achieved with a window size of 20 ms. For the phonemes [s], [s’], [t] and [t’], the window sizes for which the minimum values of d, reflecting the similarity of the normal and changed pronunciation, are achieved were also determined. For the phoneme [s] at the beginning of the syllable, the window size with the minimum value is 135; for the same phoneme in the middle and at the end of the syllable, the window sizes are 1 and 165, respectively. For the phoneme [s’] at the beginning, middle and end of the syllable, the window sizes are 1, 300 and 134, respectively. For the phonemes [t] and [t’] at the end of the syllable, the best result is obtained with the window size N = 1. The window size for [t] at the beginning of the syllable is N = 151 and in the middle N = 65; for [t’] at the beginning and in the middle of the syllable, N = 40 and N = 300, respectively. An approach to the evaluation of speech intelligibility based on the automatic calculation of the correlation coefficient between the transformed short-term energy envelopes of recordings with normal and distorted phoneme pronunciation has been described. This approach can be applied in the process of speech rehabilitation after surgical treatment of oncological diseases of the speech-forming apparatus. In the next step of the work, a combination of DTW and deep learning will be analysed for segmentation and recognition of syllables in the speech rehabilitation task [10].
Acknowledgements. Supported by a grant from the Russian Science Foundation (project 16-15-00038).
References
1. Kaprin, A.D., Starinskiy, V.V., Petrova, G.V.: Status of cancer care of the population of Russia in 2016. P.A. Hertsen Moscow Oncology Research Center - branch of FSBI NMRRC of the Ministry of Health of Russia, Moscow (2018)
2. Kaprin, A.D., Starinskiy, V.V., Petrova, G.V.: Malignancies in Russia in 2014 (morbidity and mortality). P.A. Hertsen Moscow Oncology Research Center - branch of FSBI NMRRC of the Ministry of Health of Russia, Moscow (2017)
3. Standard GOST R 50840-95: Voice over paths of communication. Methods for assessing the quality, legibility and recognition. Publishing Standards, Moscow (1995)
4. Kostyuchenko, E., Meshcheryakov, R., Ignatieva, D., Pyatkov, A., Choynzonov, E., Balatskaya, L.: Correlation normalization of syllables and comparative evaluation of pronunciation quality in speech rehabilitation. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 262-271. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_25
5. Novokhrestova, D.: Time normalization of syllables with the dynamic time warping algorithm in assessing of syllables pronunciation quality when speaking. Proc. TUSUR 4(20), 142-145 (2017). https://doi.org/10.21293/1818-0442-2017-20-4-142-145
6. Kostyuchenko, E., Ignatieva, D., Meshcheryakov, R., Pyatkov, A., Choynzonov, E., Balatskaya, L.: Model of system quality assessment pronouncing phonemes. In: Dynamics of Systems, Mechanisms and Machines (Dynamics), Omsk (2016). https://doi.org/10.1109/Dynamics.2016.7819016
7. Bachu, R., Kopparthi, S., Adapa, B., Barkana, B.: Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy. In: Elleithy, K. (ed.) Advanced Techniques in Computing Sciences and Software Engineering, pp. 279-282. Springer, Dordrecht (2010). https://doi.org/10.1007/978-90-481-3660-5_47
8. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 43-49 (1978). https://doi.org/10.1109/tassp.1978.1163055
9. Legendre, A.-M.: New Methods for the Determination of the Orbits of Comets. F. Didot, Paris (1805)
10. Kipyatkova, I.S., Karpov, A.A.: Variants of deep artificial neural networks for speech recognition systems. Proc. SPIIRAS 49(6), 80-103 (2016). https://doi.org/10.15622/sp.49.5
The Benefit of Document Embedding in Unsupervised Document Classification
Jaromír Novotný and Pavel Ircing
Faculty of Applied Sciences, Department of Cybernetics, The University of West Bohemia, Plzeň, Czech Republic
{fallout7,ircing}@kky.zcu.cz
http://www.kky.zcu.cz/en
Abstract. The aim of this article is to show that the document embedding using the doc2vec algorithm can substantially improve the performance of the standard method for unsupervised document classification – the K-means clustering. We have performed a rather extensive set of experiments on one English and two Czech datasets and the results suggest that representing the documents using vectors generated by the doc2vec algorithm brings a consistent improvement across languages and datasets. The English dataset – 20NewsGroups – was processed in a way that allows direct comparison with the results of both supervised and unsupervised algorithms published previously. Such a comparison is provided in the paper, together with the results of supervised classification achieved by the state-of-the-art SVM classifier.
Keywords: Document embedding · Doc2vec · Classification · K-means · SVM
1 Introduction
It is generally accepted that even such a simple unsupervised algorithm as the classic K-means achieves surprisingly good classification results, if it is presented with appropriate feature vectors. Our previous research [8] confirmed that the well-established tf-idf vectors work rather well. The aim of the work presented in this paper was to test whether the recently introduced document embeddings produced by the doc2vec method [2,4,15] can further improve the performance.
2 Datasets
As our basic dataset, we have again picked the 20NewsGroups English corpus, which is widely used as a benchmark for document classification [1,3,5,7,8,11,12]. (This data set can be found at http://qwone.com/~jason/20Newsgroups/ and was originally collected by Ken Lang.)
It contains 20 000 text documents which are evenly divided into 20 categories, each containing discussions about a specific topic. The second data set, CNO, and all its sub-sets are in the Czech language. It contains approximately 68 000 articles divided into 31 categories (it was created from a database of news articles downloaded from http://www.ceskenoviny.cz/ at the University of West Bohemia and constitutes only a small fraction of the entire database – the description of the full database can be found in [14]). This corpus was created so that it is comparable to the English data set at least in size and partially also in topics. The third group of data sets – TC and Large TC – consists of transcriptions of phone calls from the Language Consulting Center (LCC) of the Czech Language Institute of the Academy of Sciences of the Czech Republic, which provides a unique language consultancy service in matters of the Czech language. The counselors of the LCC answer questions regarding Czech language problems on a telephone line open to public calls. The data gathered from these language queries are unique in several aspects. The Language Consulting Center deals with completely new language material, so it is the only source of advice for new language problems. It also records peripheral matters that will never be explained in dictionaries and grammar books, as these are focused on the core of the language system. In order to compare our results with the ones published previously, we have re-created two subdivisions of the 20NewsGroups corpus. The first one is created according to [12] and was also used in our previous work [8], where it is described in more detail. The other subdivision is created in order to compare the results with the experiments described in [1,3]. The 20NG1 data sub-set consists of the 5 new categories (according to [1]) created by joining original ones as follows: Motorcycle – Motorcycle and Autos; Hardware – Windows and MAC; Sports – Baseball and Hockey; Graphics – Computer graphics; Religion – Christianity, Atheism and misc. Furthermore, they divided this sub-set into three training and testing data sets, using [50, 200, 350] documents as test data and the rest as training data. The 20NG2 input is the whole unchanged 20NewsGroups corpus divided into training (13 000 documents) and testing (approximately 7 000 documents) data (the same division as in [3]). The results achieved on the CNO and TC sets and sub-sets cannot be directly compared with the results of other research teams as the data are not (yet) made publicly available. However, these data are important for our own research and we decided to publish the results here to show some important properties of the doc2vec embedding (see the discussion below). From the first Czech data set – CNO – we have created the following subsets:
– Set CNO consists of all 31 original categories. This results in approximately 68 000 documents in total.
– Set RCNO1 consists of the 11 original categories which contain at least 1000 documents.
– Set RCNO2 consists of the 10 original categories containing between 500 and 1500 documents.
– Set RCNO3 is created from 12 categories, each containing 1000 documents randomly chosen from the original categories. This set is created so as to be similar to the 20NewsGroups corpus.
The TC and Large TC data sets were created from a corpus obtained by the LCC. These data sets consist of 607 manually transcribed parts of historical mono phone calls (each call can contain more than one part, each part with different questions about a different topic) and 3128 parts of recent stereo phone calls transcribed automatically by an ASR system created by colleagues at the University of West Bohemia, all divided into 20 categories by their topic. These 20 categories were manually assigned by counselors from the LCC (for example "semantics" or "lexicology") and correspond to the higher level of the linguistic topic tree. The division of phone calls into categories is not uniform; some categories contain only a few parts. The setting is based on previous findings. TC consists of the mentioned 20 categories containing 3713 transcribed text parts of the phone calls. Some of the categories are formed from a small number of texts (for example only 10); we responded to that by creating the Large TC data consisting of 10 original categories (3343 transcribed text parts) where each contains at least 100 text parts.
3 Preprocessing
The first processing step applies only to the 20NewsGroups data, where we removed all the headers except for the Subject. Then all uppercase characters were lowercased and all digits were replaced by one universal symbol. As the next processing step, we wanted to conflate different morphological forms of a given word into one representation; we opted for lemmatization. The MorphoDiTa [13] tool was picked for the task – it works for both English and Czech and is available as a Python package (ufal.morphodita at https://pypi.python.org/pypi/ufal.morphodita). A data-driven variant of the traditional stop-word removal is a further preprocessing operation performed in this paper: only the top T lemmas with the highest mutual information (MI) are kept. After applying all these processing steps, we can create the following vector representations.
3.1 Representation by TF-IDF Weights
A common representation in text processing tasks is TF-IDF weights – i.e. the combination of Term Frequency (TF) and Inverse Document Frequency (IDF) weights. The well-known formula for computing the TF-IDF weight w_{l,d} for a lemma l ∈ L and a document d ∈ D is

w_{l,d} = tf_{l,d} \cdot idf_l     (1)

where tf_{l,d} denotes the number of times the lemma l occurs in document d and idf_l is computed using the formula

idf_l = \frac{N}{N(l)}     (2)

where N is the total number of documents and N(l) denotes the number of documents containing the lemma l. In essentially all further experiments we use the Python package sklearn [9] (more precisely, its TfidfVectorizer module) for computing the TF-IDF weights.
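A minimal sketch of this step is shown below. Note that sklearn's TfidfVectorizer applies a smoothed logarithmic IDF by default rather than the plain ratio of Eq. (2), so the exact weighting scheme, the helper name and the 5000-term limit (taken from the experiment labels in Sect. 6) are assumptions here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_vectors(lemmatized_docs, vocab_size=5000):
    """lemmatized_docs: one whitespace-joined string of kept lemmas per document."""
    vectorizer = TfidfVectorizer(max_features=vocab_size)
    return vectorizer.fit_transform(lemmatized_docs)  # sparse document-by-lemma matrix
```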
3.2 Representation by Doc2vec Weights
According to [4], the doc2vec representation is a simple extension of word2vec: it embeds word sequences into vectors, where the input can be n-grams, sentences, paragraphs or whole documents. This type of representation is considered state-of-the-art for sentiment analysis, which is essentially also a classification task, so there was a good chance that it would help in our task as well. In this paper we use the doc2vec implementation in the Gensim package [10] for Python. The input data are pairs consisting of the representation obtained in Sect. 3 and the label of the given document. The output is then a matrix of doc2vec weights, where every row corresponds to a specific document.
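A rough sketch of the doc2vec step with Gensim is given below; the vector size follows the 5000-dimensional setting mentioned in Sect. 6, while the remaining hyper-parameters and names are illustrative guesses rather than the authors' settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def doc2vec_vectors(tokens_per_doc, size=5000, epochs=40):
    """tokens_per_doc: one list of lemmas per document; returns one vector per document."""
    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokens_per_doc)]
    model = Doc2Vec(vector_size=size, min_count=2, epochs=epochs)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    return [model.dv[i] for i in range(len(tagged))]  # model.docvecs in Gensim < 4.0
```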
3.3 Use of LSA Reduction on Representations 3.1 and 3.2
We have also tried to further reduce the dimension of the vector representations described in 3.1 and 3.2 by the Latent Semantic Analysis (LSA) and consequently analyze the effect on the classification accuracy. The LSA method is implemented in the Python package sklearn – the module TruncatedSVD.
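A sketch of the reduction step, assuming the 200-component setting reported for the reduced vectors in Sect. 6:

```python
from sklearn.decomposition import TruncatedSVD

def lsa_reduce(feature_matrix, n_components=200):
    """Reduce TF-IDF or doc2vec document vectors with truncated SVD (LSA)."""
    return TruncatedSVD(n_components=n_components).fit_transform(feature_matrix)
```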
4 Classification Methods
For our purposes, we picked one simple supervised and one simple unsupervised method. Our goal is to use unsupervised classification and at least get similar results to supervised ones.
4.1 K-Means
A simple unsupervised classification algorithm – the classic K-means clustering method [6] – is used here. It is generally accepted that even such a simple method is quite powerful for unsupervised data clustering if it is given an appropriate feature vector. As we have shown in [8], even simple feature vectors consisting of the tf-idf weights appear to capture the content of the document rather well (and the reduced feature vectors obtained from LSA do it even better). However, we expected to obtain even better results from the doc2vec weights, as they have been shown to be very good at extracting semantic information from documents. The sklearn implementation of the K-means algorithm is used. All preprocessed representations created according to Sect. 3 are used, and this model is applied to all the data sets described in Sect. 2. Results can be found in Sect. 6.
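A minimal sketch of the clustering step is shown below; the evaluation in Sect. 5 requires the number of clusters to equal the number of original categories, and the remaining settings are assumptions.

```python
from sklearn.cluster import KMeans

def cluster_documents(feature_matrix, n_categories, seed=0):
    """Cluster document vectors into as many clusters as there are gold categories."""
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=seed)
    return km.fit_predict(feature_matrix)  # one cluster id per document
```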
4.2 SVM
The supervised classification method used here is the classic linear SVM algorithm. This simple but powerful supervised classification algorithm could be quite sufficient. It was run only with the TF-IDF weights representation. We have used the linear SVM implementation from our favourite sklearn package (to be exact, the LinearSVC module). Results can be found in Sect. 6.
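A sketch of this supervised baseline, combining sklearn's LinearSVC with the 10-fold cross-validation mentioned in Sect. 6; the helper name and the default hyper-parameters are assumptions.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def svm_baseline(tfidf_matrix, gold_labels, folds=10):
    """Linear SVM on TF-IDF features, evaluated with k-fold cross-validation."""
    scores = cross_val_score(LinearSVC(), tfidf_matrix, gold_labels, cv=folds)
    return scores.mean()
```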
5 Evaluation
Quite a few measures for the evaluation of classification algorithms are widely used in published papers. In our experiments, we have decided to use accuracy, precision, recall and F1; this choice was guided mostly by the fact that we wanted to compare the performance of our algorithms with previously published results. The Accuracy (Acc) measure is used only for the 20NG2 data set. It represents the percentage of correctly classified documents, i.e. simply the proportion of test documents that are assigned the correct topic. Tables 1 and 3 list the results using the micro-averaged Precision and Recall measures computed according to [12]; the following equations are explained in our previous work [8] and in the article [12]:

P(T) = \frac{\sum_c \alpha(c, T)}{\sum_c [\alpha(c, T) + \beta(c, T)]}     (3)

R(T) = \frac{\sum_c \alpha(c, T)}{\sum_c [\alpha(c, T) + \gamma(c, T)]}     (4)

The standard equation for computing the F1 measure is [1]:

F1 = \frac{2 \cdot P \cdot R}{P + R}     (5)
The results reported in Tables 1 and 3 list only the Precision measure. This is caused by the usage of uni-labeled data sets (where the number of original categories in the corpus also has to be the same as the number of output clusters from the algorithms): P(T) is then necessarily equal to R(T) and to F1, so it is sufficient to report only one of those values.
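One way to obtain this single number for a clustering output is sketched below; it assumes (which the paper does not state explicitly) that each output cluster is first mapped to its majority gold category before the correctly assigned documents are counted.

```python
import numpy as np
from collections import Counter

def cluster_precision(cluster_ids, gold_labels):
    """Micro-averaged precision of a clustering on uni-labeled data: each cluster is
    mapped to its majority gold category, then correct assignments are counted."""
    cluster_ids = np.asarray(cluster_ids)
    gold_labels = np.asarray(gold_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = gold_labels[cluster_ids == c]
        correct += Counter(members).most_common(1)[0][1]  # size of the majority label
    return correct / len(gold_labels)
```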
6 Results
The first set of results is listed in Table 1; these results were achieved on the 20NG, 10NG, Binary[0/1/2], 5Multi[0/1/2] and 10Multi[0/1/2] data sets. We report only the 10Multi Average, 5Multi Average and 2Multi Average results for the smaller data sub-sets and compare them with the values reported in the previously published paper [12]. Only the results of the unsupervised Sequential Information Bottleneck (sIB) method created by the authors of the mentioned paper were used. In our experiments, Linear SVM uses the 10-fold cross-validation technique and we run the K-means algorithm 10 times over each subset (the same approach as used in [12]). Averaged results from those runs are listed in Table 1. The meaning of the K-means experiment labels is as follows:
– TF-IDF uses tf-idf weights as input, every vector has size 5000.
– TF-IDF (LSA) uses tf-idf weights reduced by the LSA method, every vector has size 200.
– doc2vec uses doc2vec weights as input, every vector has size 5000.
– doc2vec (LSA) uses doc2vec weights reduced by the LSA method, every vector has size 200.
– TF-IDF + doc2vec is the combination of the TF-IDF (LSA) and doc2vec (LSA) weights, every vector has size 400.
In Table 2, the second set of results is listed. We again compare our results with the values reported in the previously published papers [1,3]. The authors of the paper [1] used the SVM based 1 (SVM b. 1) and SVM based 2 (SVM b. 2) methods. Both of these methods are classic SVM algorithms; the SVM b. 1 method uses as input training data generated with the use of WordNet, the documents of the input corpus and preprocessing such as stop-word removal, tokenization, TF-IDF representation, clusters created by Latent Semantic Indexing (LSI), etc. The SVM b. 2 method uses the same preprocessing but only the corpus of input documents. The results of both their methods and of our algorithms are macro F1-measures over the three data sub-sets divided into training and testing data according to Sect. 2. The method listed as HM, described in [3], is a semi-supervised classification and uses a hybrid model of a deep belief network and softmax regression.
Table 1. Comparison of our results with results achieved in [12]. Precision of methods [%].

20NewsGroups sub-set | sIB   | Linear SVM (TF-IDF) | K-means TF-IDF | K-means TF-IDF (LSA) | K-means doc2vec | K-means doc2vec (LSA) | K-means TF-IDF + doc2vec
20NG                 | 57.50 | 96.38 | 51.75 | 51.68 | 70.91 | 70.76 | 73.14
10NG                 | 79.50 | 95.61 | 41.43 | 42.42 | 62.80 | 67.81 | 62.67
Average "large"      | 68.50 | 95.99 | 46.59 | 47.05 | 66.86 | 69.29 | 67.91
10Multi Average      | 67.00 | 91.63 | 40.26 | 40.79 | 47.15 | 49.90 | 52.18
5Multi Average       | 91.67 | 96.85 | 63.65 | 63.25 | 72.45 | 77.76 | 80.95
2Multi Average       | 91.20 | 99.25 | 93.49 | 93.57 | 96.81 | 96.91 | 96.08
Average "small"      | 83.30 | 95.91 | 65.80 | 65.87 | 72.13 | 74.86 | 76.40
The unlabeled data are used to train the deep belief network model and the labelled data are used to train the softmax regression model and to fine-tune the whole coherent system. The results stated as HM are only one of the several results in [3]; they use a different division of the data set into training and testing data, using 7 500 documents as the test set, 11 000 as the unlabeled training set and 3 000 as the labelled training set. To obtain our results we used a division similar to the one used in [3]: training data (we concatenated their unlabeled and labelled data – approximately 13 000 documents, labelled for Linear SVM and without labels for K-means) and test data (approximately 7 000 documents).

Table 2. Comparison of our results with results achieved in [1,3].

Data set            | SVM b. 1 (a) | SVM b. 2 (b) | HM    | Lin. SVM (TF-IDF) | K-means TF-IDF | K-means TF-IDF (LSA) | K-means doc2vec | K-means doc2vec (LSA) | K-means TF-IDF + doc2vec
20NG1 (c), F1 [%]   | 73.00        | 64.00        | –     | 80.00             | 54.00          | 54.00                | 69.01           | 48.00                 | 52.00
20NG2 (d), Acc [%]  | –            | –            | 82.63 | 95.21             | 52.74          | 25.72                | 66.06           | 27.15                 | 29.47
(a) Training done by using 20NewsGroups and Web features. (b) Training done by using only 20NewsGroups. (c) Data set prepared according to [1] and described in Sect. 2. (d) Data set prepared according to [3] and described in Sect. 2.
Results on the Czech data sets are listed in Table 3. We state these only for the purpose of testing our approach on data in a language other than English. The results on a language rather distant from English show that our approach to the preparation of the data can also be applied in this case.
Table 3. Results on Czech data sets. Precision of methods [%].

Czech data set | Linear SVM (TF-IDF) | K-means TF-IDF | K-means TF-IDF (LSA) | K-means doc2vec | K-means doc2vec (LSA) | K-means TF-IDF + doc2vec
CNO            | 76.79 | 28.79 | 28.91 | 30.87 | 29.97 | 29.45
RCNO1          | 93.94 | 46.13 | 47.06 | 53.71 | 52.79 | 54.60
RCNO2          | 96.30 | 42.20 | 42.85 | 49.24 | 49.46 | 53.04
RCNO3          | 93.54 | 51.11 | 51.86 | 61.00 | 61.00 | 61.29
TC             | 77.92 | 31.29 | 32.12 | 31.51 | 28.65 | 32.53
Large TC       | 78.89 | 40.34 | 38.79 | 38.68 | 38.54 | 42.08
7 Conclusion
A reasonably effective pipeline for the unsupervised classification of text documents according to their topic is introduced in this paper. Preprocessing of the raw input text (applying lemmatization and data-driven stop-word removal) and the extracted feature vectors (use of the LSA method) are key factors in our approach. The simple supervised Linear SVM and unsupervised K-means classification algorithms were used and, as predicted, the supervised one is superior to the unsupervised one. Our main goal was for the unsupervised algorithm to reach results at least similar to the supervised one. The performance of this unsupervised method (stated in Table 2) was almost on par with the semi-supervised algorithm and even better than the supervised algorithms used in [1]. Also, as can be seen from Tables 1, 2 and 3, the representation using the doc2vec model increases the performance of our unsupervised method by around 10%. This is an important finding of our research, since benchmark training data – which are necessary for supervised learning – are often not available. Our approach to preprocessing the input texts is also suitable for the simple supervised Linear SVM algorithm, whose performance is comparable with more complex ones (Table 2).
Acknowledgments. This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506.
References
1. Chinniyan, K., Gangadharan, S., Sabanaikam, K.: Semantic similarity based web document classification using support vector machine. Int. Arab J. Inf. Technol. (IAJIT) 14(3), 285–292 (2017)
2. Hamdi, A., Voerman, J., Coustaty, M., Joseph, A., d'Andecy, V.P., Ogier, J.M.: Machine learning vs deterministic rule-based system for document stream segmentation. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 5, pp. 77–82. IEEE (2017)
3. Jiang, M., et al.: Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 29(1), 61–70 (2018)
4. Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368 (2016)
5. Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2418–2424 (2015)
6. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
7. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)
8. Novotný, J., Ircing, P.: Unsupervised document classification and topic detection. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 748–756. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_75
9. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://scikit-learn.org
10. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50 (2010). https://radimrehurek.com/gensim/
11. Siolas, G., d'Alché-Buc, F.: Support vector machines based on a semantic kernel for text categorization. In: IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN), vol. 5, pp. 205–209 (2000)
12. Slonim, N., Friedman, N., Tishby, N.: Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136 (2002)
13. Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18 (2014)
14. Švec, J., et al.: General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Lang. Resour. Eval. 48(2), 227–248 (2014). https://doi.org/10.1007/s10579-013-9246-z
15. Trieu, L.Q., Tran, H.Q., Tran, M.T.: News classification from social media using twitter-based doc2vec model and automatic query expansion. In: Proceedings of the Eighth International Symposium on Information and Communication Technology, pp. 460–467. ACM (2017)
A Comparative Survey of Authorship Attribution on Short Arabic Texts
Siham Ouamour and Halim Sayoud
University of Science and Technology Houari Boumediene, Algiers, Algeria
[email protected], [email protected]
Abstract. In this paper, we deal with the problem of authorship attribution (AA) on short Arabic texts. So, we make a survey on a set of several features and classifiers that are employed for the task of AA. This investigation uses characters, character bigrams, character trigrams, character tetragrams, words, word bigrams and rare words. The AA is ensured by 4 different measures, 3 classifiers (Multi-Layer Perceptron (MLP), Support Vector Machines (SVM) and Linear Regression (LR)) and a new proposed fusion called VBF (i.e. Vote Based Fusion). The evaluation is done on short Arabic texts extracted from the AAAT dataset (AA of Ancient Arabic Texts). Although the task of AA is known to be difficult on short texts, the different results have revealed interesting information on the performances of the features and classification techniques on Arabic text data. For instance, character-based features appear to be better than word-based features for short texts. Furthermore, the proposed VBF fusion provided high performances with an accuracy of 90% of good AA, which is higher than the score of the original classifier using only one feature. Globally, the results of this investigation shed light on the efficiency and pertinence of several features and classifiers in AA of short Arabic texts.
Keywords: Natural language processing · Artificial intelligence · Authorship attribution · Arabic language · Short texts · Text-mining
1 Introduction
As per definition, the task of author recognition can be divided into several fields:
• authorship attribution (AA) or identification: consists in identifying the author(s) of a set of different texts;
• authorship verification: consists in checking whether a piece of text is written or not by an author who claimed to be the writer;
• authorship discrimination: consists in checking if two different texts are written by a same author or not [1];
• plagiarism detection: in this research field we look for the sentences or paragraphs that are taken from another author [2];
• text indexing and segmentation: which consists in segmenting the global text into homogeneous segments (each segment contains the contribution of only one author) by giving the name of the appropriate author for each text segment [3].
Although several works are reported for the English and Greek [4] languages, the authors have not found many serious research works on Arabic texts. That is why they propose an overall AA research work that handles several texts written by 10 ancient Arabic travelers who wrote several books describing their travels. A special Arabic corpus has been built by the authors of this paper in order to assess several features and classifiers. The paper is organized as follows: in Sect. 2, we quote some previous works related to AA; in Sect. 3, we describe our textual corpus; Sect. 4 defines the different classifiers and distances used during the experiments; results are presented in Sect. 5 and an overall conclusion is given in Sect. 6.
2 Related Works
Authorship attribution consists in identifying the author of a given text. Several works have tested different features during the last three decades. For instance, Holmes in 1994 [5], Stamatatos in 2000 [6] and Zheng in 2006 [7] proposed taxonomies of features to quantify the writing style. Mendenhall in 1887 [8] proposed sentence length counts and word length counts; a significant advantage of such features is that they can be applied to any language. Several researchers used lexical features to represent the author style, while other works used common words instead [9,10]. Hence, various sets of words have been used for English; we can quote the works of Abbasi and Chen in 2005 [11], the works of Argamon in 2003 [12], the works of Zhao and Zobel in 2005 [13], the works of Koppel and Schler in 2003 [14] and, similarly, the works of Argamon in 2007 [15]. A new interesting feature was proposed by [16] and [17], namely the word n-grams, which provided very good performances. Concerning character n-grams, the application of this approach to AA has shown interesting success. Character bigrams and trigrams have been used in the works of Kjell [18]. In the works of Forsyth and Holmes [19], it was found that bigrams and character n-grams of variable length performed better than lexical features. They have also been successfully used in the works of Peng [20], Keselj [21] and Stamatatos [22]. On the other hand, it is not only the feature which is important; the choice of a suitable classifier is important too. In 2010, Jockers and Witten [23] tested five different classifiers. Concerning the Arabic language, not many works have been reported. However, we can cite some recent works such as those reported by Sayoud in 2012 [1] and Shaker [24]. Sayoud conducted an investigation on authorship discrimination between two old Arabic religious books: the Quran (the holy words of God) and the Hadith (statements of the prophet Muhammad) [1]. Shaker investigated the AA problem in Arabic using function words [24]. In this investigation, we are interested in using several features and classifiers for an evaluation in Arabic stylometry. The AAAT dataset was built by the authors of this paper for the purpose of AA.
3 Description of the Text Dataset
Our textual corpus is composed of 10 groups of old Arabic texts extracted from 10 different Arabic books. The books are written by ten different authors and each group
contains different texts belonging to a unique author. This set of texts has been collected in 2011 from “Alwaraq library” (www.alwaraq.net); we called it AAAT. Furthermore, this corpus represents a reference dataset for AA in Arabic, which has been used by several researchers working in this field. The texts of the corpus are quite short: the average text length is about 550 words and some texts have less than 300 words.
4 Classification Methods
For the evaluation task, we have evaluated 4 distances (Manhattan, Cosine, Stamatatos and Canberra distances) and 3 classifiers (SVM, MLP and LR). Several features are also used, namely: characters, character n-grams, words, word n-grams and rare words, in order to find the most reliable characteristic for the Arabic language. Furthermore, a Vote Based Fusion (VBF) has been proposed to enhance the overall classification performances.
4.1 Manhattan Distance (Man)
The Manhattan distance between two vectors X and Y of length n is defined as follows:
Man(X, Y) = \sum_{i=1}^{n} |X_i - Y_i|     (1)
4.2 Cosine Distance
Cosine similarity is a measure of similarity between two vectors X and Y (of length n) that measures the cosine of the angle between them (denoted by θ). The cosine distance, cos(θ), is represented using a dot product and magnitude as:

\cos\theta = \frac{X \cdot Y}{\|X\|\,\|Y\|} = \frac{\sum_{i=1}^{n} X_i \cdot Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2} \cdot \sqrt{\sum_{i=1}^{n} Y_i^2}}     (2)
4.3 Stamatatos Distance (Sta)
This distance was introduced by Stamatatos [25] to measure text similarity. It was successfully employed in AA. It is given by the following formula:

Sta(X, Y) = \sum_{i=1}^{n} \left[ \frac{2 (X_i - Y_i)}{X_i + Y_i} \right]^2     (3)
4.4 Canberra Distance (Can)
The Canberra distance between vectors X and Y is given by the following equation:

Can(X, Y) = \sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i}     (4)
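A small sketch of the four measures of Eqs. (1)-(4) for two feature-frequency vectors is given below; it assumes NumPy arrays with no zero-sum components (e.g. smoothed frequencies), and the function names are ours.

```python
import numpy as np

def manhattan(x, y):           # Eq. (1)
    return np.sum(np.abs(x - y))

def cosine(x, y):              # Eq. (2), cosine of the angle between x and y
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def stamatatos(x, y):          # Eq. (3)
    return np.sum((2.0 * (x - y) / (x + y)) ** 2)

def canberra(x, y):            # Eq. (4)
    return np.sum(np.abs(x - y) / (x + y))
```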
4.5 Sequential Minimal Optimization-Based Support Vector Machines (SVM) In machine learning, SVM are supervised learning models with associated learning algorithms that analyze data and recognize patterns. They are used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, a SVM training algorithm builds a model that assigns new examples into one category or the other. A SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVM can efficiently perform nonlinear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. The SVM is a very accurate classifier that uses bad examples to form the boundaries of the different classes. Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming problem that arises during the training of the SVM. The SMO algorithm is used to speed up the training of the SVM. In our application, we solved the multi-class problems by using pairwise classification technique. 4.6 Multi-layer Perceptron (MLP) The MLP is a feed-forward neural network classifier that uses the errors of the output to train the neural network: it is the “training step”. The MLP is organized in layers: one input layer of distribution points, one or more hidden layers of artificial neurons (nodes) and one output layer of artificial neurons. Each node, in a layer, is connected to all other nodes in the next layer and each connection has a weight (which can be zero). The MLP is considered as universal approximator and is widely used in supervised machine learning classification. The MLP can use different back-propagation schemes to ensure the classifier training. 4.7 Linear Regression Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norms
(as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression. In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models. Usually, the predictor variable is denoted by the variable X and the criterion variable is denoted by the variable y. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.
4.8 Classification Process
The general classification process is divided into two methods: Training Model based Classification and Nearest Neighbor based Classification. In the first type, a training step is required to build the model or the centroid (in the case of similarity measures); afterwards, the testing step can be performed by using the resulting model. In the second type, no training is required, since a simple similarity distance is computed between the unknown document and each referential text: the smallest distance gives an indication of the most probable class. Furthermore, two types of measures are employed: a simple distance and a centroid-based distance. The first type is known to be inaccurate, while the second one (i.e. centroid) is more accurate and robust against noise. The first classification type includes the following classifiers: centroid-based similarity measures, Multi-Layer Perceptron, SMO-based Support Vector Machines and Linear Regression; the second classification type includes only the nearest neighbor similarity measures. After every identification test, a score of good AA is computed in order to get an estimation of the overall classification performances.
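A sketch of the centroid-based variant of the first classification type is shown below: the known texts of each author are averaged into a centroid profile and the unknown text is attributed to the author whose centroid is closest. It re-uses the manhattan function sketched after Eq. (4); the data layout and names are assumptions.

```python
import numpy as np

def centroid_attribution(author_vectors, unknown_vector, distance=manhattan):
    """author_vectors: dict mapping an author name to the list of feature vectors
    of his or her known texts; returns the attributed author for the unknown text."""
    centroids = {author: np.mean(np.vstack(vectors), axis=0)
                 for author, vectors in author_vectors.items()}
    return min(centroids, key=lambda author: distance(centroids[author], unknown_vector))
```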
5 Experiments of Authorship Attribution
In this section, we present the different experiments of AA, which are conducted on the historical Arabic texts. Several features are tested, such as: characters, character bigrams, character trigrams, character tetragrams, words, word bigrams, word trigrams, word tetragrams and rare words. On the other hand, different types of classifiers (MLP, SVM and LR) and distances are employed to ensure the AA classification. The AA Score (AAS) is calculated by using the RandAccuracy formula, as follows:

AAS = RandAccuracy = \frac{\text{number of texts that are well attributed}}{\text{total number of texts}}     (5)
5.1 Comparative Performances
For the purpose of comparison, several figures are presented and commented on to make a comparative study of the different features and classifiers. Figure 1 summarizes the overall best results given by each classifier. In this figure, we remark that the Manhattan centroid distance seems to be very accurate, with a score of 90%, followed by the classifiers MLP and SVM, with a score of 80%; after that, we find the Manhattan nearest neighbor distance and the LR classifier, which provide a score of 70%. Finally, the remaining distances – Canberra, Cosine and Stamatatos – give the worst performances, with a score of 60%.
Fig. 1. Best scores of authorship attribution (AAS) given by the different classifiers.
In Fig. 2, we have presented the average AA performances for every feature. Those performances are obtained by calculating the mean of all the feature scores.
Fig. 2. Overall authorship attribution score for the different features used.
A Comparative Survey of Authorship Attribution on Short Arabic Texts
485
From Fig. 2, we can deduce that the best feature in these experiments is character trigrams, followed by character tetragrams, character bigrams and rare words. The AA performance continues to decrease, respectively, for words, characters, word bigrams, word trigrams and finally word tetragrams, which represent the worst features in our experiments. Overall, we notice two important points: on one hand, the AAS increases with the character n-gram size (i.e. the size n) and decreases with the word n-gram size; on the other hand, character n-grams seem to be more accurate than word n-grams and rare words. Similarly, and in a dual form, Fig. 3 displays the average scores obtained by the different classifiers. These performance scores are obtained by calculating the mean of all the scores of a specific classifier. We notice that the machine learning classifiers are the most accurate, especially the SMO-SVM (average score exceeding 70%), which provides high AA performance. The MLP is strongly accurate, with a score of about 70% of good attribution, and the linear regression is quite interesting (score over 60%). On the other hand, we notice that the distances are less accurate overall, since their average attribution scores do not exceed 58.33%.
Fig. 3. Average AA score per classifier.
Once again, we can observe that character n-grams are better than word n-grams according to this same figure (Fig. 3), and we can also notice that the system fails when using word n-grams. The latter seem unsuitable for short texts: this result is logical because short texts do not contain enough words, or enough word n-grams, to make a fair statistical representation of the features. Figure 4 presents the best score given by each feature. We see that a score of 90% is given by character tetragrams, followed by a score of 80% for character bigrams, character trigrams and rare words, thereafter a score of 70% for words, 60% for characters, 50% for word bigrams, and a score of 20% for word trigrams and tetragrams.
Fig. 4. Best score obtained with the different features.
5.2 Vote Based Fusion
In order to enhance the attribution performance, we thought of using several classifiers combined together in order to get a lower discrimination error: this combination is called fusion. The fusion, in the broad sense, can be performed at different hierarchical levels or processing stages [26], as follows:
• Feature level, where the feature sets of different modalities are combined;
• Score (matching) level, the most common level where the fusion takes place: the scores of the classifiers are normalized and then combined in a consistent manner;
• Decision level, where the outputs of the classifiers establish the decision via techniques such as majority voting.
In this investigation, we have chosen to use the SMO-SVM classifier, which appears to be the best classifier in our experiments. The proposed fusion method is performed at the decision level and is called the "Vote-Based Fusion technique" or VBF. It consists in fusing the output decisions of the different systems (i.e. each system uses the SVM classifier with one specific feature), as described in Eq. 6. For the choice of the features, we have decided to keep only the most pertinent ones, namely those presenting a best score of at least 80%. So, according to Fig. 5, those pertinent features are: character bigrams, character trigrams and rare words.

VBF_{Fusion} = Round\left\{ (\alpha_1 \cdot Char2gram_{CLASS} + \alpha_2 \cdot Char3gram_{CLASS} + \alpha_3 \cdot RareWords_{CLASS}) \cdot \frac{1}{\alpha_1 + \alpha_2 + \alpha_3} \right\}     (6)

where CLASS represents the classifier output and α_i is a constant smaller than one.
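Eq. (6) can be read as the rounded weighted mean of the three single-feature SVM decisions, each decision being an integer author index; a sketch is given below. Equal weights below one are an assumption, as the paper does not give the α_i values.

```python
def vbf_fusion(char2gram_class, char3gram_class, rare_words_class,
               alphas=(0.5, 0.5, 0.5)):
    """Vote-Based Fusion of Eq. (6): rounded weighted mean of the three
    single-feature SVM outputs (integer class indices). Equal alphas assumed."""
    a1, a2, a3 = alphas
    fused = a1 * char2gram_class + a2 * char3gram_class + a3 * rare_words_class
    return round(fused / (a1 + a2 + a3))
```

Note that rounding an average of class indices depends on how the authors are numbered; a plain majority vote over the three decisions would be the more usual realisation of decision-level voting.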
Fig. 5. Vote fusion technique. The outputs Oj are fused to produce the author identity.
The same previous experiments of AA have been conducted by using the proposed fusion technique. Results show that the fusion provides an accuracy of 90%, which is higher than all the scores provided by the SVM. This result is interesting since it shows that it is possible to enhance the identification accuracy only by combining several features and/or classifiers together. Furthermore, it is important to mention that an accuracy of 90% with short texts is motivating, since previous works showed that the minimum amount of required text for a fair AA is at least 2500 tokens [27].
6 Conclusion
An investigation of AA has been conducted on an old Arabic set of text documents that were written by ten ancient Arabic travelers. In this investigation, eleven different classifiers and distances have been used for the attribution task, by using nine different features. Moreover, a fusion technique, called VBF, has been proposed to enhance the AA performances. The main conclusions of the different experiments can be summarized by the following points:
• Character bigrams, trigrams and tetragrams appear to be interesting: character tetragrams appear to be suitable for distances (Manhattan, Canberra, Cosine and Stamatatos), while for the machine learning classifiers, character bigrams are the most accurate.
• The Manhattan centroid distance has shown excellent performances, with an accuracy of 90% when using character tetragrams. The performances of this distance are more or less comparable to those of the SVM, which is considered very reliable.
• As expected theoretically, the SVM has shown excellent average performances in most experiments, which recommends the use of this type of classifier in AA.
• Character-based features are better than word-based ones for short documents.
• The proposed VBF fusion provided high performances with an accuracy of 90% of good AA, which highly recommends the use of fusion in AA.
• Although the word-based features did not give good results, rare words have presented good scores for almost all the classifiers. This result shows that some linguistic information about the author style is embedded in the rare words.
Finally, we think that the results of this investigation are interesting since they shed light on the efficiency of several features and classifiers for AA of short Arabic texts. As future work, we propose to evaluate our system on dialectal Arabic.
References 1. Sayoud, H.: Author discrimination between the Holy Quran and Prophet’s statements. Lit. Linguist. Comput. 27(4), 427–444 (2012) 2. Chowdhury, H.A., Bhattacharyya, D.K.: Plagiarism: taxonomy, tools and detection techniques. In: Paper of the 19th National Convention on Knowledge, Library and Information Networking (NACLIN 2016) held at Tezpur University, Assam, India (2016) 3. Sayoud, H.: Segmental analysis based authorship discrimination between the Holy Quran and Prophet’s statements. Can. Soc. Digit. Hum., Digital Studies Journal (2015) 4. Tambouratzis, G., Hairetakis, G., Markantonatou, S., Carayannis, G.: Applying the SOM model to text classification according to register and stylistic content. Int. J. Neural Syst. 13(1), 1–11 (2003) 5. Holmes, D.I.: Authorship attribution. Comput. Humanit. 28, 87–106 (1994) 6. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000) 7. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57(3), 378–393 (2006) 8. Mendenhall, T.C.: The characteristic curves of composition. Science 9, 237–249 (1887) 9. Argamon S., Levitan, S.: Measuring the usefulness of function words for authorship attribution. In: Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (2005) 10. Burrows, J.F.: Word patterns and story shapes: the statistical analysis of narrative style. Lit. Linguist. Comput. 2, 61–70 (1987) 11. Abbasi, A., Chen, H.: Applying authorship analysis to extremist-group web forum messages. Intell. Syst. 20(5), 67–75 (2005) 12. Argamon, S., Saric, M., Stein, S.: Style mining of electronic messages for multiple authorship discrimination: first results. In: Proceedings of 9th ACM SIGKDD, pp. 475–480 (2003) 13. Zhao, Y., and Zobel, J.: Effective and scalable authorship attribution using function words. 2nd Asia Information Retrieval Symposium (2005) 14. Koppel, M., Schler J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: IJCAI Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72 (2003) 15. Argamon, S., et al.: Stylistic text classification using functional lexical features. J. Am. Soc. Inform. Sci. Technol. 58(6), 802–822 (2007) 16. Peng, F., Shuurmans, D., Wang, S.: Augmenting naive Bayes classifiers with statistical language models. Inf. Retrieval J. 7(1), 317–345 (2004) 17. Sanderson, C., Guenter, S.: Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation. In: Proceedings of the International Conference on Empirical Methods in Natural Language Engineering, pp. 482–491 (2006) 18. Kjell, B.: Discrimination of authorship using visualization. Inf. Process. Manag. 30(1), 141– 150 (1994) 19. Forsyth, R., Holmes, D.: Feature-finding for text classification. Lit. Linguist. Comput. 11(4), 163–174 (1996)
20. Peng, F., Shuurmans, D., Keselj, V., Wang, S.: Language independent authorship attribution using character level language models. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pp. 267–274 (2003) 21. Keselj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. Pacific Association for Computational Linguistics, pp. 255–264 (2003) 22. Ouamour, S., Sayoud, H.: Authorship attribution of ancient texts written by ten arabic travelers using character N-Grams. CITS-2013, Athens, Greece, CITS (2013) 23. Jockers, M.L., Witten, D.M.: A comparative study of machine learning methods for authorship attribution. Lit. Linguist. Comput. 25(2), 215–223 (2010) 24. Shaker, K.: Investigating features and techniques for Arabic authorship attribution, PhD thesis Heriot-Watt University (2012) 25. Stamatatos, E.: Author identification using imbalanced and limited training texts, text-based Information Retrieval, pp. 237–241 (2007) 26. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. Trans. Circ. Syst. Video Technol. 14(1), 4–20 (2004) 27. Ouamour, S., Khennouf, S., Bourib, S., Hadjadj, H., Sayoud H.: Effect of the text size on stylometry-application on arabic religious texts. In: International Conference on Computer Science Applied Mathematics and Applications, pp 215–228, Vienna, Austria (2016)
How Good Is Your Model 'Really'? On 'Wildness' of the In-the-Wild Speech-Based Affect Recognisers Vedhas Pandit1(B), Maximilian Schmitt1, Nicholas Cummins1, Franz Graf2, Lucas Paletta2, and Björn Schuller1,3
1
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany [email protected] 2 Joanneum Research Forschungsgesellschaft mbH, Graz, Austria 3 Group on Language, Audio, and Music (GLAM), Imperial College London, London, UK
Abstract. We evaluate, for the first time, the generalisability of in-the-wild speech-based affect tracking models using the database used in the 'Affect Recognition' sub-challenge of the Audio/Visual Emotion Challenge and Workshop (AVEC 2017) – namely the 'Automatic Sentiment Analysis in the Wild (SEWA)' corpus – and the 'Graz Real-life Affect in the Street and Supermarket (GRAS2)' corpus. The GRAS2 corpus is the only corpus to date featuring audiovisual recordings and time-continuous affect labels of random participants recorded surreptitiously in a public place. The SEWA database was also collected in an in-the-wild paradigm, in that it features spontaneous affect behaviours and real-life acoustic disruptions due to connectivity and hardware problems. The SEWA participants, however, were well aware of being recorded throughout, and thus the data potentially suffers from the 'observer's paradox'. In this paper, we evaluate how a model trained on typical data suffering from the observer's paradox (SEWA) fares on real-life data that is relatively free from such psychological effects (GRAS2), and vice versa. Because of the drastically different recording conditions and recording equipment, the feature spaces of the two databases differ extremely. The in-the-wild nature of the real-life databases and the extreme disparity between the feature spaces are the key challenges tackled in this paper, a problem of high practical relevance. We extract bag-of-audio-words features using, for the very first time, a randomised, database-independent codebook. True to our hypothesis, the Support Vector Regression model trained on GRAS2 had better generalisability, as this model could reasonably predict the SEWA arousal labels.
Keywords: Affective speech analysis · Transfer learning · Observer's paradox · One-way mirror dilemma · Authentic emotions · In-the-wild
1
Introduction
Human speech is a complex signal, featuring a plethora of information beyond the spoken words. In addition to the linguistic content, a speech signal tells the listener a lot about the speaker – such as their age, gender, native language, motivations and emotions. It is important for a human–machine interaction system to recognise these contexts correctly, to be able to respond accordingly. Today, we are continuously surrounded by human–machine interfaces. A virtual assistant in a handheld device is no longer science fiction, but simply an everyday reality. There is, therefore, a growing interest in the field of affective computing in making machines 'understand' human speech in its entirety, i. e., including the featured emotions and contexts. Broadly speaking, there are three types of databases used in affect research. Early research utilised acted speech data, which typically featured highly exaggerated affect behaviours, far from natural ones (e. g., EmoDB [1,12]). In another data collection strategy, the participants are made to converse in a laboratory environment. While the behaviours collected are mostly natural and spontaneous, the collected data is typically clean and unaffected by real-life effects such as noise (e. g., RECOLA [16]). The third type, 'in-the-wild' databases, refers to data collected in non-laboratory, everyday, unpredictably noisy environments. However, the so-called 'in-the-wild' databases mostly feature recordings collected in identical real-life settings, with very similar acoustic disruptions. This has direct implications for the trained models, limiting their generalisability. Also, most of these databases suffer from the phenomenon called the 'observer's paradox' or 'one-way mirror dilemma' – where the participants are typically well aware of being recorded right from the beginning of the recordings – which affects the featured affect behaviours [19]. In this contribution, we test, for the first time, the hypothesis that a model trained on a closer-to-real-life database is likely to generalise better [14]. While there have been transfer learning studies on affect [2–4,11], there is hardly any research on the generalisability of time-continuous affect recognising models for real-life or in-the-wild datasets. To this end, we first introduce the two databases used in this study in Sect. 2. We describe our experiments in detail in Sect. 3. After this, we present our findings in Sect. 4, before we conclude the paper in Sect. 5.
2
Databases
To test which of the two affect recognising models generalises better – i. e., the one trained on a 'more' in-the-wild database or the one trained on a database collected under relatively restrained, 'laboratory'-like settings – we use two prominent benchmark databases, namely the 'Automatic Sentiment Analysis in the Wild' (SEWA) corpus used in the AVEC 2017 challenge and the 'Graz Real-life Affect in the Street and Supermarket' (GRAS2) corpus. The SEWA database features video chat recordings of the participants discussing the commercials they had just watched. The recordings were collected using
the standard webcams and computers from the participants' homes or offices. The data collection took place over the internet, using a video-chat interface specifically designed for this task. The recordings feature spontaneous affect behaviours, real-life noises and delays due to connectivity and hardware problems. The participants dominated the conversations more or less equally. The GRAS2 database features audiovisual recordings of conversations with unsuspecting participants, captured from a first-person point of view in a busy shopping mall. The participants were made aware of being recorded only half way through the conversations, and were requested to sign a consent form agreeing to release the recordings for research purposes. The database thus features spontaneous and 'more' authentic affective behaviours, as they are relatively more free of the observer's paradox. Because the conversations were totally spontaneous, their durations vary widely (standard deviation = 56.3 s). The extent to which the participants dominate the conversations, i. e., the relative durations of the subject's speech and the speech of the student research assistant collecting the data, also varies widely. Unfortunately, the student research assistants dominate many of the conversations. The sections of the recordings where the participants read the documents before signing the consent form hardly feature any subject speech. The recordings also contain dynamically varying noise, including impact sounds, bustle, background music, and background speech. Only 28 conversations are available. All these factors combine to make this database a lot more 'in-the-wild' and the affect tracking task a lot more challenging. The corpus was previously used in a research study establishing a correlation between eye contact and speech [6], and in another study on time-continuous authentic affect recognition in-the-wild [13].
3
Experimental Design
3.1 Data Splits
We split both the SEWA and the GRAS2 corpus into training, validation and test sets in a roughly 2:1:1 ratio, in terms of both the number of files in each split and the cumulative duration of the audio clips. We use the same splits as in the AVEC 2017 challenge [15] when running our experiments (Fig. 1) on the SEWA database. The splits are made such that a participant-independent model can be trained, i. e., no participant is present in more than one split. The splits on GRAS2 are likewise made such that each split features a different student assistant, i. e., no student assistant is present in more than one split. The statistics for the three splits are presented in Table 1.
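A participant-independent split of this kind can be produced with a grouped splitter; the sketch below is only illustrative (the actual SEWA splits are the fixed AVEC 2017 splits), and the variable names are placeholders.

```python
from sklearn.model_selection import GroupShuffleSplit

# Illustrative participant-independent 2:1:1 split; `clips` is a list of audio
# files and `groups` the corresponding participant (or student assistant) IDs.
def split_by_group(clips, groups, seed=0):
    outer = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    train_idx, rest_idx = next(outer.split(clips, groups=groups))
    rest = [clips[i] for i in rest_idx]
    rest_groups = [groups[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_idx, test_idx = next(inner.split(rest, groups=rest_groups))
    return ([clips[i] for i in train_idx],
            [rest[i] for i in val_idx],
            [rest[i] for i in test_idx])
```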
3.2 Feature Engineering
We need the features from the two databases to be compatible with one another; ideally, the two should share a common feature space. Because we are interested in predicting time-continuous signals of emotion dimensions, the features should also ideally capture the temporal dynamics of the varying low-level descriptor (LLD) space, and they should be robust to noise.
Fig. 1. Entire experimental design pipeline.

Table 1. Duration statistics for the SEWA and GRAS2 data splits.

| Duration (seconds)     | SEWA Train | SEWA Validation | SEWA Test | GRAS2 Train | GRAS2 Validation | GRAS2 Test |
|------------------------|------------|-----------------|-----------|-------------|------------------|------------|
| Total                  | 5608.02    | 2272.30         | 2807.42   | 2018.75     | 1000.45          | 998.02     |
| Max                    | 175.64     | 175.45          | 175.81    | 218.77      | 290.94           | 309.40     |
| Min                    | 46.68      | 97.43           | 174.9     | 71.77       | 100.31           | 86.40      |
| Mean                   | 164.94     | 162.31          | 175.46    | 126.17      | 166.74           | 166.34     |
| Std. Dev.              | 31.24      | 26.71           | 0.24      | 34.90       | 63.67            | 74.93      |
| Number of participants | 34         | 14              | 16        | 16          | 6                | 6          |
features should also ideally capture the temporal dynamics of the varying lowlevel descriptor (LLD) space. The features should ideally be robust to noise. We generate the bags of audio words (BoAW) features using our own openXBOW toolkit [17] by vector quantising the ‘enhanced Geneva Minimalistic Acoustic Parameter Set’ (eGeMAPS) [5] low level descriptors (LLDs) extracted using our openSMILE toolkit [7]. This feature set is quite popular in the affective computing field already; we have used these exact features for establishing a baseline model performance for the AVEC 2017 challenge as the challenge organisers. The eGeMAPS LLDs is a minimalistic set of acoustic parameters, particularly tailor-made for affective vocalisation and voice research, consisting of only 23 LLDs. To capture the temporal dynamics of the individual parameters and LLD types, we extract BoAW features based on these LLDs. The BoAW approach generates a sparse fixed length histogram representation of the quantised features in time, thus capturing the temporal dynamics of the LLD vectors, while remaining noise-robust due to its inherent sparsity and the quantisation step [13,17,18]. However, the eGeMAPS LLDs are drastically different for the two databases in terms of their value ranges. Because the critical statistics – such as the mean, the variance, the maximum and the minimum value – are radically different (some with even the opposite signs), the statistics computed on one database cannot be reliably be used to standardise or normalise the other database such that they share a common feature space. Furthermore, the codebook used in the AVEC 2017 challenge utilises a random sampling of the SEWA eGeMAPS LLD vectors. For transfer learning experiments however, we ideally should not gener-
the codebook by sampling only one of the two databases, as such a codebook is likely to represent one dataset better than the other. It is imperative to use an identical, completely data-independent codebook to vector quantise the two databases – especially when the ranges of the feature values are drastically different. Only then can we objectively assess the generalisability of the trained models, free from the effect of the codebook representing the temporal dynamics of one dataset better than the other. We thus generate a codebook of size 1000, independent of the two databases, consisting of 23-length LLD vectors. An array of shape 1000 × 23, populated with random samples from a normal distribution (mean = 0.5, standard deviation = 0.1), is used as the codebook matrix. We preprocess the LLDs by scaling and offsetting all of the data splits, using the offsets and scaling factors that normalise the respective training split to the range [0, 1]. We then vector quantise all of the LLDs against the randomised codebook with 10 soft assignments for every LLD. We compute the distribution of the assignments in a moving window of 6 s, with a hop size of 0.1 s – similar to how the AVEC 2017 features were generated [15].
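The following NumPy sketch illustrates the randomised-codebook BoAW computation described above. It is a simplification rather than the openXBOW implementation: the LLD frame rate, the normalisation statistics and the conversion of the window and hop sizes into frame counts are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(loc=0.5, scale=0.1, size=(1000, 23))  # data-independent codebook

def normalise(lld, lo, hi):
    """Scale/offset LLDs with training-split statistics so they lie in [0, 1]."""
    return (lld - lo) / (hi - lo + 1e-12)

def bag_of_audio_words(lld, codebook, n_soft=10, win_frames=60, hop_frames=1):
    """Soft-assign each LLD frame to its n_soft nearest codewords and count the
    assignments in a sliding window (60 frames for 6 s and a 1-frame hop for
    0.1 s, assuming one LLD frame every 0.1 s -- an assumption of this sketch)."""
    # Euclidean distance between every frame and every codeword
    d = np.linalg.norm(lld[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :n_soft]              # (frames, n_soft)
    histograms = []
    for start in range(0, len(lld) - win_frames + 1, hop_frames):
        hist = np.bincount(nearest[start:start + win_frames].ravel(),
                           minlength=len(codebook))
        histograms.append(hist)
    return np.asarray(histograms, dtype=float)

# toy usage with random "LLDs" already normalised to [0, 1]
boaw = bag_of_audio_words(rng.random((600, 23)), codebook)
print(boaw.shape)  # (541, 1000)
```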
3.3 Gold Standard Generation
We use the gold standard arousal and valence values of the AVEC 2017 challenge when training on the SEWA database [15]. We generate the gold standard for the GRAS2 database using the same algorithm as for SEWA. The gold standard used in our previous studies on GRAS2 differs only in that we previously did not compensate for annotator-specific mean annotation standard deviations [13]. We use the modified Evaluator Weighted Estimator (EWE) method to generate the gold standards, one per subject per emotion dimension. The goal of the EWE metric is to take into account the reliability of the individual annotators, signified by the weight r_k for every annotation y_k. This confidence value is computed by quantifying the extent to which the annotations by that annotator agree with the rest of the annotations. The gold standard y_EWE is defined as:

y_{EWE_n} = \frac{1}{\sum_{k=1}^{K} r_k} \sum_{k=1}^{K} r_k \, y_{n,k},   (1)

where y_{n,k} is an annotation by annotator k (k ∈ N, 1 ≤ k ≤ K) at instant n (n ∈ N, 1 ≤ n ≤ N) contributing to the annotation sequence y_k, and r_k is the corresponding annotator-specific weight. The lower bound for r_k is set to 0. In [8], the weight r_k is defined to be the normalised cross-correlation between y_k and the averaged annotation sequence \bar{y}_n. The gold standards used in both the AVEC 2017 baseline paper [15] and the GRAS2-based affect recognition study [13] redefined the weight r_k such that it is strongly influenced by the total number of annotations that y_k is in agreement with, and also by the extent to which they agree, by simply averaging the pairwise correlations. The weights are lower-bounded at 0 as usual, and are then normalised such that they sum to 1:

r_{k_i,k_j} = \frac{\sum_{n=1}^{N} (y_{n,k_i} - \mu_{k_i})(y_{n,k_j} - \mu_{k_j})}{\sqrt{\sum_{n=1}^{N} (y_{n,k_i} - \mu_{k_i})^2}\,\sqrt{\sum_{n=1}^{N} (y_{n,k_j} - \mu_{k_j})^2}}, \quad \text{where } \mu_k = \frac{1}{N}\sum_{n'=1}^{N} y_{n',k},   (2)

r_{k_i} = \begin{cases} \frac{1}{K}\sum_{k_j=1}^{K} r_{k_i,k_j} & \text{if } \sum_{k_j=1}^{K} r_{k_i,k_j} > 0 \\ 0 & \text{if } \sum_{k_j=1}^{K} r_{k_i,k_j} \le 0 \end{cases}, \qquad r_{k_i} \leftarrow \frac{r_{k_i}}{\sum_{k_j=1}^{K} r_{k_j}}.   (3)
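A compact NumPy sketch of Eqs. (1)–(3) is given below. It assumes an N × K matrix of annotator traces with no constant (zero-variance) tracks, and is meant only to make the weighting explicit, not to reproduce the exact AVEC tooling.

```python
import numpy as np

def ewe_gold_standard(annotations):
    """annotations: (N, K) array -- N time steps, K annotators."""
    n, k = annotations.shape
    r_pair = np.corrcoef(annotations.T)      # pairwise correlations, Eq. (2)
    r = r_pair.mean(axis=1)                  # average agreement, Eq. (3)
    r = np.clip(r, 0.0, None)                # lower-bound the weights at 0
    if r.sum() == 0:                         # degenerate case: fall back to a plain mean
        r = np.ones(k)
    r = r / r.sum()                          # normalise so the weights sum to 1
    # With normalised weights, the denominator of Eq. (1) equals 1.
    return annotations @ r

# toy example with three annotators rating four instants
tracks = np.array([[0.1, 0.2, 0.4, 0.3],
                   [0.0, 0.3, 0.5, 0.2],
                   [0.2, 0.1, 0.2, 0.4]]).T  # shape (4, 3)
print(ewe_gold_standard(tracks))
```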
3.4 Annotator Lag Compensation
To compensate for the reaction time of the annotators, we delay the feature vectors in time [10]. We use a delay value of 2.2 s, based on our previous grid-search analysis on the SEWA corpus [15]. In this study, we remove the repeating feature vectors at the beginning of every sample sequence that were introduced by the lag-compensating function used in AVEC 2017. We find that there is little to no difference in performance due to the removal of these erroneously repeated feature vectors. This is expected, since the number of removed features (22, for an annotator lag compensation of 2.2 s) is less than 2% of the total number of feature vectors for an average SEWA audio recording. Though it neither improves nor deteriorates the performance of the models, we note this addition to our preprocessing steps, in comparison with the AVEC 2017 workflow [15], for the sake of correctness and completeness.
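A minimal sketch of the 2.2 s lag compensation, assuming features and labels share a common frame rate of 10 frames per second (so the delay corresponds to 22 frames):

```python
def compensate_annotator_lag(features, labels, lag_s=2.2, fps=10):
    """Pair the feature frame at time t with the label at time t + lag,
    trimming the sequences instead of repeating frames at the start."""
    lag = int(round(lag_s * fps))          # 22 frames at 10 fps
    return features[:-lag], labels[lag:]
```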
3.5 Regression Models
For the new BoAW feature sets generated using the randomised codebook, we first establish baseline regression results by training support vector machine (SVM)-based regression models (SVR) using a linear kernel with complexity values C = 2^-15, 2^-14, ..., 2^0, just as was done when establishing the AVEC 2017 challenge baseline. We also experiment with additional C values in the range [10^-8, ..., 10^-5], as the GRAS2-trained arousal model was found to perform well for C ∈ [2^-15, 2^-7]. We also ran regression models using simple feedforward neural networks (FFNNs), and double-stacked and single-stacked recurrent neural networks (RNNs) with gated recurrent units (GRUs) in cascade with FFNNs. To train a GRU-based model, we used feature sequences of length 60, corresponding to 6 s. We experimented with several configurations for the network topologies (20 to 100 GRU nodes, 10- to 50-node FFNN layers), activation function permutations (selu, tanh, linear), feature lengths (60, 80), learning rates (0.001 to 0.01 in steps of 0.003), and optimisers (rmsprop, adam, adagrad, and adamax).
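As an illustration of the linear-kernel SVR grid, one could select the complexity value on a development set using the CCC [9] as the selection criterion. This is a sketch under assumptions: the exact toolchain is not specified here, and X_train, y_train, X_dev and y_dev stand in for the BoAW feature matrices and the gold-standard arousal labels.

```python
import numpy as np
from sklearn.svm import SVR

def ccc(x, y):
    """Concordance correlation coefficient [9], with population statistics."""
    mx, my = np.mean(x), np.mean(y)
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (np.var(x) + np.var(y) + (mx - my) ** 2)

def tune_linear_svr(X_train, y_train, X_dev, y_dev, log2_c_range=range(-15, 1)):
    best = (-np.inf, None, None)
    for log2_c in log2_c_range:                       # C = 2^-15 ... 2^0
        model = SVR(kernel="linear", C=2.0 ** log2_c).fit(X_train, y_train)
        score = ccc(model.predict(X_dev), y_dev)
        if score > best[0]:
            best = (score, log2_c, model)
    return best                                       # (dev CCC, log2 C, model)
```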
3.6 Post-processing
We post-process the predictions using the equation:

Y_{new} = \frac{\sigma_1}{\sigma_2}\left(Y_{orig} - \mu_2\right) + \mu_1,   (4)

where Y_orig is the primary prediction, Y_new is the post-processed prediction, and μ1, σ1, μ2, σ2 are the mean and standard deviation of the training label sequence and of the model's prediction on the training data, respectively [20].
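Eq. (4) amounts to matching the first two moments of the predictions to those of the training labels; a short sketch (array names are placeholders):

```python
import numpy as np

def match_moments(y_pred, train_labels, train_pred):
    mu1, sigma1 = np.mean(train_labels), np.std(train_labels)   # label statistics
    mu2, sigma2 = np.mean(train_pred), np.std(train_pred)       # prediction statistics
    return (sigma1 / sigma2) * (y_pred - mu2) + mu1             # Eq. (4)
```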
4
Results and Discussions
All of the models we trained (SVRs, GRU-RNNs and FFNNs) performed reasonably well as long as the test split and the training split came from the same database, with a concordance correlation coefficient (CCC) [9] close to 0.25 on average. Of these, only the SVR-based models trained on the GRAS2 arousal annotations could reasonably make predictions in the transfer learning experiments (Table 2). The models otherwise mostly fail to generalise to a different dataset, with CCC values close to zero. For the transfer learning experiments from SEWA to GRAS2, and vice versa, our key findings are as follows.
4.1 Neural Networks Tended to Overfit to the Primary Database
We observed that the neural network-based models tended to overfit to the database they were trained on. The predictions were reasonably good for the test and validation splits of the database that the training split came from. While the performance on the primary database also depends on the random initialisation of the weights and biases, the models invariably failed to make reasonable predictions on a different database (CCC close to zero).
4.2 Valence Tracking Learnings Were Not Generalisable Beyond the Database
Valence prediction is a particularly hard problem compared to arousal prediction [13,16,18]. We observed that the models could predict the valence dimension for the validation and test splits of the same database (CCC as high as 0.42), but the prediction models tended to overfit to that database. This observation held irrespective of the type of model used and the direction of transfer learning (i. e., whether SEWA to GRAS2 or GRAS2 to SEWA).
4.3 GRAS2-Trained SVR-Based Arousal Tracking Was Reasonably Generalised
Interestingly though, the SVR-based arousal prediction models trained on GRAS2 alone fared reasonably well on the SEWA database, with CCC values as high as 0.222 over the complete SEWA database – despite the SEWA database being twice the size of GRAS2. In the interest of reproducibility of the experiments presented in this paper, the complexity values and the corresponding performance values for the different models are indicated in Table 2. We note that, out of the three SEWA splits, the model performs the worst on its training data split, which is also the most diversified of the three splits (Table 1). Despite having a much smaller training set, the GRAS2-to-SEWA model transfer learning for arousal prediction worked reasonably well.
Table 2. Performance of the models in the transfer learning experiments for the arousal dimension. The models were trained only using the training split of the GRAS2 database, and were tested on the remaining data splits of GRAS2 and the entire SEWA German database. We note the performance on the individual data splits of the SEWA database, to get better understanding of the coincidental data disparities and similarities between the two databases, and how the performance varies across splits with change in the complexity values. Interestingly enough, the similar SVR-based models trained on SEWA did not perform well on GRAS2 database.

| C Value | Database | Phase      | Data split | CCC  | PCC  | RMSE |
|---------|----------|------------|------------|------|------|------|
| 10^-5   | GRAS2    | Training   | Training   | .501 | .501 | .137 |
|         |          | Validation | Validation | .363 | .370 | .144 |
|         |          | Testing    | Testing    | .280 | .320 | .152 |
|         | SEWA     | Testing    | Training   | .149 | .216 | .171 |
|         |          |            | Validation | .325 | .356 | .144 |
|         |          |            | Testing    | .197 | .230 | .132 |
|         |          |            | Entirety   | .223 | .263 | .144 |
| 2^-15   | GRAS2    | Training   | Training   | .582 | .582 | .125 |
|         |          | Validation | Validation | .382 | .386 | .140 |
|         |          | Testing    | Testing    | .266 | .303 | .149 |
|         | SEWA     | Testing    | Training   | .128 | .178 | .170 |
|         |          |            | Validation | .280 | .340 | .161 |
|         |          |            | Testing    | .188 | .250 | .144 |
|         |          |            | Entirety   | .191 | .252 | .162 |
| 2^-13   | GRAS2    | Training   | Training   | .691 | .691 | .108 |
|         |          | Validation | Validation | .350 | .353 | .144 |
|         |          | Testing    | Testing    | .241 | .256 | .143 |
|         | SEWA     | Testing    | Training   | .082 | .103 | .188 |
|         |          |            | Validation | .236 | .290 | .184 |
|         |          |            | Testing    | .155 | .191 | .160 |
|         |          |            | Entirety   | .156 | .193 | .180 |
| 2^-11   | GRAS2    | Training   | Training   | .778 | .778 | .091 |
|         |          | Validation | Validation | .331 | .341 | .152 |
|         |          | Testing    | Testing    | .228 | .235 | .144 |
|         | SEWA     | Testing    | Training   | .107 | .122 | .198 |
|         |          |            | Validation | .251 | .279 | .195 |
|         |          |            | Testing    | .171 | .191 | .169 |
|         |          |            | Entirety   | .175 | .196 | .190 |
| 2^-9    | GRAS2    | Training   | Training   | .834 | .834 | .079 |
|         |          | Validation | Validation | .248 | .265 | .170 |
|         |          | Testing    | Testing    | .180 | .183 | .145 |
|         | SEWA     | Testing    | Training   | .120 | .146 | .233 |
|         |          |            | Validation | .156 | .174 | .231 |
|         |          |            | Testing    | .208 | .239 | .186 |
|         |          |            | Entirety   | .156 | .181 | .221 |
SEWA-to-GRAS2 transfer learning, however, does not quite work (again, CCC close to zero), despite the training split having twice as much data to train the model on, with identical model parameters. We speculate that the SEWA database is not as in-the-wild as GRAS2. GRAS2 also features random background speech, bustle, impact sounds, background music, and even long non-speech sections. Emotion dimension labels exist even for these non-speech or rare-speech sections, which the model needs to learn – in itself a challenging task. This more in-the-wild nature of the data manifests itself in far more challenging training instances that help the model learn arousal prediction with more nuance.
5
Conclusions and Future Work
We present a first-of-its-kind transfer learning study on speech-based, time-continuous, in-the-wild affect recognising models. To this end, we used a novel BoAW approach that uses a data-independent randomised codebook. The GRAS2 database – featuring relatively more observer's-paradox-free affective behaviours and a lot more data diversity in terms of conversation durations, acoustic events, noise dynamics, and spontaneity of the featured affective behaviours – proved to be highly effective in training a more generalised arousal tracking model than the SEWA database, despite its smaller size. As for the valence dimension, neither of the databases was effective enough to train a well-generalised valence tracking model. Furthermore, none of our neural network-based models could predict the emotion dimensions (either arousal or valence) on a different database through transfer learning; all of these models were observed to perform well on unseen data from the databases they were trained on. The new BoAW paradigm of using data-independent randomised codebooks helps one project dissimilar databases onto a common normalised feature space while inherently capturing the temporal dynamics of the LLDs – a technique that can be further developed and fine-tuned. We intend to investigate the effect of different randomisation strategies (sampling from differently skewed, uniform, or different normal distributions), as well as of the codebook size and the number of assignments, on model performance. We would also like to extend this work by adding more in-the-wild databases. Our findings on the better generalisability of the GRAS2-trained arousal tracking model encourage us to use more such databases that are free from the observer's paradox. Unfortunately, no other observer's-paradox-free databases are publicly available today. We therefore plan to collect new data using a data collection strategy similar to the one used to build GRAS2. The next logical step is to add other prominent affect recognition databases – such as RECOLA [16]. This will culminate in an exhaustive study of affect-related databases and their effectiveness in training the most generalised, real-life, time-continuous affect recognisers.
Acknowledgments. This work was partly supported by the EU's Horizon 2020 Programme through the Innovative Action No. 645094 (SEWA), and the European Community's 7th Framework Program under Grant No. 288587 (MASELTOV).
References 1. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Proceedings of the 9th EUROSPEECH, pp. 1517– 1520 (2005) 2. Coutinho, E., Deng, J., Schuller, B.: Transfer learning emotion manifestation across music and speech. In: Proceedings of the IJCNN, Beijing, China, pp. 3592–3598. IEEE (2014) 3. Deng, J., Xia, R., Zhang, Z., Liu, Y., Schuller, B.: Introducing shared-hiddenlayer autoencoders for transfer learning and their application in acoustic emotion recognition. In: Proceedings of the 39th ICASSP, Florence, Italy, pp. 4851–4855. IEEE (2014) 4. Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: Proceedings of the 5th HUMAINE Association Conference on ACII, Geneva, Switzerland, pp. 511–516. IEEE (2013) 5. Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2016) 6. Eyben, F., Weninger, F., Paletta, L., Schuller, B.: The acoustics of eye contact detecting visual attention from conversational audio cues. In: Proceedings of the 6th Workshop on Eye Gaze in Intelligent Human Machine Interaction: Gaze in Multimodal Interaction (GAZEIN) at 15th ICMI, Sydney, Australia, pp. 7–12. ACM (2013) 7. Eyben, F., Weninger, F., Groß, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM MM 2013, Barcelona, Spain, pp. 835–838. ACM (2013). (Honorable Mention (2nd place) in the ACM MM 2013 Open-source Software Competition, acceptance rate: 28%, >200 citations) 8. Grimm, M., Kroschel, K.: Evaluation of natural emotions using self assessment manikins. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 381–385 (2005) 9. Lawrence, I., Lin, K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255–268 (1989) 10. Mariooryad, S., Busso, C.: Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In: Affective Computing and Intelligent Interaction (ACII), pp. 85–90 (2013) 11. Ng, H.W., Nguyen, V.D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. In: Proceedings of the 17th ICMI, pp. 443–449. ACM (2015) 12. Paeschke, A., Kienast, M., Sendlmeier, W.F.: F0-contours in emotional speech. In: Proceedings of the 14th International Congress of Phonetic Sciences, vol. 2, pp. 929–932 (1999) 13. Pandit, V., et al.: Tracking authentic and in-the-wild emotions using speech. In: Proceedings of the 1st ACII Asia 2018, Beijing, P. R. China. AAAC/IEEE (2018) 14. Pantic, M., Sebe, N., Cohn, J.F., Huang, T.: Affective multimodal human-computer interaction. In: Proceedings of the 13th ACM MM, Multimedia 2005, Singapore, pp. 669–676. ACM (2005)
15. Ringeval, F., et al.: AVEC 2017 - real-life depression, and affect recognition workshop and challenge. In: Ringeval, F., Valstar, M., Gratch, J., Schuller, B., Cowie, R., Pantic, M. (eds.) Proceedings of the 7th International Workshop on Audio/Visual Emotion Challenge (AVEC 2017) at 25th ACM MM, Mountain View, CA, pp. 3–9. ACM (2017). 6 p 16. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2013), Shanghai, P. R. China, pp. 1–8. IEEE (2013) 17. Schmitt, M., Schuller, B.: openXBOW - Introducing the passau open-source crossmodal bag-of-words toolkit. J. Mach. Learn. Res. 18, 3370–3374 (2017) 18. Schmitt, M., Ringeval, F., Schuller, B.: At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech. In: Proceedings of the 17th INTERSPEECH, San Francisco, CA, pp. 495–499. ISCA (2016) 19. Speer, S., Hutchby, I.: From ethics to analytics: aspects of participants’ orientations to the presence and relevance of recording devices. Sociology 37(2), 315–337 (2003) 20. Trigeorgis, G., et al.: Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In: Proceedings of the 41st ICASSP, Shanghai, P. R. China, pp. 5200–5204. IEEE (2016)
RAMAS: Russian Multimodal Corpus of Dyadic Interaction for Affective Computing Olga Perepelkina1,2(B) , Evdokia Kazimirova1 , and Maria Konstantinova1,2 1 Neurodata Lab LLC, Miami, FL, USA {o.perepelkina,e.kazimirova,m.konstantinova}@neurodatalab.com, [email protected] http://www.neurodatalab.com/en/ 2 Lomonosov Moscow State University, Moscow, Russia
Abstract. Emotion expression encompasses various types of information, including face and eye movement, voice and body motion. Emotions collected from real conversations are difficult to classify using one channel. That is why multimodal techniques have recently become more popular in automatic emotion recognition. Multimodal databases that include audio, video, 3D motion capture and physiology data are quite rare. We collected the Russian Acted Multimodal Affective Set (RAMAS) − the first multimodal corpus in the Russian language. Our database contains approximately 7 h of high-quality close-up video recordings of faces, speech, motion-capture data and such physiological signals as electrodermal activity and photoplethysmogram. The subjects were 10 actors who played out interactive dyadic scenarios. Each scenario involved one of the basic emotions: Anger, Sadness, Disgust, Happiness, Fear or Surprise, and such characteristics of social interaction as Domination and Submission. In order to note the emotions that the subjects really felt during the process, we asked them to fill in short questionnaires (self-reports) after each played scenario. The records were marked by 21 annotators (at least five annotators marked each scenario). We present our multimodal data collection, the annotation process, an inter-rater agreement analysis and a comparison between the self-reports and the received annotations. RAMAS is an open database that provides the research community with multimodal data on the interrelation of faces, speech, gestures and physiology. Such material is useful for various investigations and for the development of automatic affective systems. Keywords: Affective computing · Multimodal affect recognition · Multimodal database · Russian emotion database
1
Introduction
Emotions are difficult to classify by means of one channel, so multimodal techniques have recently become more popular in automatic emotion recognition.
There are several data corpora suitable for multimodal emotion recognition purposes. The USC CreativeIT database [20] may serve as an example here. It includes full-body motion capture, video and audio data. This database provides annotations for each fragment concerning valence, activation and dominance categories. The interactive emotional dyadic motion capture database (IEMOCAP) [9] contains audio-visual and motion data for faces and hands only, but not for the whole body. The MPI Emotional Body Expressions Database [29] was also collected by means of several channels, yet only the motion capture data are available to the research community. Finally, there are well-designed multimodal databases (e.g. the RECOLA dataset [23] and the GEMEP corpus [8]) that provide multichannel information but leave out motion data. Investigations of facial expression recognition can be classified into two parts: the detection of facial affect (human emotions) and the detection of facial muscle action (action units) [10]. Currently, a good classification accuracy for basic emotions (more than 90% [7,27]) has been achieved using face images. Speech is another essential channel for emotion recognition. Some emotions, such as sadness and fear, can be distinguished from an audio stream even better than from video [11]. The average recognition level across different studies varies from 45% to 90% and depends on the number of categories, the classifier type and the dataset type [14]. Physiological signals such as cardiovascular, respiratory and electrodermal measures are also successfully applied in emotion recognition. In several studies of biosignal-based affect classification, the recognition rate is more than 80% [6,15]. The analysis of body motion data for emotion recognition has become common only recently. Movement data give recognition rates comparable to facial expressions or speech in multimodal scenarios and also improve overall accuracy in multimodal systems when combined with other modalities [17]. In general, recognition based on several modalities gives better results than one-channel recognition [12,22]. Thus, a vast amount of research has been conducted, but the problem of emotion recognition remains challenging. As existing datasets have some limitations, we decided to collect a multimodal database in Russian with multiple channels, including motion and physiology data along with audio-visual data. The Russian Acted Multimodal Affective Set (RAMAS) expands and complements existing datasets for the needs of affective computing.
2
RAMAS Dataset
2.1 Dataset Collection
The Russian Acted Multimodal Affective Set (RAMAS) consists of multimodal recordings of improvised affective dyadic interactions in Russian (see example in “Fig. 1”). It was obtained in 2016–2017 by Neurodata Lab LLC [3]. Ten semi-professional actors (5 men and 5 women, 18–28 years old, native Russian speakers) participated in the data collection. Semi-professional actors are more
Fig. 1. Examples of RAMAS database. A. Screenshot from a close-up video. B. Screenshot from a full-scene video with depicted skeleton data.
suitable than professional actors for analyzing movements in emotional states, as professional theater actors may use stereotypical motion patterns [24,29]. The actors were given scenarios with descriptions of different situations, but no exact lines. They were encouraged to gesticulate and move, but they had to stay in a certain, specially marked part of the room to achieve stable close-up footage of their faces. In order to perform refined motion tracking, all the actors were dressed in tight black clothes. First of all, the participants contributed a sample of their neutral emotional state. Then they improvised for 30 to 60 seconds on each of the 13 topics. Interactions were conducted in mixed-gender dyads in which the participants were assigned to be either friends or colleagues. The scenarios implied the presence of one of the six basic emotions (Anger, Sadness, Disgust, Happiness, Fear, and Surprise) or the neutral state in each dialogue. In each scenario, one actor was instructed to play a dominant role and the other a submissive role, and these roles were balanced across scenarios. English translations of the scenarios can be found at [4]. Each scenario was played out from two to five times to achieve the best quality and the highest variety in emotional and behavioral expression. The roles were assigned to the actors in such a way that all states were evenly distributed between men and women. We also asked the subjects to fill in short questionnaires (self-reports) after each played scenario in order to note the emotions they really felt during the process. The actors received fixed payments for the production days and consented to the usage of all of the recordings for scientific and commercial purposes.
2.2 Apparatus and Recording Setup
Audio was recorded with Sennheiser EW 112-p G3 portable wireless microphone system (wav format, 32-bit, 48000 Hz). Microphones were placed on the participants’ necklines. General acoustic scene was obtained by stereo Zoom H5 recorder. Microsoft Kinect RGB-D sensor v. 2 [2] was used to gather 3D skeleton data, RGB and depth videos (of both actors simultaneously). A green background was used to ensure the motion capture and video quality enhancement. Close-up videos of each participant were recorded by means of two cameras (Canon HF G40 and Panasonic HC-V760). Photoplethysmogram (PPG) and
electrodermal activity (EDA) were registered with the Consensys GSR Development Kit [1]. As a result, the RAMAS database comprises approximately 7 h of synchronized multimodal information, including audio, 3D skeleton data, RGB and depth video, EDA, and PPG. We used the SSI software to synchronize the streams [30].
3
Post-processing and Annotation
3.1 Annotation
We asked 21 annotators (18–33 years old, 71% women) to evaluate the emotions in the received video-audio fragments. Each annotator worked with 150 video fragments (except for two annotators who had 150 fragments in sum), and at least five annotators marked each fragment. We asked all the applicants to take an emotional intelligence test [19,25], and only those who scored average or above average were picked to annotate the material. The Elan tool [26] from the Max Planck Institute for Psycholinguistics (the Netherlands) was used for emotion annotation. Oral and written instructions as well as templates of all the emotional states were provided. The task was to mark the beginning and the end of each emotion that seemed natural. The work of the annotators was monitored and paid for.
3.2 Inter-Annotator Agreement
We used Krippendorff's alpha statistic [18] to estimate the amount of inter-rater agreement in the RAMAS database. Alpha defines the two reliability scale points as 1.0 for perfect reliability and 0.0 for the absence of reliability, with alpha < 0 when disagreements are systematic and exceed what can be expected by chance. We chose Krippendorff's alpha because it is applicable to any number of coders, allows for reliability data with missing categories or scale points, and measures agreement for nominal, ordinal, interval, and ratio data [16]. Alpha (α) is calculated as follows:

\alpha = 1 - \frac{D_0}{D_e},   (1)
where D_0 is the disagreement observed and D_e is the disagreement expected by chance. The results of annotators who labelled the same video fragments (150 in most cases) were grouped, and alpha was computed for each group for each of the 9 emotional, social and neutral scales. Elan allows annotations of variable length (i.e. the annotator determines the starting and ending point of each emotion). Consequently, to compute reliability with Krippendorff's alpha, we split all annotations into one-second intervals. The average alpha for the RAMAS dataset is 0.44. The mean and median statistics for each scale are presented in Table 1.
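For reference, nominal Krippendorff's alpha over such one-second interval labels can be computed as in the sketch below; the coincidence-matrix formulation follows the general definition in [16,18], and the tiny example data are invented.

```python
import numpy as np
from itertools import permutations

def krippendorff_alpha_nominal(data):
    """data: list of units (one-second intervals), each a list of labels per
    coder, with None for missing ratings."""
    values = sorted({v for unit in data for v in unit if v is not None})
    index = {v: i for i, v in enumerate(values)}
    o = np.zeros((len(values), len(values)))            # coincidence matrix
    for unit in data:
        present = [index[v] for v in unit if v is not None]
        m = len(present)
        if m < 2:
            continue
        for a, b in permutations(present, 2):            # ordered value pairs
            o[a, b] += 1.0 / (m - 1)
    n_c = o.sum(axis=1)
    n = n_c.sum()
    d_obs = (o.sum() - np.trace(o)) / n                  # observed disagreement
    d_exp = (n ** 2 - (n_c ** 2).sum()) / (n * (n - 1))  # expected disagreement
    return 1.0 - d_obs / d_exp

units = [["Fear", "Fear", None], ["Joy", "Fear", "Joy"], ["Joy", "Joy", "Joy"]]
print(round(krippendorff_alpha_nominal(units), 2))       # ~0.53 for this toy data
```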
Table 1. Mean and median Krippendorff's alpha for each scale

| Scale      | Mean alpha | Median alpha |
|------------|------------|--------------|
| Disgust    | 0.54       | 0.66         |
| Happiness  | 0.58       | 0.6          |
| Anger      | 0.5        | 0.56         |
| Fear       | 0.47       | 0.48         |
| Domination | 0.45       | 0.46         |
| Submission | 0.46       | 0.44         |
| Surprise   | 0.41       | 0.38         |
| Sadness    | 0.35       | 0.31         |
| Neutral    | 0.22       | 0.07         |
3.3 Self-report Analysis
The real emotions of the actors were collected by means of self-reports given after the last take of each played scenario. The actors were asked to fill in short questionnaires and evaluate their state during the scenario (Angry, Sad, Disgusted, Happy, Scared, and Surprised) on a 5-point Likert scale (1 = did not experience the emotion, 5 = experienced it a lot). They also answered a question about the complexity of the scenario played (1 = very easy, 5 = very difficult). We analyzed the self-reports trying to answer the following questions: (1) Are there any differences between the emotions that the actors experienced across all the scenarios? (2) Are there any differences in the complexity of the scenarios? If yes, which kinds of emotions (types of scenarios) were more difficult to play? (3) What are the relations between played and experienced emotions? Did the actors really feel the same emotions they played? Question #1: The differences between experienced emotions. First, we tested the diversity of emotions each actor experienced during the experiment. We wanted to find out which emotions they experienced more often, regardless of the type of emotion in the scenario. The answers from the self-reports were compared with pairwise Wilcoxon rank sum tests. There were no significant differences between the experienced emotions. That means the actors experienced a balanced overall amount of all the emotions across all sessions. Since the number of emotions in the scenarios was balanced, this also corresponds to the experienced emotions. Question #2: Complexity of the scenarios. We analyzed which kinds of scenarios were more difficult to play according to the actors' self-reports. For this purpose, a logistic regression model was constructed, with the complexity evaluations as the dependent variable and the type of scenario as the predictor, and the comparisons of interest were tested using contrasts (lsmeans). The disgust scenarios were more difficult than the anger (p = 0.002, z = −3.773) and happiness (p = 0.018, z = 3.191) scenarios, and the fear scenarios were more difficult than the anger scenarios (p = 0.014, z = −3.264).
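A hedged sketch of the Question #1 comparison is given below; the `reports` dictionary is synthetic stand-in data, and the Bonferroni correction is an assumption of the sketch (the study does not state which correction, if any, was applied).

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu   # Wilcoxon rank-sum test

emotions = ["Anger", "Sadness", "Disgust", "Happiness", "Fear", "Surprise"]
rng = np.random.default_rng(1)
reports = {e: rng.integers(1, 6, size=40) for e in emotions}  # synthetic 1-5 scores

pairs = list(combinations(emotions, 2))                       # 15 pairwise tests
for a, b in pairs:
    stat, p = mannwhitneyu(reports[a], reports[b], alternative="two-sided")
    p_adj = min(1.0, p * len(pairs))                          # Bonferroni correction
    print(f"{a} vs {b}: U = {stat:.1f}, corrected p = {p_adj:.3f}")
```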
Then we studied the effect the domination–submission type had on the evaluated complexity of the scenario. A logistic regression model and contrasts (lsmeans) revealed that there were no differences between these types of scenarios. Question #3: Relations between played and experienced emotions. First, we studied how the intensity of the experienced emotions depended on individual differences between the actors. We computed the sum of all emotion scores in each answer (the intensity of emotions, ranging from 6 to 30 per answer, M = 10.5, SD = 2.6 over all answers). Then a generalized linear model was created with the intensity as the response and the actors as predictors, and tested with a type-II ANOVA. The model was significant (p < 0.001), which means that the intensity of the experienced emotions varied from actor to actor. Since the emotion intensity depended on the actor, we normalized the evaluations for further analysis: we divided the score of each emotion by the sum of all emotion scores in the answer. The normalized scores of the experienced emotions were compared to the emotions that the actors had to play according to the scenarios. We analyzed the relation between played and experienced emotions using proportional odds logistic regression, with the normalized questionnaire score as the response and a logical variable (reflecting whether the played emotion matched the felt emotion) as the predictor. A type-II ANOVA revealed that this predictor was significant, that is: the actors tended to evaluate the emotion they had just played as their most intense feeling. In other words, the actors reported experiencing the same emotions they had played out in the scenario (see Fig. 2).
Fig. 2. Played and experienced emotions. False – the emotion in the scenario did not match the emotion in the self-report, true – they coincided. Plot with mean values and bootstrap estimated confidence intervals (CI).
The properties of the database are summarized in Table 3. The RAMAS database is a novel contribution to the affective computing area, since it contains multimodal recordings of the six basic emotions and two social categories in the Russian language.
4
Discussion
We collected the first emotional multimodal database in the Russian language. Semi-professional actors played out prepared scenarios and expressed basic emotions (Anger, Sadness, Disgust, Happiness, Fear and Surprise), as well as two social interaction characteristics – Domination and Submission. Audio, close-up and whole-scene videos, motion capture and physiology data were collected. Twenty-one annotators marked the emotions in the received videos; at least five annotators marked each video. The analysis of the annotations revealed that the RAMAS database has moderate inter-rater agreement (Krippendorff's alpha = 0.44). Among all the scales except the neutral condition, the smallest inter-rater agreement was observed for sadness (0.35), while the largest agreement was observed for happiness (0.58). After playing out each scenario, all the actors answered several short questions about the emotions they experienced. The analysis of these answers revealed that the actors experienced a balanced overall amount of all emotions across all sessions (Table 2).
Table 2. Scripts, videos and scenarios in the RAMAS database

| Expression                        | Written scripts | Number of videos | Length, minutes |
|-----------------------------------|-----------------|------------------|-----------------|
| Emotions                          |                 |                  |                 |
| Disgust                           | 4               | 80               | 56.6            |
| Happiness                         | 4               | 64               | 42.2            |
| Anger                             | 5               | 62               | 45.5            |
| Fear                              | 5               | 94               | 65.8            |
| Surprise                          | 5               | 70               | 44.6            |
| Sadness                           | 5               | 84               | 63.8            |
| Neutral (speaker + listener)      | 6 + 6           | 63 + 64          | 37.8 + 38.6     |
| Social scales                     |                 |                  |                 |
| Domination (emotional + neutral)  | 14 + 6          | 227 + 63         | 158.8 + 37.8    |
| Submission (emotional + neutral)  | 14 + 6          | 227 + 64         | 158.8 + 38.6    |
The analysis of the question about the complexity of the played emotion revealed that the actors had more difficulties with the disgust scenarios compared to the anger and happiness scenarios, and that the fear scenarios were more difficult for them than the anger scenarios. There were no differences between the complexity of playing the dominative vs. the submissive types of the scenarios.
Table 3. Basic properties of the RAMAS database

| Number of videos                             | 581                  |
| General length of database, minutes/hours    | 395/6.6              |
| Number of actors                             | 10 (5 men, 5 women)  |
| Age of actors                                | 18–28 years          |
| Language of videos                           | Russian              |
| Video length, seconds (min/max/average)      | 9/96/41              |
| Number of annotators                         | 21 (6 men, 15 women) |
| Inter-rater agreement (Krippendorff's alpha) | 0.44                 |
5
Conclusion
RAMAS is a unique play-acted multimodal corpus in the Russian language. The database is open and provides the research community with multimodal data covering the face, speech, gesture and physiology modalities. Such material is useful for various investigations and for the development of automatic affective systems. It is also applicable to psychological, psychophysiological and linguistic studies. Access to the database is available at [5]. Acknowledgements. Supported by Neurodata Lab LLC. The authors would like to thank Elena Arkova for finding the actors and helping with the scenarios and the experimental procedure, and Irina Vetrova for evaluating the emotional intelligence of the annotators with the MSCEIT v2.0 test.
References 1. Consensys GSR development kit. http://www.shimmersensing.com/products/gsroptical-pulse-development-kit 2. Kinect v. 2. http://www.microsoft.com/en-us/kinectforwindows 3. Neurodata Lab LLC. http://www.neurodatalab.com 4. RAMAS scripts. http://neurodatalab.com/upload/technologies files/scenarios/ Scripts RAMAS.pdf 5. RAMAS database (2016). http://neurodatalab.com/en/projects/RAMAS
6. Anderson, A., Hsiao, T., Metsis, V.: Classification of emotional arousal during multimedia exposure. In: Proceedings of the 10th International Conference on Pervasive Technologies Related to Assistive Environments, pp. 181–184. ACM (2017) 7. Ayvaz, U., G¨ ur¨ uler, H., Devrim, M.O.: Use of facial emotion recognition in elearning systems. Inf. Technol. Learn. Tools 60(4), 95–104 (2017) 8. B¨ anziger, T., Pirker, H., Scherer, K.: GEMEP-GEneva multimodal emotion portrayals: a corpus for the study of multimodal emotional expressions. In: Proceedings of LREC, vol. 6, pp. 15–19 (2006) 9. Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335 (2008) 10. Chaw, T.V., Khor, S.W., Lau, P.Y.: Facial expression recognition using correlation of eyes regions. In: The FICT Colloquium 2016, p. 34, December 2016 11. De Silva, L.C., Miyasato, T., Nakatsu, R.: Facial emotion recognition using multimodal information. In: Proceedings of 1997 International Conference on Information, Communications and Signal Processing, ICICS 1997, vol. 1, pp. 397–401. IEEE (1997) 12. D’mello, S.K., Kory, J.: A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. 47(3), 43:1–43:36 (2015). http://doi.acm.org/10.1145/2682899 13. Douglas, M.: Purity and danger: an analysis of pollution and taboo London (1966) 14. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011) 15. Gouizi, K., Bereksi Reguig, F., Maaoui, C.: Emotion recognition from physiological signals. J. Med. Eng. Technol. 35(6–7), 300–307 (2011) 16. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1(1), 77–89 (2007) 17. Karg, M., Samadani, A.A., Gorbet, R., K¨ uhnlenz, K., Hoey, J., Kuli´c, D.: Body movements for affective expression: a survey of automatic recognition and generation. IEEE Trans. Affect. Comput. 4(4), 341–359 (2013) 18. Krippendorff, K.: Estimating the reliability, systematic error and random error of interval data. Educ. Psychol. Meas. 30(1), 61–70 (1970) 19. Mayer, J.D., Salovey, P., Caruso, D.R., Sitarenios, G.: Measuring emotional intelligence with the MSCEIT V2. 0. Emotion 3(1), 97 (2003) 20. Metallinou, A., Lee, C.C., Busso, C., Carnicke, S., Narayanan, S.: The USC creativeIT database: a multimodal database of theatrical improvisation. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, p. 55 (2010) 21. Rachman, S.: Anxiety. Psychology Press Ltd., Publishers, East Sussex (1998) 22. Ranganathan, H., Chakraborty, S., Panchanathan, S.: Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9, March 2016. https://doi.org/ 10.1109/WACV.2016.7477679 23. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013) 24. Russell, J.A., Fern´ andez-Dols, J.M.: The Psychology of Facial Expression. Cambridge University Press, Cambridge (1997)
25. Sergienko, E.G., Vetrova, I.I., Volochkov, A.A., Popov, A.Y.: Adaptation of J. Mayer P. Salovey and D. Caruso emotional intelligence test on russian-speaking sample. Psikhologicheskii Zhurnal 31(1), 55–73 (2010) 26. Sloetjes, H., Wittenburg, P.: Annotation by category: ELAN and ISO DCR. In: LREC (2008) 27. Tarnowski, P., Kolodziej, M., Majkowski, A., Rak, R.J.: Emotion recognition using facial expressions. Procedia Comput. Sci. 108, 1175–1184 (2017) 28. Tomkins, S.: Affect Imagery Consciousness: Volume II: The Negative Affects. Springer, New York (1963) 29. Volkova, E., De La Rosa, S., B¨ ulthoff, H.H., Mohler, B.: The MPI emotional body expressions database for narrative scenarios. PloS one 9(12), e113647 (2014) 30. Wagner, J., Lingenfelser, F., Baur, T., Damian, I., Kistler, F., Andr´e, E.: The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 831–834. ACM (2013)
Investigating Word Segmentation Techniques for German Using Finite-State Transducers
Gábor Pintér, Mira Schielke, and Rico Petrick
Linguwerk GmbH, 01069 Dresden, Germany
{gabor.pinter,mira.schielke,rico.petrick}@linguwerk.com
Abstract. Word segmentation plays an important role in speech recognition as a text pre-processing step that helps decrease out-of-vocabulary items and lower language model perplexity. Segmentation is applied mainly to agglutinative languages, but other morphologically rich languages, such as German, can also benefit from this technique. Using a relatively small, manually collected broadcast corpus of 134k tokens, the current study investigates how Finite-State Transducers (FSTs) can be applied to perform word segmentation in German. It is shown that FSTs incorporating word-formation rules can reach high segmentation performance, with 0.97 precision and a 0.93 recall rate. It is also shown that FSTs incorporating n-gram models of manually segmented data can reach even higher performance, with precision and recall rates of 0.97. This result is remarkable considering that the bottom-up approach performs on par with the expert system without requiring explicit knowledge about morphological categories or word-formation rules.
Keywords: Word segmentation · Text processing · Morphology
1 Introduction
Agglutinative languages, such as Turkish or Japanese, are commonly reported to be challenging for automatic speech recognition (ASR), partly because the vocabulary of these languages cannot be effectively accounted for by a simple enumeration of words. Listing is impractical due to the highly productive derivational and inflectional morphology. This morphological characteristic may pose problems for speech recognition, as the large number of word types can lead to high out-of-vocabulary rates in the pronunciation lexicon and high perplexities in language models. German is not an agglutinative language, but its relatively complex inflectional system and its productive compounding characteristics raise problems similar to those of agglutinative languages [4–6,13,16]. For example, German adjective modifiers can have different endings according to the gender, number and case of the nouns they modify (e.g., das kalt-e Bier 'cold beer [nom]', dem kalt-en Bier 'cold beer [dat]', kalt-es Wasser 'cold water [acc]'). Calculating
only with eleven forms per adjective (null, -e -en -em -er -es -ste -sten -stem -ster -stes) would require adding each adjective eleven times to the lexicon. Compounding can also considerably increase the vocabulary size in German, as spelling conventions require most compounds to be written as single words, without any hints of morpheme boundaries. This practice results in such super-long word formations as the infamous Donaudampfschifffahrtsgesellschaft 'Danube Steamboat Shipping Company'.
An apparent solution to these problems is to split up word forms, that is, to introduce some kind of word segmentation step that provides input for lexicon and language model related tasks. There are several word segmentation techniques and tools available, ranging from morphological analyzers to completely data-driven, unsupervised segmentation techniques. General-purpose morphological analyzers, such as Tagh [7] for German or ChaSen [11] for Japanese, provide full-fledged, linguistically accurate morphological analysis, in which morpheme boundaries can be used as splitting points. Although it is not uncommon for studies to implement custom, morphology-based segmentation tools [2,14], the costs associated with the development and maintenance of general morphological analyzers are prohibitive. Languages without appropriate morphological analyzers can be processed by self- or semi-supervised, data-driven algorithms that identify sub-word units automatically, without relying on morphological information. Besides some sporadic, heuristically formulated attempts [10,17], Morfessor [3] has to be highlighted as an established data-driven segmentation tool, frequently occurring in studies concerned with sub-word models of speech recognition [18–20]. While data-driven tools are extremely convenient and their performance tends to improve with more data, they can produce unexpected errors, and their behavior is difficult to control.
The current study aims to briefly overview how Finite-State Transducers (FSTs) can be used for word segmentation, and to provide a simple performance measure for the techniques introduced—using German data. FSTs can function as a convenient mechanism to segment words, and are often used in morphological analyzers. But FSTs can also operate using bottom-up information, for example in the form of n-gram models. This study introduces and compares two top-down and a range of bottom-up FST models for word segmentation. As preliminary experiments show, more morphological knowledge leads to better segmentation performance, but self-supervised approaches—with no morphological knowledge—can perform on par with expert systems.
2 Word Segmentation with Transducers
2.1 Morphological Analysis as Segmentation
The simplest word segmentation transducer can be constructed similarly to a two-level morphological parser [8], except that instead of underlying morphemes and features the output contains only the split input. Input and output labels share the same set of characters with extra segment boundary symbols (e.g. ‘+’) on the output side. The segmentation transducer is defined as a closure over all
acceptable segments. Figure 1 demonstrates a sample transducer that splits up the input compound zeitraum ‘time period’ into its components: zeit+raum.1
Fig. 1. A sample word segmenter FST.
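The closure idea behind Fig. 1 can be emulated outside the FST framework. The following plain-Python sketch is not the authors' implementation; the tiny lexicon is invented purely for illustration. It enumerates every way of covering a word with items from a segment set, which is exactly what the closure over acceptable segments does.

```python
# Plain-Python sketch of a segmenter defined as a closure over segments.
# The lexicon below is illustrative only.

def segmentations(word, lexicon):
    """Enumerate every way to cover `word` with items from `lexicon`."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results

lexicon = {"zeit", "raum", "zeitraum"}
print(segmentations("zeitraum", lexicon))
# [['zeit', 'raum'], ['zeitraum']] -- any concatenation of lexicon items is
# accepted, which is the over-generation problem discussed next.
```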
A transducer with a simple closure over all lexical items, however, is not an effective segmenter, because it accepts any sequence of segments in any order. For example, the transducer above also accepts nonsense words like zeitzeit or raumraumraum. This problem of over-generation can be addressed by incorporating word-formation constraints into the transducer. A widely utilized technique is to formulate constraints over morphological categories, such as prefixes and suffixes. Figure 2 displays a transducer that accepts prefixes only at the beginning of words and suffixes only at the end. For example, the prefix ab- attaches only to the left side of verbs, the suffix -ung only to their right side (e.g. ab+schaff+ung).
Fig. 2. Segmenter FST with morphological knowledge about prefixes and suffixes.
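The positional constraint of Fig. 2 can likewise be sketched without an FST library. The category lists below are assumptions made only for the example; the point is the prefix/stem/suffix ordering, not the actual German affix inventory.

```python
# Sketch of the Fig. 2 constraint: prefixes may only open a word, suffixes may
# only close it, and at least one stem must occur in between.

PREFIXES = {"ab", "un"}
STEMS = {"schaff", "zeit", "raum"}
SUFFIXES = {"ung", "en"}

def parses(word, state="start"):
    """Yield segmentations that respect the prefix/stem/suffix ordering."""
    if word == "":
        if state in ("stem", "suffix"):      # a bare prefix is not a word
            yield []
        return
    for i in range(1, len(word) + 1):
        seg, rest = word[:i], word[i:]
        if state == "start" and seg in PREFIXES:
            for tail in parses(rest, "start"):
                yield [seg] + tail
        if state in ("start", "stem") and seg in STEMS:
            for tail in parses(rest, "stem"):
                yield [seg] + tail
        if state in ("stem", "suffix") and seg in SUFFIXES:
            for tail in parses(rest, "suffix"):
                yield [seg] + tail

print(list(parses("abschaffung")))   # [['ab', 'schaff', 'ung']]
print(list(parses("ungab")))         # [] -- a suffix cannot open a word
```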
This naive prefix/suffix-only approach also leaves plenty of room for over-generation. The system can be greatly improved by incorporating more fine-grained word-formation rules, taking affix types, part-of-speech categories and other subcategorization features into consideration. Figure 3 represents a more sophisticated attempt at a morphological approach using various derivational and inflectional suffix types. Although discovering and implementing word-formation rules is a tedious task, it can lead to remarkable segmentation performance, as demonstrated by German morphological analyzers such as Tagh [7].
1 Words are lowercased for clarity. In German, nouns start with capital letters, so the segmentation would more correctly be: Zeitraum → Zeit+Raum.
Fig. 3. Excerpt of a segmenter FST with expert morphological knowledge.
2.2 Supervised Word Segmentation with N-Grams
While morphological analyzers are obvious choices for segmenting words, the analysis they provide is not necessarily optimal for further processing. For instance, word stems combined with morphological features, instead of the written forms, do not provide optimal input for grapheme-to-phoneme algorithms (e.g. wirfst → werfen<V><2><Sg>). Also, too short morphemes can be sub-optimal for speech recognition tasks. These, along with similar constraints, can easily result in disagreements with the morphological analysis. Data-driven segmentation techniques can remedy this problem by providing a means of learning arbitrary segmentation patterns. One way of doing this is by training n-gram models on segmented data. The idea of using n-gram-based segmentation as a text pre-processing step is an established method for Asian languages [9], but it has also been applied to German. Incorporation of n-gram models into segmentation FSTs is not a complicated task: FST-based language models are commonly used in various speech and language processing tasks [12,15]. A notable problem concerning the combination of segmentation and n-gram FSTs is that FST-based segmenters typically operate on characters, while n-gram models are defined over words or morphemes. This mismatch can be easily remedied by rewriting character sequences to morpheme labels in segmenters, as demonstrated in Fig. 4. As a first approximation, n-gram information can be integrated into the segmentation process in two steps. First, a lattice of possible segmentations is created; second, this lattice is re-scored with an n-gram FST. The two-step approach, however, is slow and cumbersome. A more elegant approach is to merge the segmenter and the n-gram transducers into one FST. The merged—or composed—FST preserves the overall structure and weights of the n-gram
Fig. 4. Word segmenter from Fig. 1 with morpheme-level output labels.
Fig. 5. Fragment of a transducer n-gram model with arcs for an, ab and abend. Epsilon output labels are omitted for clarity.
Fig. 6. Determinized and weight-pushed version of transducer in Fig. 5.
transducer. Figure 5 displays a fragment of an n-gram FST whose input morpheme arcs were replaced by characters. Making the resulting transducer deterministic and sorting it by input label are useful optimization steps as they help reduce model size and enable faster search of arcs. Figure 6 represents an optimized version of the FST of Fig. 5. A disadvantage of these models is that they require custom search and composition algorithms, as their treatment of back-off and epsilon arcs is different from standard FST-based n-gram models.
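The effect of composing the segmenter with an n-gram model can be imitated in plain Python by scoring candidate segmentations with a morpheme-level bigram. The probabilities below are invented for illustration; in the experiments they come from n-gram models trained on the segmented folds.

```python
# Enumerate candidate segmentations and pick the one with the best bigram
# log-probability; a stand-in for segmenter/n-gram FST composition.
import math

BIGRAM_LOGP = {                      # log P(next | previous), toy values
    ("<s>", "zeit"): math.log(0.02),
    ("zeit", "raum"): math.log(0.10),
    ("raum", "</s>"): math.log(0.30),
    ("<s>", "zeitraum"): math.log(0.001),
    ("zeitraum", "</s>"): math.log(0.30),
}
FLOOR = math.log(1e-8)               # crude back-off for unseen bigrams

def score(segments):
    seq = ["<s>"] + segments + ["</s>"]
    return sum(BIGRAM_LOGP.get(pair, FLOOR) for pair in zip(seq, seq[1:]))

candidates = [["zeit", "raum"], ["zeitraum"]]
print(max(candidates, key=score))    # ['zeit', 'raum'] under these toy scores
```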
3 Experiment
A series of experiments was conducted to compare the performance of top-down with bottom-up approaches to FST-based segmentation. The top-down approach was represented by two FST models that implemented different amounts—Naive
and Expert levels—of morphological knowledge. The bottom-up approach was associated with transducers that were based on n-gram models. A relatively small (134k) broadcast news corpus was used in a 10-fold cross-validation setup to evaluate segmentation performance. The folds were analyzed for perplexity and OOV rates as well as precision, recall and f-measure. In preparation of those calculations, n-gram models with Katz smoothing were trained using 9 folds out of 10. The quality measures were calculated against the retained folds.
3.1 Corpus Data and Segmentation
As there is no standardized way to segment German text, there is also no standardized segmented corpus available. For development and testing purposes, German news broadcast text was collected from the Deutsche Welle news portal www.dw.de between early 2017 and early 2018. The texts, extracted from 207 news reports, were manually normalized and segmented. After normalization, each file contained on average 646.7 tokens. Segmentation involved only the splitting up of words; no morphological categories or features were added. Some examples from the corpus are: Woche-n-arbeit-s-zeit 'hours worked per week', Zahl-reich-e Häuser sind zer-stör-t 'several houses are destroyed'. Admittedly, this manual segmentation diverged from traditional morphological analyses. For example, in order to keep the lexical model simple, words were kept together if segmentation would have produced an alternative pronunciation, such as with Häuser *→ Haus+er.
3.2 Perplexity and OOV Rates
The corpus had a relatively small size of circa 134k tokens after text normalization. Segmentation increased the token count to 198k, while it almost halved the type count. As expected, the segmented corpus had a lower perplexity of 14.1, compared to 21.4 for the original text (Table 1). Perplexity values were calculated using 3-gram language models with Katz smoothing. As shown in Fig. 7, word segmentation achieved a considerable decrease in perplexity on unseen data: from 219.98 to 79.69 on average. OOV type and token ratios were also calculated for the unseen folds. The weighted average of OOV tokens was 7.47%, which dropped to 1.89% after segmentation. A similar decrease was observed with types: from 20.88% to 9.30% on average. Values for each fold are shown in Fig. 8.

Table 1. Text-normalized and segmented news broadcast data.
Unit       Token counts         Type counts         Perplexity
word       133,664 (100.00%)    18,131 (100.00%)    21.377
morpheme   197,536 (147.79%)    9,934 (54.79%)      14.057
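The OOV and perplexity figures above follow the usual definitions; the minimal sketch below (with placeholder token lists and log-probabilities, not the actual SRILM/OpenGrm output) shows how such values are obtained.

```python
# OOV ratios and perplexity from token lists and per-token log-probabilities.
import math

def oov_rates(train_tokens, test_tokens):
    vocab = set(train_tokens)
    unseen = [t for t in test_tokens if t not in vocab]
    token_oov = len(unseen) / len(test_tokens)
    type_oov = len(set(unseen)) / len(set(test_tokens))
    return token_oov, type_oov

def perplexity(logprobs):
    """logprobs: natural-log probabilities assigned to each test token."""
    return math.exp(-sum(logprobs) / len(logprobs))

print(oov_rates(["zeit", "raum", "zeit"], ["zeit", "haus", "haus"]))
# (0.666..., 0.5): 2 of 3 tokens and 1 of 2 types are out of vocabulary
print(perplexity([math.log(1 / 8)] * 4))   # 8.0 for a uniform 8-word model
```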
Fig. 7. Perplexity values in unseen folds with segmented and unsegmented text.
Fig. 8. Out-of-vocabulary ratios for tokens (left) and for types (right) in unseen folds.
3.3 Segmentation Models
A series of FST-based word segmenters was created following the concepts outlined in Sect. 2. A Naive model was created with an FST structure relying only on three morpheme categories: prefixes, suffixes and stems (cf. Fig. 2). Weights were set to a constant value for all segments to prefer longer chunks. The Expert model implemented a thorough, but non-exhaustive set of morphological rules (see Fig. 3). The weights were defined manually, based on experimentation. Both Naive and Expert models used around 80% of the corpus as a development set. Other sources, such as affix dictionaries and word lists were also used to define morpheme classes, transducer structure and weights.
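The Naive model's constant weight per segment amounts to preferring the parse with the fewest (i.e. longest) chunks. A small dynamic program reproduces this behaviour; the lexicon and cost are illustrative, not the actual Naive model.

```python
# Cheapest segmentation when every segment carries the same constant cost.

def cheapest_split(word, lexicon, cost_per_segment=1.0):
    INF = float("inf")
    best = [INF] * (len(word) + 1)   # best[i] = cost of splitting word[:i]
    back = [None] * (len(word) + 1)
    best[0] = 0.0
    for j in range(1, len(word) + 1):
        for i in range(j):
            if word[i:j] in lexicon and best[i] + cost_per_segment < best[j]:
                best[j] = best[i] + cost_per_segment
                back[j] = i
    if best[-1] == INF:
        return None                  # not parsable: keep the word unsplit
    segments, j = [], len(word)
    while j > 0:
        segments.append(word[back[j]:j])
        j = back[j]
    return segments[::-1]

lexicon = {"zeit", "raum", "zeitraum", "arbeit", "s"}   # "s" as linking element
print(cheapest_split("zeitraum", lexicon))      # ['zeitraum'] (one segment wins)
print(cheapest_split("arbeitszeit", lexicon))   # ['arbeit', 's', 'zeit']
```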
In addition to the two top-down approaches, five data-driven models were created using 1- to 5-gram language models with Katz smoothing. In preparation of these models, transducer-based n-gram models were first trained using the normalized and segmented text of the training folds. Next, word and morpheme labels in the n-gram transducer were replaced by character sequences on the input side (cf. Fig. 4). Finally, the transducers were determinized, minimized, and the weights were pushed forward for faster performance (cf. Fig. 6). A special, non-epsilon symbol was used as the back-off arc label. All transducers and necessary tools were developed using OpenFst [1] and OpenGrm [15].
3.4 Results
Recall, precision and f-measure values were calculated to evaluate segmentation performance. The unseen data folds from the cross-validation setup were used as test sets for the n-gram models. For the Naive and Expert models, the separation of seen and unseen data was not consistent, as parts of the corpus were used—besides other sources—to manually discover morphological generalizations. For easier comparison, the same "unseen" folds were used for all segmentation models. Table 2 summarizes the means of the performance metrics over the test sets. A visual presentation of precision and recall values with medians is given in Fig. 9.

Table 2. Segmentation performance: mean values over "unseen" folds.
            Naive    Expert   1-gram   2-gram   3-gram   4-gram   5-gram
Recall      0.9140   0.9324   0.9410   0.9684   0.9686   0.9686   0.9686
Precision   0.8739   0.9656   0.9468   0.9673   0.9661   0.9657   0.9657
f-Measure   0.8935   0.9487   0.9438   0.9678   0.9673   0.9671   0.9671

Fig. 9. Segmentation performance: recall (left) and precision (right).

3.5 Discussion
In terms of f-measures, the best segmentation performance was achieved by the 2-gram model. This result, however, is not significantly different from other
higher order n-gram models. Unquestionably the Naive approach had the worst performance among the compared models. This result is not surprising given its over-simplified morphological model. Incorporating more sophisticated morphological knowledge proved to be useful as demonstrated by the performance improvements of the Expert model. Of course the question is if such expert systems are worth developing as n-gram models without morphological knowledge can deliver similar performance. A closer look at the errors may influence the interpretation of the seemingly outstanding results. Almost half of OOV words in the unseen folds were named entities in non-affixed forms. These unsplit OOV items did not contribute to the evaluation as non-parsable input words were treated as single units.2 Thus neither the reference nor the hypotheses had morpheme boundaries. Provided that words used for training are segmented correctly, the seen data together with non-splittable OOV items can account for the seemingly impressive results. The low error rates are attributable to the low number of multi-segment OOV items.
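The precision and recall values of Table 2 can be read as boundary-level scores. The sketch below shows one common way of computing them; the paper does not spell out the exact matching scheme, so the boundary-set formulation here is an assumption.

```python
# Boundary precision/recall/F for a single word: hypothesised split points are
# compared with the reference split points (character offsets inside the word).

def boundaries(segments):
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def precision_recall_f(ref_segs, hyp_segs):
    ref, hyp = boundaries(ref_segs), boundaries(hyp_segs)
    tp = len(ref & hyp)
    precision = tp / len(hyp) if hyp else 1.0
    recall = tp / len(ref) if ref else 1.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(precision_recall_f(["woche", "n", "arbeit", "s", "zeit"],
                         ["wochen", "arbeit", "s", "zeit"]))
# (1.0, 0.75, 0.857...): all hypothesised cuts are correct, one cut is missed
```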
4 Conclusion
The goal of this article was to present a brief overview and a few examples of how FSTs can be used for word segmentation. The introduced top-down and bottom-up approaches, while performing well in the experiments, provide only a limited insight into what FSTs are capable of. For example, top-down models can easily be augmented with stochastic elements; or, inversely, the n-gram approach can integrate morphological classes. It is also possible to detect word-embedded OOV tokens with fall-back arcs in combination with confidence measures. Orthogonal to these technical improvements, another straightforward extension of this research would involve the evaluation of segmentation models in context. The presented low perplexity and OOV rates may imply better ASR performance, but the actual effect on recognition accuracy needs to be verified through experimentation. Although the current literature does not provide a conclusive answer, it seems that segmentation may lead to better ASR performance, but this gain may decrease as the vocabulary size increases [19]. While we cannot answer questions related to speech recognition performance at present, we believe that our work provides a useful basis for further studies concerning word segmentation using finite-state techniques.
2 However, a few OOV words were falsely segmented into—typically short—morphemes, leading to errors (e.g. Tories → Tor+ie+s).
References
1. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: Holub, J., Žďárek, J. (eds.) CIAA 2007. LNCS, vol. 4783, pp. 11–23. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76336-9_3
2. Arisoy, E., Saraclar, M.: Compositional neural network language models for agglutinative languages. In: Interspeech 2016, pp. 3494–3498 (2016)
3. Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: ACL Workshop on Morphological and Phonological Learning, pp. 21–30 (2002)
4. El-Desoky, A., Shaik, A., Schlüter, R., Ney, H.: Sub-lexical language models for German LVCSR. In: Spoken Language Technology Workshop, pp. 159–164 (2010)
5. El-Desoky, A., Shaik, A., Schlüter, R., Ney, H.: Morpheme level feature-based language models for German LVCSR. In: Interspeech 2012, pp. 170–173 (2012)
6. Geutner, P.: Using morphology towards better large-vocabulary speech recognition systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 445–448 (1995)
7. Geyken, A., Hanneforth, T.: TAGH: a complete morphology for German based on weighted finite state automata. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds.) FSMNLP 2005. LNCS (LNAI), vol. 4002, pp. 55–66. Springer, Heidelberg (2006). https://doi.org/10.1007/11780885_7
8. Jurafsky, D., Martin, J.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc., Upper Saddle River (2009)
9. Kang, S.-S., Hwang, K.-B.: A language independent n-gram model for word segmentation. In: Sattar, A., Kang, B. (eds.) AI 2006. LNCS (LNAI), vol. 4304, pp. 557–565. Springer, Heidelberg (2006). https://doi.org/10.1007/11941439_60
10. Larson, M., Willett, D., Köhler, J., Rigoll, G.: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches. In: Interspeech 2000, pp. 945–948 (2000)
11. Matsumoto, Y.: Easy to use practical freeware for natural language processing: morphological analysis system ChaSen. IPSJ Mag. 41(11), 1208–1214 (2000)
12. Mohri, M.: Finite-state transducers in language and speech processing. Comput. Linguist. 23(2), 269–311 (1997)
13. Nußbaum-Thom, M., El-Desoky, A., Schlüter, R., Ney, H.: Compound word recombination for German LVCSR. In: Interspeech 2011, pp. 1449–1452 (2011)
14. Renshaw, D., Hall, K.: Long short-term memory language models with additive morphological features for automatic speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5246–5250 (2015)
15. Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., Tai, T.: The OpenGrm open-source finite-state grammar software libraries. In: ACL 2012 System Demonstrations, pp. 61–66 (2012)
16. Shaik, A., El-Desoky, A., Schlüter, R., Ney, H.: Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR. In: Interspeech 2013, pp. 3404–3408 (2013)
17. Shamraev, N., Batalshchikov, A., Zulkarneev, M., Repalov, S., Shirokova, A.: Weighted finite-state transducer approach to German compound words reconstruction for speech recognition. In: AINL-ISMW FRUCT, pp. 96–101 (2015)
18. Smit, P., Virpioja, S., Kurimo, M.: Improved subword modeling for WFST-based speech recognition. In: Interspeech 2017, pp. 2551–2555 (2017)
19. Tachbelie, M., Abate, S., Menzel, W.: Using morphemes in language modeling and automatic speech recognition of Amharic. Nat. Lang. Eng. 20, 235–259 (2012)
20. Zablotskiy, S., Minker, W.: Sub-word language modeling for Russian LVCSR. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 413–421. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7_51
A Comparison of Language Model Training Techniques in a Continuous Speech Recognition System for Serbian
Branislav Popović1,3,4, Edvin Pakoci1,2, and Darko Pekar1,2
1 Department for Power, Electronic and Telecommunication Engineering, Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia
[email protected]
2 AlfaNum Speech Technologies, Bulevar Vojvode Stepe 40, 21000 Novi Sad, Serbia
3 Department for Music Production and Sound Design, Academy of Arts, Alfa BK University, Nemanjina 28, 11000 Belgrade, Serbia
4 Computer Programming Agency Code85 Odžaci, Ive Andrića 1A, 25250 Odžaci, Serbia
Abstract. In this paper, a number of language model training techniques will be examined and utilized in a large vocabulary continuous speech recognition system for the Serbian language (more than 120000 words), namely Mikolov and Yandex RNNLM, TensorFlow based GPU approaches and CUED-RNNLM approach. The baseline acoustic model is a chain sub-sampled time delayed neural network, trained using cross-entropy training and a sequence-level objective function on a database of about 200 h of speech. The baseline language model is a 3-gram model trained on the training part of the database transcriptions and the Serbian journalistic corpus (about 600000 utterances), using the SRILM toolkit and the Kneser-Ney smoothing method, with a pruning value of 10−7 (previous best). The results are analyzed in terms of word and character error rates and the perplexity of a given language model on training and validation sets. Relative improvement of 22.4% (best word error rate of 7.25%) is obtained in comparison to the baseline language model. Keywords: Language modeling
· RNNLM · LSTM · LVCSR
1 Introduction
Language modeling is an essential component of natural language processing and automatic speech recognition systems. In many applications, a good language model can even overcome flaws of an acoustic model by providing the data necessary to recognize natural sentences. It has been shown that language models may come ever closer to human language understanding [1], allowing their application in different domains and for a range of machine learning problems. For the last several decades, statistical language modeling has been based on relatively simple, yet highly effective and widely successful n-grams, i.e., frequencies of word
sequences of up to given length n [2]. This approach, however, has a few well-known issues, such as data sparsity (n-gram approach usually requires smoothing [3]), as well as statistical dependence on a very limited number (n-1) of preceding words (longer contexts are ignored and it is difficult to train them on a limited amount of data). Several attempts have been made in order to improve n-gram results. Nevertheless, they usually bring more complexity, and more importantly, they could beat n-grams only when the amount of training data is limited – in case of larger datasets, n-grams usually came on top. One of those examples is the use of neural networks for language model training (NNLMs). Recurrent neural network based language models (RNNLMs) have been utilized in recent years to resolve issues concerning confusing and difficult implementation and to reduce the computational complexity. Implementation of these networks is much simpler. More importantly, they are able to resolve two major n-gram model issues – they project each word into a compact continuous vector space, which could be described using only a limited set of parameters, and their recurrent connections are able to model longer contexts – sequences of rather arbitrary length, to be precise. Several experiments have already shown their superiority in relation to both n-grams [4] and feed-forward NNLMs [5]. Nonetheless, RNNLMs are very computationally demanding, which can result in low training speeds, especially when using a lot of training data. The focus of this paper is to compare several language model training approaches on the most widely used textual database for Serbian, as well as to compare the results with the previously obtained n-gram based language model, using a fixed acoustic model trained on the same audio database from our previous research [6, 7]. These approaches include Mikolov RNNLM toolkit (a CPU-based implementation) [8], Yandex RNNLM toolkit (a faster RNNLM toolkit variant which uses a Huffman binary tree [9, 10] or an approximation of the Mikolov’s softmax activation function at the output layer with noise contrastive estimation (NCE)) [11], as well as different TensorFlow-based GPU approaches – vanilla, LSTM and fast (pruned) LSTM approach [12], and finally, the CUED-RNNLM toolkit (another GPU-based approach that involves an efficient GPU implementation with an improved training criterion) [13]. The proposed language models have been evaluated and compared in terms of their perplexity (PPL) on the training and validation sets used for language model training, as well as the word error rate (WER) on the given test set, obtained by lattice rescoring (either regular or pruned), or the N-best list rescoring. All the above mention approaches (except CUED-RNNLM) have already been implemented within the Kaldi speech recognition toolkit [14]. Therefore, they do not require any external libraries other than CUDA and TensorFlow (where applicable). Finite state transducers (FSTs) representing language model differences have been used, as well as n-gram approximation techniques for lattice rescoring [15], in order to reduce the amount of computation needed and to prevent lattice explosion.
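The lattice and N-best rescoring mentioned above can be summarised in a few lines. The sketch below is only a schematic view: the hypotheses, scores, interpolation weight and the toy language model are invented for illustration, whereas in the experiments the scores come from Kaldi lattices and the trained RNNLMs.

```python
# Schematic N-best rescoring: swap the old LM score for a new one and re-rank.

def rescore_nbest(nbest, new_lm, lm_weight=10.0):
    """nbest: list of (words, acoustic_logp, old_lm_logp) tuples."""
    rescored = []
    for words, ac_logp, _old_lm_logp in nbest:
        total = ac_logp + lm_weight * new_lm(words)
        rescored.append((total, words))
    return max(rescored)[1]

def toy_lm(words):                    # stand-in for an RNNLM log-probability
    return 0.0 if words == ["dobar", "dan"] else -2.0

nbest = [
    (["dobar", "dan"], -120.0, -8.0),
    (["dobar", "van"], -119.0, -9.5),
]
print(rescore_nbest(nbest, toy_lm))   # ['dobar', 'dan']: the stronger LM score
                                      # outweighs the slightly worse acoustics
```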
2 Theoretical Background For n-gram language models, the probability of each word depends only on up to n previous words. The probability of a word sequence is estimated according to the number of appearances in the training corpus. In the case of highly inflective and morphologically rich languages, this approximation is highly inadequate, because there are many sequences that will never occur (in Serbian, there are 7 cases, 2 numbers, 3 genders, 14 verb forms and several dialects). The baseline language model used in this paper is a 3-gram language model, trained on the training part of the database transcriptions (148558 utterances, see Sect. 3.1), and the additional part coming from the Serbian journalistic corpus for more realistic probability estimation (442000 utterances). The model contains 121197 unigrams, 1279389 bigrams and 357721 trigrams. The Kneser-Ney smoothing method was applied with a pruning value of 10−7. The baseline WER was 9.34%, and the baseline CER 2.46%. In our previous research, it was determined that short words, e.g. prepositions, particles and conjunctions, and vowels, poorly covered by the language model, contribute very significantly to the total number of word errors (number of insertions, deletions and substitutions in LVCSR system). It was also concluded that a more suitable language model could be used in order to resolve most of the issues [6]. Bearing that in mind, several language model training techniques and multiple training configurations have been examined and will be briefly described in the rest of this section. 2.1
TensorFlow-Based GPU Approaches
Three TensorFlow-based GPU approaches have been examined. The first one was the vanilla RNNLM approach (a simple, regular neural network, with a single hidden layer trained using the standard, i.e., vanilla backpropagation algorithm). However, two major issues have been reported concerning these networks - the vanishing gradient problem (the network is unable to learn long-term dependencies), and the exploding gradient problem (the possibility of overflow). Therefore, the second approach, more robust against the problems of long-term dependency, uses somewhat more complicated long short-term memory (LSTM) network. This is a sequence to sequence approach, using one word at a time to produce probabilities for the next word in a sentence. The third approach is also a LSTM approach, but a pruning algorithm and a modified softmax function, i.e., a function able to train a self-normalized network, where sum of outputs is automatically close to zero, are used in order speed up the computation [15]. 2.2
Mikolov RNNLM Toolkit
The Mikolov RNNLM toolkit utilizes a recurrent neural network (RNN) architecture, which consists of the input layer, one hidden layer and the output layer [8]. In the training phase, words are subjected to the input layer in 1-of-N representation and further concatenated with the previous state of the hidden layer. The standard sigmoid
activation function is used for the neurons in the hidden layer. The output layer represents the probability of the current word in relation to the previous word and the state of the hidden layer for the previous time step. The standard stochastic gradient descent algorithm is used for the RNNLM training. The recurrent weights are calculated by using the so-called truncated backpropagation through time (the network is unfolded for the specified amount of time steps). The training is conducted iteratively (11 to 13 iterations in the case of our experiments). One part of the training set was used for the actual training, while the rest (30000 utterances, about 5% of them) was used for validation purposes (the same configuration was used for each of our experiments). 2.3
Yandex RNNLM Toolkit
This is a faster RNNLM implementation. The topology consists of one input layer (fed by the full history vector, obtained by concatenation of a given word in 1-of-N representation and a continuous vector for the remaining context), one hidden layer that computes another representation using the sigmoid activation function, and one output layer, producing the RNNLM probabilities. Instead of the explicit computation of the output layer normalization term, i.e., the softmax activation function, either a Huffman binary tree [9], that assign short binary codes to the most frequent words, or NCE [11] is used, i.e., a nonlinear logistic regression to discriminate among the observed data and some noise distribution is performed, allowing the efficient implementation during both the training and the testing phase. In the case of our experiments, the –nce option (the number of noise samples) was set to 20 (the recommended value). Around 50% faster training is obtained on average, compared to the Mikolov RNNLM toolkit. 2.4
CUED-RNNLM Toolkit
This is a FRNNLM approach [13], i.e., a RNNLM approach with a full output layer instead of the class based RNNLM. Instead of the conventional objective function based on the cross-entropy criterion, improved training criteria have been implemented, namely variance regularization, which explicitly adds the variance of the normalization term to the standard objective function, and the previously described NCE, where each word is assumed to be generated by both data and noise distributions.
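The shape of the NCE criterion described above can be written down compactly. The NumPy sketch below is not the CUED-RNNLM implementation; all numbers are arbitrary, and the unnormalised scores stand in for the network's output-layer activations.

```python
# Schematic NCE loss: the observed word's unnormalised score is contrasted
# with the scores of k words sampled from a noise distribution.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(target_score, noise_scores, target_noise_p, noise_noise_p, k):
    """Scores are unnormalised log-scores s(w|h); *_noise_p are probabilities
    of the same words under the noise distribution."""
    pos = np.log(sigmoid(target_score - np.log(k * target_noise_p)))
    neg = np.sum(np.log(1.0 - sigmoid(noise_scores - np.log(k * noise_noise_p))))
    return -(pos + neg)               # negative log-likelihood to minimise

# one training example with k = 2 noise samples (numbers are arbitrary)
print(nce_loss(target_score=2.0,
               noise_scores=np.array([-1.0, 0.5]),
               target_noise_p=1e-4,
               noise_noise_p=np.array([1e-3, 1e-3]),
               k=2))
```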
3 Experimental Setup
3.1 Acoustic Model
The baseline acoustic model is a chain sub-sampled time-delay neural network (TDNN), trained using cross-entropy training and a sequence-level objective function, i.e., the log-probability of the correct phone sequence. The training procedure consists of pre-DNN training and DNN training phase. The pre-DNN training phase [3] consists of feature extraction (14 Mel-frequency cepstral coefficients and 3 additional features for pitch: the probability of voicing (POV) - a warped normalized cross correlation function (NCCF) feature, the log-pitch value with POV-weighted mean subtraction
over a 1.5 s window, and delta-pitch - delta feature computed on raw log pitch, plus the first and second order time derivatives of all static features), monophone model training (1000 Gaussians, 40 iterations), several triphone model trainings (first pass - 9000 Gaussians and 1800 states, 35 iterations, second pass - 25000 Gaussians, 3000 states, 35 iterations), and speaker adaptive training (25000 Gaussians, 3000 states, 35 iterations, feature space maximum likelihood linear regression (fMLLR) using diagonal matrices). The DNN phase [7] uses high-resolution features (40 high-resolution mel-frequency cepstral coefficients, calculated on 30 ms frames with 10 ms frame shift, plus the pitch features). The network consists of 8 hidden layers and 625 neurons per hidden layer. In order to reduce the amount of computation, the topology is simplified. Instead of the standard 3-state left-to-right topology, the chain topology can be traversed in a single frame, i.e., the most hidden layers at the output of the neural network have to be evaluated only on every 3rd frame. The "−1,0,1" layer splicing configuration is used for the initial layers (describing 3 consecutive frames), and "−3,0,3" was used for the most hidden layers (describing 3 frames, separated by 3 frames from each other). The TDNN was trained in 4 epochs, i.e., 60 iterations, on a CUDA-enabled GeForce GTX 1080 Ti GPU.
3.2 Data Preparation
The database is comprised of 2 data sets, described in our previous research (see e.g. [3, 6, 7]). The larger one is a set of audio books, read by 132 professional male and female speakers (74 males and 58 females) in a studio environment (mostly high quality audio). This is a set of larger and more complex utterances. The total duration of the first set is 154 h and 3 min. The second one consists of domain-oriented speech, mostly short utterances recorded over various mobile phone devices. This part of the database was added in order to increase the total amount of data and to improve the recognition accuracy in voice assistant type applications. The sentences were spoken by 169 male and 181 female speakers and the total duration was 60 h and 57 min. The training part of the database comprises 197 h of speech (including silence), divided into training (95%) and cross-validation (5%) parts. 18 h of speech (including silence) was selected for testing purposes. Both data sets were recorded in mono PCM format, sampled at 16000 Hz, using 16 bits per sample.
4 Experimental Results In Tables 1 to 6, the results are presented for all the above mentioned toolkits and various language model configurations, and a test vocabulary of around 121000 words. The columns are given in the following order: rescoring type (abbreviated as RT, either language model rescoring of lattices or N-best list rescoring), n-gram order (e.g. if n-gram order is 4, any history that shares the last 3 words would be merged into a single state), number of most frequent words in the shortlist (while the rest of words are grouped together and their probabilities are distributed uniformly according to the respective unigram counts), number of classes and hidden layers (where applicable),
word (WER) and character (CER) error rates, and the perplexity (PPL) for the given training and validation sets. Other parameters (not given in the table) were set to their default values in Kaldi and TensorFlow. The size of the network, i.e., the number of hidden layers and neurons, was tailored in all experiments to obtain the optimal trade-off between training speed and accuracy. In some of the experiments (Tables 4 and 5), N-best rescoring was used instead of lattice rescoring (the number of hypotheses was set to 1000, so the results were quite similar, but the training was much faster).

Table 1. LSTM TensorFlow-based GPU configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        2       200      7.48  2.02  75.072     114.098
Lattice  4       40000   –        2       200      7.31  1.99  75.214     114.289
Lattice  3       57464   –        2       200      7.51  2.02  78.893     124.185
Lattice  4       57464   –        2       200      7.25  1.99  78.820     124.905
Lattice  3       77803   –        2       200      7.35  1.99  81.340     130.842
Lattice  4       77803   –        2       200      7.27  1.98  81.354     130.658
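The WER and CER columns follow the standard edit-distance definition. The following sketch illustrates the computation; the example word sequences are invented and are not taken from the test set.

```python
# Word or character error rate as Levenshtein distance divided by the
# reference length (CER is the same computation over characters).

def error_rate(ref, hyp):
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

ref = "danas je lep dan".split()
hyp = "danas je led dan".split()
print(error_rate(ref, hyp))                   # 0.25 -> 25% WER
print(error_rate(list("lep"), list("led")))   # 0.333... -> CER over characters
```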
In Tables 1 to 3, the results are given for various TensorFlow-based GPU approaches, i.e., LSTM (Table 1), fast (pruned) LSTM (Table 2) and vanilla approach (Table 3). The number of words in the training vocabulary was set to either 40000 (approximately one third of the input lexicon), 57464 (words appearing 5 or more times in the training set) or 77803 (words that appear 3 or more times in the training set), respectively. Other parameters were set to their default values (200 neurons in the hidden layer, 2 layers in the case of LSTM and fast (pruned) LSTM training, and a single layer in the case of vanilla training).

Table 2. Fast (pruned) LSTM TensorFlow-based GPU configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        2       200      7.64  2.05  71.972     123.010
Lattice  4       40000   –        2       200      7.47  2.03  72.017     122.664
Lattice  3       57464   –        2       200      7.65  2.05  76.646     135.569
Lattice  4       57464   –        2       200      7.44  2.01  76.291     135.343
Lattice  3       77803   –        2       200      7.60  2.04  81.263     143.401
Lattice  4       77803   –        2       200      7.58  2.03  80.462     142.515
The average PPL on the validation set was 123.163 for the LSTM case, 133.750 for the fast LSTM case and 422.256 for the vanilla case. Those values correspond well to the average word error rate (LSTM 7.36%, fast LSTM 7.56%, vanilla 8.99%). In Table 4, the results are presented for the Mikolov RNNLM implementation, using the default number of 300 neurons in the hidden layer. The number of classes in
Table 3. Vanilla TensorFlow-based GPU configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        1       200      8.91  2.39  350.179    372.627
Lattice  4       40000   –        1       200      8.93  2.38  386.209    406.665
Lattice  3       57464   –        1       200      8.97  2.41  420.131    443.908
Lattice  4       57464   –        1       200      9.20  2.44  439.877    465.647
Lattice  3       77803   –        1       200      9.00  2.40  377.774    409.681
Lattice  4       77803   –        1       200      8.91  2.38  413.213    435.018
Tables 4 and 5 (the Yandex implementation) was increased with the number of words (400, 450 and 500 classes were examined). The average PPL was 110.242 for the Mikolov RNNLM and 139.618 for the Yandex RNNLM, and the average WER was 7.48% (Mikolov) and 7.55% (Yandex).
Table 4. Mikolov RNNLM configurations.
RT      n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
N-best  3       40000   400      1       300      7.41  2.00  –          116.131
N-best  4       40000   400      1       300      7.53  2.04  –          105.498
N-best  3       57464   450      1       300      7.42  2.01  –          112.605
N-best  4       57464   450      1       300      7.45  2.04  –          104.060
N-best  3       77803   500      1       300      7.59  2.05  –          120.344
N-best  4       77803   500      1       300      7.47  2.03  –          102.813
In Table 6, results are given for the CUED-RNNLM implementation. The average PPL on the validation set was 163.205 and the average WER 7.62%. Concerning their respective training times, the TensorFlow versions completed in about an hour, Mikolov RNNLM trainings took approximately 20 h, and Yandex RNNLM trainings finished in about 10 h, as well as the CUED-RNNLM trainings.
Table 5. Yandex RNNLM configurations.
RT      n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
N-best  3       40000   400      1       300      7.52  2.10  –          134.434
N-best  4       40000   400      1       300      7.72  2.11  –          155.795
N-best  3       57464   450      1       300      7.53  2.07  –          135.226
N-best  4       57464   450      1       300      7.55  2.07  –          133.465
N-best  3       77803   500      1       300      7.54  2.06  –          148.749
N-best  4       77803   500      1       300      7.46  2.06  –          130.039
The best result (7.25% WER, 22.4% relative improvement, i.e., the proposed configuration in Figs. 1 and 2) was obtained for the case of LSTM training, for 57464 words and a 4-gram language model. The number of substitutions was the most prominent (8080 vs. 708 insertions and 2720 deletions). Although the influence of short words and vowels on the final WER is somewhat decreased (more errors are caused by an improper case), the distribution of error, given in terms of the number of unique deletions/insertions/substitutions per total number of deletions/insertions/substitutions, remains almost the same.

Table 6. CUED-RNNLM configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        1       200      7.67  2.07  114.104    181.733
Lattice  4       40000   –        1       200      7.58  2.04  115.085    183.289
Lattice  3       57464   –        1       200      7.66  2.07  124.129    158.669
Lattice  4       57464   –        1       200      7.59  2.03  124.201    158.745
Lattice  3       77803   –        1       200      7.68  2.07  129.459    149.567
Lattice  4       77803   –        1       200      7.53  2.03  127.410    147.229
The comparison between the results obtained using the baseline (SRILM) language model and the proposed (best) configuration is presented in Fig. 1. In Fig. 2, the distribution of error for the baseline and the proposed configuration, i.e., the percentage of unique deletions/insertions/substitutions in the total number of deletions/insertions/substitutions, is shown. Figure 2 suggests that no prominent new instances of deletions, insertions or substitutions emerged that had not been observed before, but, on the other hand, there were also no new instances of errors in general. Also, the relative number of deletions and insertions is slightly reduced in favor of the number of substitutions.
Fig. 1. A comparison between the baseline and the proposed (best) configuration in terms of the number of deletions, insertions and substitutions.
Fig. 2. The distribution of error for the baseline and the proposed configuration (unique deletions/insertions/substitutions per total number of deletions/insertions/substitutions [%]).
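The quantity plotted in Fig. 2 can be reproduced from per-type error lists as shown below. The error lists here are invented placeholders, not the actual system output.

```python
# Share of unique deletions/insertions/substitutions among all error
# instances of that type (the quantity shown in Fig. 2).

def unique_share(errors):
    return len(set(errors)) / len(errors) if errors else 0.0

deletions     = ["je", "je", "i", "u", "je"]
insertions    = ["i", "i", "a"]
substitutions = [("lep", "led"), ("dan", "van"), ("lep", "led")]

for name, errs in [("deletions", deletions),
                   ("insertions", insertions),
                   ("substitutions", substitutions)]:
    print(name, round(unique_share(errs), 2))
# deletions 0.6, insertions 0.67, substitutions 0.67
```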
5 Conclusion
Several language modeling approaches have been analyzed in this paper. A significant improvement (more than 20% relative) has been obtained in terms of both WER and CER in comparison to the baseline language model, which confirms the hypothesis given in our previous research [6]. However, the distribution of errors among different word forms has remained more or less the same. Bearing in mind the inflective nature of the language and the fact that most of the errors are caused by an improper grammatical case (CER was relatively low in all of the experiments), the class n-gram approach is the next logical step in the development of our LVCSR system.
Acknowledgments. The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project "Development of Dialogue Systems for Serbian and Other South Slavic Languages", EUREKA project DANSPLAT, "A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region", id E! 9944, and the Provincial Secretariat for Higher Education and Scientific Research, within the project "Central Audio-Library of the University of Novi Sad", No. 114-451-2570/2016-02.
References
1. Goodman, J.T.: A bit of progress in language modeling, extended version. Microsoft Research, Technical report MSR-TR-2001-72 (2001)
2. Rosenfeld, R.: Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278 (2000)
3. Pakoci, E., Popović, B., Pekar, D.: Language model optimization for a deep neural network based speech recognition system for Serbian. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 483–492. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_48
4. Mulder, W.D., Bethard, S., Moens, M.F.: A survey on the application of recurrent neural networks to statistical language modeling. Comput. Speech Lang. 30(1), 61–98 (2015)
5. Mikolov, T., Kombrink, S., Burget, L., Černocký, J.H., Khudanpur, S.: Extensions of recurrent neural network language model. In: Proceedings of ICASSP, pp. 5528–5531. IEEE (2011)
6. Popović, B., Pakoci, E., Pekar, D.: End-to-end large vocabulary speech recognition for the Serbian language. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 343–352. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_33
7. Pakoci, E., Popović, B., Pekar, D.: Fast sequence-trained deep neural network models for Serbian speech recognition. In: 11th Digital Speech and Image Processing, DOGS, Novi Sad, Serbia, pp. 25–28 (2017)
8. Mikolov, T., Kombrink, S., Deoras, A., Burget, L., Černocký, J.H.: RNNLM - recurrent neural network language modeling toolkit. In: Proceedings of ASRU Workshop (2011)
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space, arXiv:1301.3781 (2013)
10. Niu, F., Recht, B., Ré, C., Wright, S.J.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, Chicago, pp. 693–701 (2011)
11. Chen, X., Liu, X., Gales, M.J.F., Woodland, P.C.: Recurrent neural network language model training with noise contrastive estimation for speech recognition. In: Proceedings of ICASSP, pp. 5411–5415. IEEE (2015)
12. Abadi, M.: TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467 (2016)
13. Chen, X., Liu, X., Qian, Y., Gales, M.J.F., Woodland, P.C.: CUED-RNNLM – an open-source toolkit for efficient training and evaluation of recurrent neural network language models. In: Proceedings of ICASSP, pp. 6000–6004. IEEE (2015)
14. Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4. IEEE Signal Processing Society (2011)
15. Xu, H., et al.: A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition (2017)
Perceptual-Auditory Evaluation of the Aggressive Speech Behavior: Gender Aspect (on the Basis of Russian and Spanish Languages) Rodmonga Potapova1 , Liliya Komalova1,2(&) and Vsevolod Potapov3
,
1
Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, Ostozhenka Street 38, 119034 Moscow, Russia {RKPotapova,GenuinePR}@yandex.ru 2 Department of Linguistics, Institute of Scientific Information for Social Sciences of the Russian Academy of Sciences, Nakhimovsky Prospect 51/21, 117997 Moscow, Russia 3 Centre of New Technologies for Humanities, Lomonosov Moscow State University, Leninskije Gory 1, 119991 Moscow, Russia [email protected]
Abstract. The purpose of the research was to identify prosodic features of perceived images of a “male aggressor” (10 Russian samples, 10 Spanish samples) and a “female aggressor” (10 Russian samples, 10 Spanish samples) reconstructed by Russian male (n = 13) and female (n = 42) listeners in the course of the perceptual-auditory experiment. All listeners reported that the speech of all Russian and Spanish subjects (informants) was perceived as if they were externalizing an offensive type of aggression during the aggression escalation period. It is characterized by strong negative emotions, differing in male and female subjects’ (informants’) groups. Qualitative analysis of the prevailing speech prosodic tendencies revealed that in the studied conditions, female listeners estimated the male subjects’ (informants’) voice intensity in the aggressor image as stronger in comparison with the female subjects’ (informants’) voice intensity. Male listeners found female voice pitch in the aggressor image higher than the male aggressor voice pitch. Male and female listeners perceived the Spanish subjects’ (informants’) speech tempo as faster than the Russian subjects’ (informants’) speech tempo. Female listeners considered the perceptualauditory aggressor images to have clear speech rhythm, while male listeners perceived speech rhythm of the aggressor image as irregular (with no gender difference of a speaker). The obtained results are compared with the previous findings on the material of British and American English. Keywords: Aggressive speech behavior Speech prosody Gender Speech perception Emotions Speech rhythm Speech pauses Speech breathing Speech timbre Speech melodic pattern
1 Introduction In the realm of pragmalinguistics, the effects of antisocial speech behavior in general have long been studied. However, less well understood are the perceptional and cognitive mechanisms of aggressive speech behavior that influence bystanders. In the aggressive interaction “bystanders are of particular interest as they have the potential to amend the situation by intervening” [2]. Moreover, “the mere perception of another’s behavior automatically increases the likelihood of engaging in that behavior oneself” [5]. Emotionally colored speech behavior of a speaker can intervene in the decision making towards sharing his / her ideas and follow them. Analyzing political communication, Nau and Stewart emphasize that “verbally aggressive political speakers are perceived as less communicatively appropriate and credible than nonaggressive speakers, and are less likely to win agreement with their messages” [23]. Allan points out the difference between how the insulter, the victim and the onlooker / overhearer (side participant) perceive a certain kind of behavior as insulting, arguing that “verbal insult depends in large part on the language used because the insult arises from its perlocutionary effect” [1]. Berrocal says “victimhood”, as a social role, is constructed in discourse. The author examines a display of parliamentary discourse, which presents the violator as a victim of conspiracy, calling for sympathy and providing selfjustification, on the one hand, and using verbal attacks to undermine and disqualify a number of overt and covert enemies, and highlights the importance of the discourse analysis in case of detecting the real aggressor and victim [3]. The topicality of the speech perception in the frame of the aggressive speech act analysis cannot be overestimated, for the individual-subjective approach prevails over the formalized-objective procedure in the process of interpersonal communication. Based on many years of practical and research experience, Potapova, Potapov, Lebedeva and Agibalova argue that “the human ear is able to distinguish the speaker’s emotions, even in the absence of any indication of this on the part of semantics, vocabulary and grammar” [31, p. 128]. Analyzing speech indicators the listener gets informed about the emotional change from neutral (safe mode) to aggressive (alert mode) mood of the speaker that helps the percipient tune up with the speaker and prepare a reciprocal communicative response. Anyway, the fact of speech aggression recognition signifies the potential threat to the recipient’s psychological integrity. Extreme aggressive behavior and psychophysiological deviations accompanied with aggressive outcomes are well distinguishable and described in modern pathopsychology [8, 20], psychiatry of deviant behavior [4, 19, 35, 36], and sociology of crowds [9, 34]. The aim of this paper is to bring some clarity to debates about the so called “everyday aggression”. Enikolopov, Kuznetsova and Chudova [7] introduce the concept as the aggression manifested by law-abiding, mentally healthy, educated citizens. The consequences of such aggression are hardly recognized even by victims themselves. Damages are invisible for legal prosecution and cannot be avoided in everyday life. Harmful psychological damage of everyday aggression to emotional, cognitive and axiological spheres is quite real and manifests itself both on individual and social levels of interaction [7, p. 5]. Among possible destructive consequences of
this phenomenon are individual and social adaptability declines, emotional destabilization, and inefficiency in solving problems. Due to the fact that the recognition and evaluation of everyday aggression is made by recipients (victims and bystanders) on the basis of subjective (speech, visual, tactile) perception, the experimental approach involving audio-perceptual analysis will make it possible to sketch the perceived image of the emotional-modal complex “aggression”. Communicative behavior is gender identified [11, 12, 22, 24, 25, 32]. According to the fact that prosody also indicates social and national identities [6, 18, 21], the gender factor can be considered as influential on prosodic behavior. For example, Darania and Darani [13] argue that, “the paralinguistic cue of gender can play an influential role in same-sex and cross-sex talks especially in societies where men and women are viewed differently” [13, p. 427]. That’s why prosodic parameters are usually used in various approaches to solve author gender identification issues (see, for example [33]).
2 Method and Procedure The perceptual-auditory method (analysis) supposes assessment of the spoken language materialized in physical form of specially selected recordings by means of special questionnaires. The analyzed subject (speech samples) must reflect selected parameters which are investigated. The methodology assumes generalization of the test evaluations obtained from the homogeneous group, the interpretation of the identified patterns and trends, and validation of the data for statistical significance [15, p. 85]. The purpose of the research was to identify differential prosodic features of the “male aggressor” and “female aggressor” images reconstructed by male and female listeners in the course of the perceptual-auditory experiment. The group of listeners consisted of 55 individuals (42 females and 13 males) – native Russian speakers aged 19 to 23. The experimental material consisted of a dataset of authentic monologue speech samples (N = 40) of speakers perceived as “aggressors”: male Russian native speakers (n = 10); female Russian native speakers (n = 10); male native speakers of Castilian Spanish (n = 10); female native speakers of Castilian Spanish (n = 10). The experimental dataset was constructed at the previous stage of the research (see [27, 28]). The procedure involved 32 Russian native listeners – graduates of the Moscow State Linguistic University, advanced Spanish speakers. The subjects characterized the selected speech stimuli as samples with aggressive speech behavior (98%). All the listeners (N = 55) gave written consent to participate in the experiment before the experiment started. They were willing to stop the experiment at any time they wished. The experimental task consisted of the auditory test. After listening to each speech sample as many times as required, the listeners were asked to answer questions of special questionnaires (as described in [29]). Those tasks were performed strictly individually. In the experiment the listeners played the role of bystanders passively perceiving verbal realization of an aggressive act.
3 Results
On the basis of the data obtained during the perceptual-auditory analysis in this research, the following perceived images of male and female "aggressors" were identified (Table 1).
Table 1. Perceptual-auditory aggressor images (average normalized evaluations).
Parameter               Feature      Female aggressor      Male aggressor
                                     Russian   Spanish     Russian   Spanish
Voice pitch             low          6,6       5,1         12        20
                        medium       31        28          37        27
                        high         19        22          7,3       7,3
Voice intensity         weak         1,6       2           3,2       3,8
                        moderate     27        31          27        28
                        strong       27        22          26        23
Voice timbre (by pair)  clear        30        28          20        10
                        muffled      13        12          20        26
                        limpid       20        19          20        9,3
                        hoarse       14        12          17        24
                        sing-song    7,2       10          7,2       4,5
                        sharp        31        22          31        27
                        soft         7,4       11          5,6       5
                        rough        23        13          26        23
                        pleasant     7,5       8,5         9,3       6,4
                        unpleasant   19        15          19        17
Speech melodic pattern  smooth       9,4       14          11        5,7
                        irregular    43        38          43        45
                        monotonous   3,1       2,1         1,8       3,3
Speech tempo            slow         9,3       0,8         4,6       0,7
                        moderate     26        17          34        17
                        fast         21        37          17        37
Speech rhythm           clear        36        36          33        30
                        irregular    20        18          23        24
Speech pauses           short        33        38          29        39
                        medium       14        11          18        11
                        long         2,1       1,2         3,4       1,6
Speech breathing        normal       22        34          27        20
                        irregular    24        16          19        23
                        discomfort   11        5,3         9,8       12
All listeners reported that the speech of all Russian and Spanish subjects (informants) was perceived as if they were externalizing an offensive type of aggression during the aggression escalation period. The Russian female aggressor image is characterized by strong negative emotions such as rage and hatred mixed with such emotional-modal states as indignation and
grievance experienced by the subjects (informants). Their speech was characterized by an irregular melodic pattern (sharp changes at the level of melodic registers, with alternating peaks and recessions), a clear rhythm and a moderate-average speech tempo. The voice pitch values are medium; the voice intensity changes from moderate to strong. Speech pauses are of minimum duration (short pauses). Speech breathing is irregular. The voice timbre is marked only with negative nuances (muffled, hoarse, sharp, rough, gruff, and unpleasant voice). The Russian male aggressor image is characterized by strong negative emotions of anger and rage in combination with the negative emotional-modal state of anxiety experienced by the subjects. Their speech is also characterized by an irregular melodic pattern, clear rhythm, moderate-average speech tempo, short pauses and normal breathing. Dynamic features of the voice do not deviate from the average values. The voice timbre is marked only with negative nuances (muffled, hoarse, sharp, rough, and unpleasant voice). The Spanish female aggressor image is described as radiating a strong negative emotion of anger and the emotional-modal states of indignation and grievance. Dynamic features of the voice do not deviate from the average values. The melodic pattern is perceived as irregular, the rhythm as clear, the speech tempo as fast, pauses as short, and speech breathing as normal. The voice timbre is marked only with negative nuances (sharp, rough, unpleasant voice). The Spanish male aggressor image has similar characteristics: the subjects (informants) are perceived as experiencing strong negative emotions of anger and rage in combination with the negative emotional-modal state of grievance. Dynamic voice features do not deviate from the average values. The melodic pattern is perceived as irregular, the rhythm as clear, the speech tempo as fast, pauses as short, and speech breathing as irregular. The voice timbre is marked only with negative nuances (muffled, hoarse, sharp, rough, and unpleasant voice). The listeners assessed the Russian and Spanish male and female aggressor images at 5–6 points on the 10-point scale, which correlated with the escalation period on the conflict development scale [10]. The analyzed speech samples could therefore be placed between two extremes: (1) the transition from the inception of the conflict to escalation, and (2) the transition from escalation to the peak of aggression. These findings differ from what we obtained in a previous analysis of British and American English samples with the same experimental procedure: under the same experimental conditions, the British recordings were marked as a transition from the inception of the conflict to escalation (escalation features prevailing), and the American recordings as a transition from escalation to the peak of conflict interaction (features of the conflict peak prevailing) [26, pp. 150–151]. The Page test and the Wilcoxon signed-rank test were used to measure the reliability of the revealed tendencies in the listeners' evaluations for each parameter and language separately, and the Page test was also conducted to measure the reliability of the revealed differences between the evaluations of female and male subjects. All measures were statistically valid (q ≤ 0,05). Table 1 shows the average normalized evaluations.
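The statistical procedure mentioned above can be reproduced with standard tools. The sketch below is only schematic: the ratings are synthetic stand-ins, and the data layout (listeners as rows, ordered conditions as columns) is an assumption, not the authors' actual design.

```python
import numpy as np
from scipy.stats import page_trend_test, wilcoxon

rng = np.random.default_rng(0)

# Synthetic stand-in evaluations: one row per listener, one column per
# ordered condition (e.g., hypothesized increasing degree of aggression).
female_subjects = rng.uniform(1, 10, size=(20, 4))
male_subjects = rng.uniform(1, 10, size=(20, 4))

# Page's trend test: is there a monotonic trend across the ordered conditions?
page = page_trend_test(female_subjects)
print(f"Page L = {page.statistic:.1f}, p = {page.pvalue:.4f}")

# Wilcoxon signed-rank test: paired comparison of the evaluations given to
# female vs. male subjects by the same listeners.
w_stat, w_p = wilcoxon(female_subjects.mean(axis=1), male_subjects.mean(axis=1))
print(f"Wilcoxon W = {w_stat:.1f}, p = {w_p:.4f}")
```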
If we take the trend indicators as estimated by the mixed listeners’ group (N = 55) and compare them with the evaluations given by the females’ group (n = 42) and the
males' group (n = 13), and then compare the evaluations of the two gender groups with each other, it is possible to reveal deviations from the mixed-group norm correlated with the factor "gender of the recipient". The qualitative analysis of the prevailing tendencies (Table 2) revealed that, under the studied conditions, female listeners estimated the male subjects' (informants') voice intensity in the aggressor image as stronger than the female subjects' (informants') voice intensity. Male listeners found the female voice pitch in the aggressor image higher than the male aggressor voice pitch. Male and female listeners alike perceived the Spanish subjects' (informants') speech tempo as faster than the Russian subjects' (informants') speech tempo. Female listeners attributed a clear speech rhythm to the perceptual-auditory aggressor image, while male listeners perceived the speech rhythm as irregular (with no gender difference of the speaker).
Table 2. Results of the qualitative analysis of the perceptual-auditory evaluations of prevailing tendencies.
Language  Listeners         Subjects         Voice pitch   Voice intensity  Speech tempo  Speech rhythm  Speech pauses  Speech breathing
Russian   Female listeners  Female subjects  Medium        Moderate         Moderate      Clear          Short          Irregular
Russian   Female listeners  Male subjects    Medium        Strong           Moderate      Clear          Short          Normal
Russian   Male listeners    Female subjects  Medium, high  Moderate         Moderate      Irregular      Short          Normal
Russian   Male listeners    Male subjects    Medium        Moderate         Moderate      Irregular      Medium         Normal
Spanish   Female listeners  Female subjects  Medium        Moderate         Fast          Clear          Short          Normal
Spanish   Female listeners  Male subjects    Low           Strong           Fast          Clear          Short          Irregular
Spanish   Male listeners    Female subjects  High          Moderate         Fast          Irregular      Short          Normal
Spanish   Male listeners    Male subjects    Medium        Moderate         Fast          Irregular      Short          Normal
4 Conclusions and Discussion
One can assume that a clear speech rhythm (in both men and women) signals the speaker's confidence in what he / she is saying, the desire to convey the meaning of the message to the recipient in full, and the intended influence of his / her speech, which is in alignment with the intention of an offensive type of aggression to impose the perpetrator's opinion / point of view on the listener.
For all speech samples, female and male listeners reported the melodic pattern as being perceived as irregular, regardless of whether it concerned female or male informant speech, in Russian or in Spanish, which indicates a state of emotional imbalance and the presence of so-called "mixed" feelings [16, p. 123]. The presence of speech pauses of minimum duration indicates the dominant status of the speaker, who does not give the listener a chance to express his / her opinion and to consider his / her answer. In the described communication conditions, this manner of speaking makes the listener focus all his / her attention on the perceived information and remain in the position of an object (possibly, a "victim of aggression"). Speech breathing in the perceived aggressor images is mainly characterized as normal, which indicates a relatively balanced emotional state [17, 37] and correlates with the evaluation of the awareness (and therefore the intention) of the speaker's aggressive actions. Irregular breathing may indicate forcing the air through the narrowed larynx, which, in turn, is also regarded as a "gesture of aggressiveness / negative axiological evaluation" (by S.V. Kodzasov). The voice in the perceptual-auditory aggressor image is characterized by the dominance of negative timbral nuances (hollowness, hoarseness, sharpness, roughness, unpleasantness). According to Potapova and Potapov, a husky, hoarse phonation type may signal deeply felt emotions in many cultures. A sharp drop of the voice pitch in the presence of creaky phonation in the communication of Russian and Spanish male subjects (informants) signals a possible intention to humiliate the communication partner, to hurt and discredit him / her in the eyes of others [30, p. 297]. When considering prosodic techniques of expressive interaction and their functions, Kodzasov points out that a gravel voice is a gesture of negative evaluation [14, p. 196] of the communicative situation / speech message / interlocutor. Presumably, the gender peculiarities in the female and male aggressor images are explained primarily by the stereotypical expectation of aggressive behavior from men rather than from women, which, in turn, leads to a greater perceived expressiveness of negative timbral nuances of the voice and a tendency to perceive the voice as stronger in the perceptual-auditory male aggressor image. The same pattern is likely to appear in relation to gender-specific perception of the speaker by a listener of the opposite sex. Female listeners describe the male aggressor image using prosodic features that create the perceived image of a stronger subject who is more confident in his speech behavior.
Acknowledgements. The research is carried out with the support of the Russian Science Foundation (RSF) as part of the project № 18-18-00477.
References 1. Allan, K.: The pragmeme of insult and some allopracts. In: Allan, K., Capone, A., Kecskes, I. (eds.) Pragmemes and Theories of Language Use, vol. 9. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-43491-9_4 2. Allison, K.R., Bussey, K.: Cyber-bystanding in context: a review of the literature on witnesses’ responses to cyberbullying. Child. Youth Serv. Rev. 65, 183–194 (2016). https:// doi.org/10.1016/j.childyouth.2016.03.026
3. Berrocal, M.: ‘Victim playing’ as a form of verbal aggression in the Czech parliament. J. Lang. Aggress. Confl. 5(1), 81–107 (2017). https://doi.org/10.1075/jlac.5.1.04ber 4. Cabrera, O.A., Adler, A.B., Bliese, P.D.: Growth mixture modeling of post-combat aggression: application to soldiers deployed to Iraq. Psychiatry Res. 246, 539–544 (2016). https://doi.org/10.1016/j.psychres.2016.10.035 5. Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception – behavior link and social interaction. J. Pers. Soc. Psychol. 76(6), 893–910 (1999). https://doi.org/10.1037/0022-3514. 76.6.893 6. Coates, J.: Women, Men and Language: A Sociolinguistic Account of Sex Differences in Language. Longman, London (1993) 7. Enikolopov, S.N., Kuznecova, Y.M., Chudova, N.V.: Aggression in Everyday Live [Agressiya v obydennoj zhizni]. Politicheskaya ehnciklopediya, Moscow (2014). (in Russian) 8. Frederiksen, K.S., Waldemar, G.: Aggression, agitation, hyperactivity, and irritability. In: Verdelho, A., Gonçalves-Pereira, M. (eds.) Neuropsychiatric Symptoms of Cognitive Impairment and Dementia. NSND, pp. 199–236. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-39138-0_9 9. Gerritsen, C., van Breda, W.R.J.: Simulation-based prediction and analysis of collective emotional states. In: Meiselwitz, G. (ed.) SCSM 2015. LNCS, vol. 9182, pp. 118–126. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20367-6_13 10. Glasl, F.: Selbsthilfe in Konflikten. Konzepte, Übungen, praktische Methoden. Freies Geistesleben, Stuttgart (2002). (in German) 11. Göçtüa, R., Kir, M.: Gender studies in English, Turkish and Georgian languages in terms of grammatical, semantic and pragmatic levels. Procedia Soc. Behav. Sci. 158, 282–287 (2014) 12. Goroshko, O.: Differentiation in male and female speech style. Open Society Institute, Budapest (1999). (in Russian) 13. Darania, L.H., Darani, H.H.: Language and gender: a prosodic study of Iranian’s talks. Procedia Soc. Behav. Sci. 70, 423–429 (2013). https://doi.org/10.1016/j.sbspro.2013.01.080 14. Kodzasov, S.V.: Research in Russian prosody. LRC Publishing House, Moscow (2009). (in Russian) 15. Komalova, L.R.: Aggressogen discourse: The multilingual aggression verbalization typology. Sputnik+, Moscow (2017). http://elibrary.ru/item.asp?id=28993951. (in Russian) 16. Komalova, L.R.: The auditory-perceptual profile (image) of an aggressor. Vestnik Mosc. State Linguist. Univ. 7(746), 116–126 (2016). http://libranet.linguanet.ru/prk/Vest/746-7n. pdf. (in Russian) 17. Krivnova, O.F.: Speech breathing factor in the intonational-pausal speech articulation. In: Vinogradov, V.A. (ed.) Linguistic Polyphony, pp. 424–444. LRC Publishing House, Moscow (2007). (in Russian) 18. Labov, W.: The interaction of sex and social class in the course of linguistic change. Lang. Cariation Change 2(2), 205–254 (1990) 19. LaMotte, A.D., et al.: Sleep problems and physical pain as moderators of the relationship between PTSD symptoms and aggression in returning veterans. Psychol. Trauma Theor. Res. Pract. Policy 9(1), 113–116 (2017). https://doi.org/10.1037/tra0000178 20. Mathes, B.M., Portero, A.K., Gibby, B.A., King, S.L., Raines, A.M., Schmidt, N.B.: Interpersonal trauma and hoarding: the mediating role of aggression. J. Affect. Disord. 227, 512–516 (2018). https://doi.org/10.1016/j.jad.2017.11.062
21. Milroy, L., Milroy, L.: Mechanisms of change in urban dialects: the role of class, social network and gender. Int. J. Appl. Linguist. 3(1), 57–77 (1997) 22. Murashova, L.P., Pravikova, L.V.: Critical analysis of gender studies in western linguistics. Language and Culture 1(33), 33–42 (2016). https://doi.org/10.17223/19996195/33/3, https:// elibrary.ru/item.asp?id=25693295. (in Russian) 23. Nau, Ch., Stewart, C.O.: Effects of verbal aggression and party identification bias on perceptions of political speakers. J. Lang. Soc. Psychol. 33(5), 526–536 (2014). https://doi. org/10.1177/0261927x13512486 24. Khalida, N., Sholpan, Z., Bauyrzhan, B., Ainash, B.: Language and gender in political discourse (Mass media interviews). Procedia Soc. Behav. Sci. 70, 417–422 (2013). 10.1016/j.sbspro.2013.01.079 25. Potapov, V.: Multilevel strategy in linguistic gendorology. Voprosy Jazykoznanija (Top. Stud. Lang.) 1, 103–130 (2002). http://www.ruslang.ru/doc/voprosy/voprosy2002-1. pdf. (in Russian) 26. Potapov, V., Potapova, R., Komalova, L.: The perceived speech prosodic image of “aggressor”: dialog communication gender features. In: Masalóva, S., Polyakov, V., Solovyev, V. (eds.) Cognitive Modeling: The V International Forum on Cognitive Modeling. Part 1: Cognitive Modeling in Linguistics: Proceedings of the XVIII International Conference «Cognitive Modeling in Linguistics. CML-2017», pp. 147–154. Science and Studies Foundation (2017). https://elibrary.ru/item.asp?id=32559206 27. Potapova, R., Komalova, L.: Auditory-perceptual recognition of the emotional state of aggression. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 89–95. Springer, Cham (2015). https://doi.org/10.1007/978-3-31923132-7_11 28. Potapova, R.K., Komalova, L.R.: Gender-based perception of verbal realization of the emotional state of aggression. Human being: image and essence. Humanit. Aspects 26, 169– 180 (2015). https://elibrary.ru/item.asp?id=25224664. (In Russian) 29. Potapova, R., Potapov, V.: Kommunikative Sprechtaetigkeit: Russland u. Deutchland im Vergleich. Boehlau Verlag, Koeln; Weimar; Wien (2011). (In German) 30. Potapova, R.K., Potapov, V.V.: Speech Communication: From the Sound to the Utterance. LRC Publishing House, Moscow (2012). (in Russian) 31. Potapova, R.K., Potapov, V.V., Lebedeva, N.N., Agibalova, T.V.: Interdisciplinarity in Researching of Speech Polyinformativity. LRC Publishing House, Moscow (2015). (in Russian) 32. Samara, A., Smith, K., Brown, H., Wonnacott, E.: Acquiring variation in an artificial language: children and adults are sensitive to socially conditioned linguistic variation. Cogn. Psychol. 94, 85–114 (2017). https://doi.org/10.1016/j.cogpsych.2017.02.004 33. Sboev, A., Moloshnikov, I., Gudovshikh, D., Selivanov, A., Rybka, R., Litvinova, T.: Automatic gender identification of author of Russian text by machine learning and neural net algorithms in case of gender deception. Procedia Comput. Sci. 123, 417–423 (2018). https://doi.org/10.1016/j.procs.2018.01.064 34. Smokowski, P.R., Guo, S.Y., Evans, C.B.R., Wu, Q., Rose, R.A., Bacallao, M., Cotter, K.L.: Risk and protective factors across multiple microsystems associated with internalizing symptoms and aggressive behavior in rural adolescents: modeling longitudinal trajectories from the rural adaptation project. Am. J. Orthopsychiatr. 87(1), 94–108 (2017). https://doi. org/10.1037/ort0000163
35. Urben, S., Habersaat, S., Pihet, S., Suter, M., de Ridder, J., Stephan, P.: Specific contributions of age of onset, callous-unemotional traits and impulsivity to reactive and proactive aggression in youths with conduct disorders. Psychiatr. Q. 89(1), 1–10 (2018). https://doi.org/10.1007/s11126-017-9506-y 36. Zapolski, T.C.B., Banks, D.E., Lau, K.S.L., Aalsma, M.C.: Perceived police injustice, moral disengagement, and aggression among juvenile offenders: utilizing the general strain theory model. Child Psychiatr. Hum. Dev. 49(2), 290–297 (2018). https://doi.org/10.1007/s10578017-0750-z 37. Zlatoustova, L.V.: Some comments on the speech breathing. In: Zvegintsev, V.A. (ed.) Studies on the Speech Information, vol. 2, Moscow (1968). (in Russian)
Main Determinants of the Acmeologic Personality Profiling
Rodmonga Potapova1(✉) and Vsevolod Potapov2
1 Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, Ostozhenka 38, Moscow 119034, Russia
[email protected]
2 Faculty of Philology, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow 119991, Russia
[email protected]
Abstract. This investigation aims at establishing a set of voice and speech personality identification features that predict some phonation and articulation gestures in regard to lexical-semantic, phonological, anthropometrical, acoustical, physiological, psychological, emotional and intellectual peculiarities of the "electronic personality" on the Internet and other automated digital communication means and devices. This problem is very significant for forensic investigations in the domain of speech communication. It is proposed to undertake special studies to find solutions to these problems in the field of forensic phonetics. In forensic application of the "electronic personality", it is necessary to be able to specify a temporal dynamics factor for a decision concerning acmeologic quantitative and qualitative changes of the personality in time.
Keywords: Social-Network Discourse · Relevant personality features · Forensic emotional personality profiling in dynamics · Acmeologic personality profiling · Perceptual-auditory analysis
1 Introduction
The development of the social-network discourse (SND) investigations on the Internet involves studying the mechanism of dependence between the acoustic prosodic-semantic interpretation of the speech utterance by the speaker and processing of the discourse construction by the listener considering such factors as: the cognitive-verbal base of the communicants' idiosyncratic peculiarities; the multimodal (verbal, paraverbal, non-verbal, extra-verbal) structure of coding (stimulus generation) and decoding (reaction to stimulus) of communication process items by the communicants [10, 11, 15–18]; the multi-level (phonological-phonetic, syntactic-semantic and pragmalinguistic) structure of verbal coding (speech stimulus) and decoding (speech reaction to the stimulus) of the process by the communicants; paraverbal (emotional, emotional-modal and connotative) components of the speech stimulus and speech reaction to this stimulus (in the communication act); extraverbal (situational, individual – idiosyncratic, idiolectal [6], sociolectal, etc.) constituents of the speech stimulus and speech reaction
to the utterance taking into account the role of presupposition, as well as the recipient's previous experience in a particular subject area [15]. Experimental research and modeling of acmeologic variability of spoken social network discourse (SND) forms an important direction of modern communicative variantology [23], and forensic sciences require information about the multimodal individual dynamics of personalities or personality communities [9, 18]. Of particular importance is the above direction in connection with the SND functioning in the information and communication space of the Internet [9, 15, 16, 18]. Analysis of the deep mechanism of the prosodic-semantic variability of the verbal response to the stimulus between communicants within the spoken discourse in respect to the SND requires knowledge in various fields of speechology (spoken language sciences): in general, private and experimental phonetics, cognitive and communicative linguistics, speech acoustics, auditory perception, mathematical statistics, forensic linguistics, etc. [7]. The solution of the problem taking into account the multiversatile variability of the analyzed object includes the following: search for the pronouncing invariant and variants of the prosodic-semantic interpretation of the speech stimulus-utterance in the SND; determining the interaction of the various factors listed above in the process of SND construction; the degree of influence of the above factors on the final verbal product of the SND in the communication act; identification of the prosodic-semantic dominant within the SND; determining the acceptable range of variation of prosodic-semantic variforms – alloprosodosemants; determining speaker profiling, verification and identification; investigation of the variability of the speech and voice characteristics of the communicants, personality dynamics of the individual "portrait" in time with regard to the acmeologic method [15, 16, 18]. The research on the prosodic-semantic variability of multilevel verbal, para-, non- and extraverbal components of spoken discourse involves various modern methods of analysis, synthesis and modeling of sounding speech: acoustic, perceptual-auditory, associative, prosodic-semantic [7]. Acmeologic profiling of communicants on the Internet and in other automated means of communication includes first of all interdisciplinary researches: "detailed phonetic and linguistic description of the verbal behavior of an… individual, …careful analysis of dialectal and sociolectal features, speech defects, age, "voice quality," … a combination of traditional phonetic analysis, techniques, including analytical listening by a phonetician, and modern signal processing techniques…" [4: 80-99].
2 Conceptual Background of the Personality Profiling
Speech activities in the format of spoken social-network discourse (SND) – in particular, based on various modern IP-telephony facilities on the Internet, – can be presented taking into account the following level-by-level components: incentive level: external impact; motive; intent; communicative intention; formation level: the sense-forming phase; deep formation of the space-concept scheme; time (linear) development of the spatial-conceptual scheme of the utterance; formulation level: formulating phrase (choice of words); process of grammatical structuring; realization level: articulatory gestures (articulation); voice modulation (phonation); coarticulation transformations;
acoustic level: transformation of articulatory gestures at the output of the speech-formation system into a sound (acoustic) wave; auditory detection, auditory control and recognition of perceived acoustic stimuli; interpretation level: transformation of acoustic stimuli into verbal images, semantic content realization [9, 10, 12]. In accordance with the expanded understanding of the object of research in speechology, the following techniques and methods can be mentioned: cognitive-communicative analysis of the text; indirect checking of models and hypotheses, for example, by studying speech errors, linguistic reactions, etc.; neurophysiological methods; bioelectric methods; registration and analysis of articulation, for example, using computed tomography, etc. [9]. The study of the realization and monitoring of motor programs should be related to the information processing system in the central and peripheral nervous systems. It is likely that in the central nervous system there is no functional center that would specialize in processing verbal information exclusively. Neural networks processing verbal information also include all functions.
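For readers who prefer a compact overview, the level-by-level components listed at the beginning of this section can be summarized as an ordered structure; the sketch below is purely illustrative, and the identifier names are ours, not the authors'.

```python
from enum import Enum, auto

class SpeechActivityLevel(Enum):
    """Levels of spoken SND production/perception, in processing order."""
    INCENTIVE = auto()       # external impact, motive, intent, communicative intention
    FORMATION = auto()       # sense-forming phase, space-concept scheme of the utterance
    FORMULATION = auto()     # choice of words, grammatical structuring
    REALIZATION = auto()     # articulation, phonation, coarticulation transformations
    ACOUSTIC = auto()        # articulatory gestures transformed into the sound wave
    INTERPRETATION = auto()  # acoustic stimuli decoded into verbal images and meaning

pipeline = list(SpeechActivityLevel)  # order from intent to interpretation
```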
3 SND Communication Analysis Considering Human Speech Functions
In investigations [8, 13–16, 18], the concept of SND was first substantiated based on its definition as a special electronic macro-polylogue with regard to a number of categories of form, content and functional weight. An example is one of the form categories on the basis of the "univector – polyvector" opposition. This opposition is correlated not only on the basis of location of communicative interaction vectors on the Internet, but also on the SND participants' interaction configuration, which is directly dependent on the number of communicants on the Internet. When examining the speech behavior of the speaker in the SND format using, for example, IP-telephony, we proceed from the following postulate: human speech is both a symptom and a signal in relation to the real world: a symptom as a direct psychophysiological response to external stimuli and a signal as a sign language response of neuropsychological nature to stimuli of a more complex behavioral level in the communication act [11]. In this regard, speech is presented as a poly-informative and multifunctional phenomenon. The development of this issue assumes special importance in the study of spoken-speech communication with the help of IP-telephony: to solve a number of problems of forensic examination. "… The latter circumstance naturally led to the fact that phonograms of conversations by the channels of cellular communication and the Internet became objects of investigation for experts in forensic examination of sound recordings" [5: 129, 20]. The pronunciation of the speaker includes a set of specific properties of this individual manifested in formation of the sound flow in the speech apparatus and conditioned by the peculiarities of its structure, the features of the pronunciation-auditory skills, the specifics of thinking, and the formulation of thoughts with the help of linguistic means [10–12]. The speech "portrait" of the speaker includes verbal, paraverbal, non-verbal and extraverbal features. Verbal components refer to such aspects as the language used in the communication process (native, non-native, dialect, vernacular, sociolect, etc.)
For each speaker, an inventory of stable phonetic features is characteristic: pronouncing variants of phonemes, variants of intonemes, etc. Verbal speech features make it possible to determine such components of the speech portrait as nationality, places of the speaker's long residence, level of education, social status, economic status, upbringing, level of language proficiency, profession, level of intellectual skills, etc. It is thought that extraverbal features correspond with anthropometric (structure of the speech apparatus, body weight, height) [4], physiological (gender, age, norm/pathology), psychological (type of higher nervous activity (HNA) [22], emotional-volitional regulation), intellectual (specific thinking, cognitive level) aspects. Accordingly, it is possible to distinguish relatively stable speech extraverbal features in the speaker's speech portrait. Both verbal and extraverbal features have their own acoustic correlates that make it possible to recreate the "portrait" of the speaker. For example, gender and age can be characterized by some acoustic parameters [1]. There are various data for native Russian speakers. According to observations [12], the average value of the pitch frequency dynamics for males aged from 20 to 80 years increases (≈ 100–130 Hz). For females aged 20 to 80 years, the reverse difference in value is observed (≈ 220–180 Hz) [3, 10, 13–18, 21]. Proceeding from the basic premise, according to which the human speech is individually organized on the basis of phonation and articulatory gestures in direct connection with the socially-conditioned phonological representation of the utterance and its lexical and semantic features, it is proposed to conduct an express-analysis of the speaker's speech portrait taking into account the following stages: formation of the databases for correlates of anthropometric features; acoustic correlates of physiological features; acoustic correlates of psychological and emotional-psychological features; acoustic correlates of intellectual features [13–15, 17, 18]. Thus, the acoustic-linguistic algorithm of the speaker identification analysis is constructed taking into account the following stages: acoustic; anatomical-physiological aimed at decoding of the speech signal; socio-psychological aimed at decoding of the speech signal; intellectual-semantic decoded for the speech signal. In this regard, all the tasks can be conditionally characterized as tasks of compiling an individual portrait of the speaker, to which phonation (voice), articulatory segment (motor), prosodic (suprasegment) correlates of the speaker's speech should be attributed. Speech characteristics of the speaker are divided into controlled (external) and uncontrolled (internal) ones. Some experts identify potentially controlled features. The degree of control depends on two factors [2, 11]: the speaker's ability to use auditory and proprioceptive forms of feedback in the implementation of the articulatory program; from his/her perceptual ability to use auditory forms of information to detect auditory differences. Therefore, information about the speaker is hidden in the speech signal, is correlated with his/her anatomical features and is stored at the neuronal level by the muscular speech patterns correlating with the speaker's physique [2, 3, 8].
4 Preliminary Results of the Investigation
When developing expert methods for speaker profiling by speech on the Internet, the following conditions for the speech signal realization are taken into account: speech
should be natural and be varied as much as possible relative to the speakers (interspeaker discrepancies), but rather homogeneous relative to each speaker (intraspeaker discrepancies); at the initial stage of development, the speech should not be influenced by noise, interference, etc., and should include special characteristics of transmission along the technical path; no distortion of the voice is allowed [20]. Particularly informative for speaker attribution by speech is the range of the pitch frequency (ΔF0), which includes, first of all, such parameters as the pitch frequency range width (ΔF0) and its register (very high, high, medium, lower medium, low, very low), which correlates with the following individual characteristics of the speaker: biological differentiation by gender, age, physique; and psychological differences in the speaker's behavior; idiosyncratic (individual) features at the biological, psychological and regional-social levels [6, 8, 10, 14, 17, 18, 21]. Individual features of the speaker are traditionally divided into two groups: acquired and non-acquired. Acquired features include such specific speech features that are formed under the influence of the external conditions of the speaker's life. Among the latter is primarily the process of language acquisition, and then its application in spoken and written communication. In this case, a special role is played by the dialect used by the immediate environment of the individual, especially when, during the phase of speech acquisition, which corresponds approximately to the time of schooling (age up to 18), the speaker lived in various dialectal societies. This includes the social conditions that define the so-called sociolect. The acquired features also include speech features resulting from various harmful factors, for example, smoking, alcohol and drug intoxication [1, 19]. Non-acquired features are correlated with organic-genetic data based on the anatomical and neurophysiological components of the speech apparatus. The latter include the size and spatial configuration (the so-called cavitary configuration) of the neck-laryngeal, nasal and pharyngeal tracts, the mobility and size of the tongue, and in particular the number of boundary conditions depending on the voice formation (the term of mathematics), as well as age and gender. The pitch frequency can vary depending on such factors as loud speaking (for example, in a state of excitement, in noisy conditions (Lombard effect), etc.) In these cases, the pitch frequency changes upwards, and this should be taken into account when describing the speaker. At certain stages of mental illness, the voice can be not only lower, but also much more monotonous (for example, in a state of depression in manic-depressive patients). In speaker attribution by voice, along with the above characteristics, of great importance is information on the voice quality. In this case, features specific to the speaker are found. First of all, one should mention such a qualitative attribute as hoarseness. Here most informative is not this feature in itself, but rather its distribution in the speech flow: this phenomenon can occur where the voice for purely linguistic reasons is lowered, i.e. at the end of sentences and other syntactic or semantic units. In a number of speakers with low voices or voice pathology (for example, due to inflammation of the larynx, a tumor or nodes in the larynx, etc.), this symptom may appear in various other positions of the speech flow [8].
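As an illustration of how the pitch-range parameters described above might be operationalized, the sketch below computes the range width ΔF0 and assigns one of the register labels from the text; the numeric register boundaries are assumptions chosen for the example, not values given by the authors.

```python
def f0_register(f0_values_hz):
    """Classify a speaker's pitch register from F0 samples (Hz).

    The register labels follow the text; the numeric boundaries below are
    illustrative assumptions only.
    """
    f0_mean = sum(f0_values_hz) / len(f0_values_hz)
    delta_f0 = max(f0_values_hz) - min(f0_values_hz)  # pitch frequency range width
    boundaries = [
        (120, "very low"), (150, "low"), (180, "lower medium"),
        (220, "medium"), (300, "high"),
    ]
    register = next((name for limit, name in boundaries if f0_mean < limit),
                    "very high")
    return register, delta_f0

print(f0_register([95, 110, 130, 160]))  # -> ('low', 65) for a male-like sample
```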
In the speaker attribution process, the rate of speech formation is also informative. The average speech rate for all languages is about 4.5–5 syllables per second. Extreme values are 3.2–7.5 syllables. Higher rate
leads to incomplete articulation or complete loss of sounds, syllables and even whole words. As an example, the following requirements should be given that characterize the speaker's portrait by voice and speech: physical: gender, age, height, weight; civil status: parents, their mother tongue, origin, social status, etc.; linguistic: native/non-native, literary/non-literary, regional/dialectal language; educational: length of study (primary, secondary, higher, etc.); geographical: place of long-term residence (if there are some, then indicate periods of residence); professional: work by profession/not by profession; auditory: state of hearing, presence/absence of pathology; medical: chronic/non-chronic diseases; voice: trained, singing, smoking, stressful, etc. voice; musical: musical information, etc.; hobby: sports profile, musical profile, etc. Thus, in the SND the number of communicants' characteristics determined by acoustic data in IP telephony, can include the following characteristics: social: by level of education; social status; sphere of activity; physical characteristics; emotional characteristics; regional characteristics: place of birth; place of long residence; nationality; additional information; psychological characteristics: mental pathologies; HNA type; character traits; types of intoxication (alcohol, drug); pronouncing characteristics: spontaneous speech; quasi-spontaneous speech; prepared/unprepared reading of text; emotional characteristics: positive, negative, etc. [1, 3, 4, 6, 9, 10, 15, 19, 21, 22]. As an example of personality profiling on the Internet is presented a speaker voice sample recorded hourly and analyzed by expert listeners on the basis of special instructions. It was found that the dynamics of prosodic features (pitch, tempo, and loudness variations) is a reliable diagnostic tool of acmeologic speaker profiling with regard to emotional state changes of this personality from a normal emotional state to agitation, anger, fury, etc. This experiment deals with speakers' emotional state acmeologic profiling during hourly communication on the Internet by means of perceptual-auditory analysis (step by step every ten minutes). The sentences used in the experiment were taken from speech communication dialogues on the Internet: a group of communicants (n = 10), male voices, speakers of 18–25 years old; a group of professional listeners (n = 10). The sentences were taken from a pre-election campaign debate in Russia on the Internet. The listeners were asked to evaluate the pitch, tempo, and loudness dynamics of all voice stimuli. The responses across all the stimuli examples are summarized as mean data. The Figs. 1, 2 and 3 show the dynamics of mean pitch, loudness, and speech rate data during one-hour recording. The experiment involved the perceptual-auditory evaluation of such voice features of the listener as: pitch (very low, low, lower medium, medium, high, very high); speech rate (very slow/slower, slow, slowed, moderate, fast, very fast); loudness (subaudible, very slow, low, middle loud, loud, very loud). All conversations data were recorded, copied to a CD and sent to listeners with instructions to define pitch, speech rate, and loudness characteristics of every subject.
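The requirements listed above amount to a structured speaker profile; a minimal sketch of such a record is given below, with field names paraphrasing the listed attributes (the selection and naming are illustrative assumptions, not a schema used by the authors).

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerPortrait:
    """Illustrative container for the portrait attributes listed above."""
    gender: str = ""
    age: int = 0
    height_cm: int = 0
    weight_kg: int = 0
    native_language: str = ""
    language_variety: str = ""          # literary / non-literary, regional / dialectal
    education_years: int = 0
    long_term_residence: list = field(default_factory=list)
    profession: str = ""
    hearing_pathology: bool = False
    chronic_diseases: list = field(default_factory=list)
    voice_notes: str = ""               # trained, singing, smoking, stressful, ...
    hobbies: list = field(default_factory=list)

portrait = SpeakerPortrait(gender="male", age=24, native_language="Russian")
```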
Fig. 1. Mean perceptual-auditory data of the pitch evaluation and data scope zones (ω = F0max – F0min) regarding hourly dynamics of the speech characterization features (in ten-minute steps).
Fig. 2. Mean perceptual-auditory data of the speech rate evaluation and data scope zones (ω = tmax – tmin) regarding hourly dynamics of the speech characterization features (in ten-minute steps).
Fig. 3. Mean perceptual-auditory data of the loudness evaluation and data scope zones (ω = Imax – Imin) regarding hourly dynamics of the speech characterization features (in ten-minute steps).
It can be concluded that the acoustic stimuli taken every ten minutes provided enough information to distinguish speakers whose speech showed no aggressive behavior dynamics from those whose speech did. It is known that pitch, intensity, and speaking rate, as well as the corresponding perceptual-auditory features, are affected, e.g., by aggressive emotions. The acoustic correlates of this aggressive behavior pose some challenges in defining the real emotional state, and the acoustic measurements could not always be reliably interpreted with regard to the acoustic speech signals. The optimization of the experimental method lies in combining perceptual-auditory and acoustic analysis on the basis of fundamental interdisciplinary speech research, with regard to acmeologic personality profiling on the Internet. As an example of an emotional personality profiling vector, a speaker voice sample is presented which was recorded during one hour and analyzed by expert listeners on the basis of special instructions. It was found that the dynamics of prosodic features (pitch, speech rate, and loudness variations) is a robust diagnostic tool for acmeologic speaker profiling with regard to changes in the personality's emotional state in connection with psychological, social, physical and other factors.
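The quantities plotted in Figs. 1, 2 and 3 (mean perceptual ratings and scope zones ω = max − min per ten-minute step) can be computed as sketched below; the ratings here are hypothetical stand-ins on an ordinal scale, not the study's data.

```python
from statistics import mean

# Hypothetical perceptual ratings (e.g., 1 = very low ... 6 = very high),
# one list of listener ratings per ten-minute step of the recording.
ratings_per_step = {
    0: [3, 3, 4, 2], 10: [4, 4, 3, 4], 20: [4, 5, 5, 4],
    30: [5, 5, 4, 5], 40: [5, 6, 5, 5], 50: [6, 5, 6, 6],
}

def profile_dynamics(ratings_per_step):
    """Mean rating and scope zone (omega = max - min) per ten-minute step."""
    return {
        minute: {"mean": mean(vals), "omega": max(vals) - min(vals)}
        for minute, vals in ratings_per_step.items()
    }

for minute, stats in profile_dynamics(ratings_per_step).items():
    print(f"{minute:2d} min: mean = {stats['mean']:.2f}, omega = {stats['omega']}")
```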
5 Conclusion
Thus, the study of the speech variability process on the Internet is a task of immense complexity connected, on the one hand, with the articulatory-acoustic specifics of spoken speech and its perceptual-auditory and acoustic characteristics, and, on the other hand,
with the specifics of constructing any utterance taking into account the prosodic-semantic variability of the speech product itself. At the same time, the transmission of a high-quality speech signal with regard to IP telephony (more precisely, Voice over IP (VoIP), etc.), due to the specificity of encoding, compression and packaging of the speech signal into IP packets, may be a kind of obstacle to the successful solution of the task [5]. An analog voice signal digitized by the PCM method and compressed by codecs to eliminate redundancy undoubtedly undergoes certain changes at the output. As the results of the preliminary study [20] have shown, the prospect of using special software for establishing acoustic and perceptual-auditory equivalence with some degree of probability is quite promising for solving the problems of "electronic personality" profiling in the Internet information and communication environment. The above-described correlations between the SND characteristics of speakers on the Internet and their speech reactions to communication stimuli in Internet dynamics make it possible to undertake further research in the field of acmeologic personality profiling on the Internet and other speech communication transmission devices.
Acknowledgements. This research is supported by the Russian Science Foundation, Project № 18-18-00477.
References 1. Braun, A.: Sprechstimmlage und regionale Umgangssprache. In: Braun A. (ed.). Beitraege zu Linguistik und Phonetik. Festschrift fuer Ioachim Goeschel zum 70 Geburtstag, pp. 453– 463. Stuttgart (2001) 2. Brown, R.: Auditory Speaker Recognition. Helmut Buske Verlag, Hamburg (1987) 3. Brown, W.S., Morris, R., Hollien, H., Howell, H.F.: Speaking fundamental frequency characteristics as a function of age and professional singing. J. Voice 3, 310–313 (1991) 4. Kuenzel, H.J.: How well does average fundamental frequency correlate with speaker height and weight? Phonetica 46, 117–125 (1989) 5. Mikhailov, V.G.: Features of the formation and analysis of voice signals transmitted by means of IP-telephony. Theor. Pract. Forensic Exam. 3(7), 129–140 (2007). (in Russia) 6. Oksaar, E.: Idiolekt als Grundlage der variationsorientierten Linguistik. Sociolinguistica 14, 37–41 (2000) 7. Potapova, R.K.: Linguistic and paralinguistic functions of prosody (On the experience of searching for prosodo-semanthema). In: Kedrova, G.E., Potapov, V.V. (eds.) Language and Speech: Problems and Solutions, pp. 117–137. MAKS Press, Moscow (2004) (in Russia) 8. Potapova, R.K.: Some observations on artificially modified speech. In: Ideas and Methods of Experimental Speech Study. I.P. Pavlov Institute of Physiology (Russian Academy of Sciences), State University, Sankt-Petersburg, pp. 124–135 (2008). (in Russia) 9. Potapova, R.K.: Speech: Communication, Information, Cybernetics, 4th edn. Book house “Librocom”, Moscow (2015). (in Russia) 10. Potapova, R.K., Potapov, V.V.: Language, Speech, Personality. Languages of Slavic Culture, Moscow (2006). (in Russia) 11. Potapova, R., Potapov, V.: Kommunikative Sprechtaetigkeit. Russland und Deutschland im Vergleich. Boehlau Verlag, Koeln (2011) 12. Potapova, R.K., Potapov, V.V.: Speech Communication: From Sound to Utterance. Languages of Slavic Cultures, Moscow (2012). (in Russia)
13. Potapova, R., Potapov, V.: Auditory and visual recognition of emotional behaviour of foreign language subjects (by native and non-native speakers). In: Železný, M., Habernal, I., Ronzhin, A. (eds.) SPECOM 2013. LNCS (LNAI), vol. 8113, pp. 62–69. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-01931-4_9 14. Potapova, R., Potapov, V.: Cognitive mechanism of semantic content decoding of spoken discourse in noise. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 153–160. Springer, Cham (2015). https://doi.org/10.1007/9783-319-23132-7_19 15. Potapova, R., Potapov, V.: On individual polyinformativity of speech and voice regarding speakers auditive attribution (forensic phonetic aspect). In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 507–514. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_61 16. Potapova, R., Potapov, V.: Polybasic attribution of social network discourse. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 539–546. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_65 17. Potapova, R., Potapov, V.: Cognitive entropy in the perceptual-auditory evaluation of emotional modal states of foreign language communication partner. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 253–261. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-66429-3_24 18. Potapova, R., Potapov, V.: Human as acmeologic entity in social network discourse (multidimensional approach). In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 407–416. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-66429-3_40 19. Potapova, R.K., Potapov, V.V., Lebedeva, N.N., Agibalova, T.V.: Interdisciplinarity in the Study of Speech Polyinformativity. Languages of Slavic Culture, Moscow (2015). (in Russian) 20. Potapova, R., Sobakin, A., Maslov, A.: On the possibility of the skype channel speaker identification (on the basis of acoustic parameters). In: Ronzhin, A., Potapova, R., Delic, V. (eds.) SPECOM 2014. LNCS (LNAI), vol. 8773, pp. 329–336. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11581-8_41 21. Ryan, E.B., Capadano, I., Harry, L.: Age perceptions and evaluative reactions toward adult speakers. J. Gerontol. 33, 98–102 (1978) 22. Sharp, D.: Personality Types: Jung’s Model of Typology. Inner City Books, Toronto (1987) 23. Titscher, S., Meyer, M., Vetter, E., Wodak, R.: Methods of Text and Discourse Analysis. Sage, London (2000)
Studying Mutual Phonetic Influence with a Web-Based Spoken Dialogue System
Eran Raveh1,2(B), Ingmar Steiner1,2,3, Iona Gessinger1,2, and Bernd Möbius1
1 Language Science and Technology, Saarland University, Saarbrücken, Germany
2 Multimodal Computing and Interaction, Saarland University, Saarbrücken, Germany
[email protected]
3 German Research Center for Artificial Intelligence (DFKI GmbH), Saarbrücken, Germany
Abstract. This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user's speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.
Keywords: Spoken dialogue systems · Phonetic convergence · Human-computer interfaces
1 Introduction
With expanding research on, and growing use of, spoken dialogue systems (SDSs), a main challenge in the development of human-computer interaction (HCI) systems of this kind is making them as close as possible to human-human interaction (HHI) in terms of naturalness, fluency, and efficiency. One aspect of such HHIs is the relationship of mutual influences between the interlocutors. Influence here means changes in one interlocutor's conversational behavior triggered by the behavior of the other interlocutor. We refer to changes that make the interlocutors' behaviors more similar as convergence. Convergence can
occur in different modalities and with respect to various aspects of the conversation, like eye gaze, gestures, lexical choices, body language, and more. In this paper, we concentrate on phonetic-level influences, i.e., phonetic convergence. More specifically, we examine pronunciation variations over the course of HCIs. As speech is the principal modality used for interacting with SDSs, we believe it is an especially important modality to study in the field of HCI. Simulating and triggering convergence on the phonetic level, as found in HHI, may contribute a lot to the naturalness of dialogues of humans with computers. SDSs with such personalized speech style are expected to offer more natural and efficient interactions, and move one more step away from the interface metaphor [5] toward the human metaphor [3]. The novel system introduced in Sect. 3 tracks the states of segment-level phonetic features during the dialogue. All of the analyses are automated and run in real time. This not only saves time and manual work typically needed in convergence studies, but also makes the system more suitable for integration into other applications. In Sect. 4, we use this newly introduced system with recordings collected as part of a shadowing experiment to examine the relationship of mutual influences between a (simulated) user and the system. Using these signals, the system provides both visual and numerical evidence of the mutual influences between the interlocutors over the course of the interaction. The system itself will be made freely available under an open-source license.
2 Background and Related Work
Integrating support for changes in the speech signal into computer systems may enhance HCI and provide improved tools for studying convergence in HCI. [18] discusses the advantages of systems that dynamically adapt their speech output to that of the user, and the challenges involved in developing and using these systems.
2.1 Phonetic Convergence
According to [19], phonetic convergence is defined as an increase in segmental and suprasegmental similarity between two interlocutors (e.g., [27]). In contrast to entrainment, we use the term convergence to describe dynamic, mutual, and non-imposing changes. Phonetic convergence has been found to various extent in conversational settings [13]. There is evidence for phonetic convergence being both an internal mechanism [21] and socially motivated [9]. Previous studies of phonetic convergence in spontaneous dyadic conversations have focused on speech rate [26], timing-related phenomena [23], pitch [8], intensity [12], and perceived attractiveness [16]. Phonetic convergence is often examined in the scope of shadowing experiments, in which the participants are asked to produce certain utterances after hearing them produced in some stimuli (e.g., [7]). This is typically done with single target words embedded in a carrier sentence. The experiment showcasing our system in Sect. 4 uses whole sentences as stimuli, in which the target features are embedded, making it a semi-conversational HCI setting.
2.2 Adaptive Spoken Dialogue Systems
Various studies have investigated entrainment and priming in SDSs, aiming to better understand HCI dynamics and improve task-completion performance. [15], for example, focused on dynamic entrainment and adaptation on the lexical level. Others, like [17], concentrated on word frequency. [20] examined changes in both lexical choice and word frequency. While these studies addressed the changes in experimental, scripted scenarios, the theoretical foundations for studying these changes in spontaneous dialogue exist as well [2]. [6] provide examples of online adaptation for dialogue policies and belief tracking. It is important to note that while all of the studies mentioned above examine various aspects of dialogues, none of those are related to speech – the primary modality used to interact with SDSs. Studying convergence of speech in an HCI context is made possible with more natural synthesis technology, which gives fine-grained control over parameters of the system’s spoken output. Many systems that deal with adaptation of speech-related features focus on prosodic characteristics like intonation or speech rate. [10] sheds light on acoustic-prosodic entrainment in both HHI and HCI via the use of interactive avatars. [1] found that users’ speech rate can be manipulated using a simulated SDS. Similar results were found when intensity changes in children’s interaction with synthesized text-to-speech (TTS) output were examined [4]. All of the above provide solid ground for further investigation of phonetic convergence in HCI using SDSs.
3 System
The system introduced here is an end-to-end, web-based SDS with a focus on phonetic convergence and its analysis over the course of the interaction. Besides placing convergence in the spotlight, it is designed to be flexible and to meet the researcher’s needs by offering a wide range of customizations (see Sect. 3.2). Its online access via a web browser makes it scalable and simple for the end-user to operate. The system’s architecture and functionality are described in Sect. 3.1, its graphical user interface (GUI) and operation in Sect. 3.3, and an example of its utilization is demonstrated in Sect. 4. Ultimately, it offers an experimentation platform for studying phonetic convergence, with emphasis on the following: Temporal analysis offering real-time visualization of the interlocutors’ relations with respect to selected phonetic features over the course of the interaction. Customizability allowing the user to experiment with different scenarios by configuring parameters and definitions in many of the system’s components. Online scalability connecting multiple web clients to a server, allowing users to use it anywhere without preceding installation and configurations, and helping experimenters to collect and replay acquired data.
3.1 Architecture
As the system aims to offer a customizable playground for experimenting and studying phonetic convergence in HCI, a key aspect of its architecture is the separation between client-side, server-side, and external resources (see Fig. 1). All of the resources and configuration files needed for designing the interaction are located on the server. Running the client and server on different machines allows users to interact with the system using a web browser alone.
Fig. 1. An overview of the system architecture. The background colors distinguish client components, server components, and external resources that can be customized. (Color figure online)
Fig. 2. The architecture of the dialogue system component. The ASP module (dashed line) between the ASR and TTS modules is responsible for performing additional speech processing required for analyzing the phonetic changes. Though additional links between the ASP module and other modules (like NLG for example) could be made, those are beyond the scope of this work.
As shown in Fig. 2, the dialogue system component consists of typical SDS modules such as natural language understanding (NLU) and a dialogue manager (DM), but also contains an additional speech processing (ASP) module [24]. This module is responsible for processing the audio and extracting the features required by the convergence model. While the NLU component uses merely the transcription provided by the ASR, the ASP module analyzes the speech signal
itself. More specifically, it tracks occurrences of the defined features and passes their measured values to the convergence model, which, in turn, forwards the tracked feature parameters to the TTS synthesis component.
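The following Python sketch illustrates, under assumed interfaces, how an ASP-like module could sit between recognition and synthesis. FeatureMeasurement, add_exemplar, current_targets, and set_feature_targets are hypothetical names invented for this sketch, not the system's actual API.

from dataclasses import dataclass

@dataclass
class FeatureMeasurement:
    feature: str      # e.g. "E: vs. e:"
    turn: int
    value: float      # e.g. a formant value in Hz

class ASPModule:
    def __init__(self, convergence_model, tts):
        self.model = convergence_model
        self.tts = tts

    def on_user_turn(self, audio, segments, turn):
        """Called with the raw audio and the ASR segment alignment."""
        for seg in segments:
            value = self.extract_feature_value(audio, seg)  # acoustic analysis
            if value is not None:
                self.model.add_exemplar(
                    FeatureMeasurement(seg.feature, turn, value))  # update pool
        # Forward the (possibly shifted) feature targets to the synthesizer.
        self.tts.set_feature_targets(self.model.current_targets())

    def extract_feature_value(self, audio, seg):
        # Placeholder: formant or duration measurement for the segment.
        ...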
3.2 Models and Customizations
The computational model for phonetic convergence used in the system is described in [25]. Different phonetic convergence behavioral patterns that were observed in HHI and HCI experiments can be simulated by combinations of the model's parameters presented in Table 1. All of the parameters can be modified in the system's configuration file.

Table 1. Summary of the computational model's parameters in their order of application in the convergence pipeline. Parameters marked with an asterisk '*' are defined for each feature independently.

Parameter            Description
allowed range*       allowed value range for new instances
history size         maximum number of exemplars in the pool
update frequency     frequency at which the feature's value is recalculated
calculation method*  method used to calculate the pool value
convergence rate     weight given to the pool value when recalculating
convergence limit*   maximum degree of convergence allowed
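As an illustration only, a configuration entry for a single feature could look like the following Python mapping. The key names mirror Table 1, but the values and the syntax are invented for this sketch and do not reproduce the system's actual configuration file.

# Hypothetical per-feature configuration, keys mirroring Table 1.
FEATURE_CONFIG = {
    "E:_vs_e:": {
        "allowed_range": (1800, 2600),  # Hz (e.g., F2); plausible-value filter
        "history_size": 50,             # max exemplars kept in the pool
        "update_frequency": 5,          # recalculate after every 5 new instances
        "calculation_method": "mean",   # how the pool value is computed
        "convergence_rate": 0.3,        # weight of the pool value when recalculating
        "convergence_limit": 0.8,       # max degree of convergence allowed
    }
}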
The entire convergence process is based on the tracked phonetic features that are considered "convergeable", i.e., prone to variation, and is triggered whenever the ASR component detects a segment containing a phoneme associated with one or more of these features. Each feature is defined by a key-value map in which the parameters from Table 1 are configured. A classifier can be associated with each feature to provide real-time predictions of both the user's and the system's realizations of that feature, as demonstrated in Fig. 3. With this information available, more meaningful insights can be gained into the dynamics of phonetic changes in the dialogue.

The dialogue domain is specified in an XML-based file; more details on the domain file format can be found in [14]. This format makes it easy to define new scenarios for the system, such as a task-specific dialogue, a general-purpose chat, or an experimental setup.

Speech processing is another central aspect of the system. Different models can be plugged in, e.g., to improve performance, to change the language of the ASR module, or to change the output voice of the TTS module.
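A minimal sketch of how such a feature definition could drive an exemplar pool is shown below. It follows the mechanism described in the text (bounded history, periodic recalculation, weighted update, convergence limit) but is an illustration under assumed names, not the published model of [25].

from collections import deque
from statistics import mean, median

class FeaturePool:
    def __init__(self, cfg, initial_value):
        self.cfg = cfg
        self.initial = initial_value          # system's own starting realization
        self.current = initial_value          # value currently sent to the TTS
        self.pool = deque(maxlen=cfg["history_size"])
        self.seen = 0

    def add_exemplar(self, value):
        lo, hi = self.cfg["allowed_range"]
        if not (lo <= value <= hi):           # discard implausible measurements
            return
        self.pool.append(value)
        self.seen += 1
        if self.seen % self.cfg["update_frequency"] == 0:
            self.update()

    def update(self):
        calc = mean if self.cfg["calculation_method"] == "mean" else median
        pool_value = calc(self.pool)
        rate = self.cfg["convergence_rate"]
        candidate = (1 - rate) * self.current + rate * pool_value
        # Never move further from the initial value than the convergence limit
        # allows (a fraction of the distance toward the pool value).
        limit = self.cfg["convergence_limit"] * (pool_value - self.initial)
        if abs(candidate - self.initial) > abs(limit):
            candidate = self.initial + limit
        self.current = candidate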
3.3 Graphical User Interface
The system’s GUI consists of three main areas:
Fig. 3. A screenshot of the plot area showing the states of the feature [E:] vs. [e:] (in 2-dimensional formant space) during an interaction. The system’s internal convergence model (orange, bottom right) gradually adapts to the user’s (blue, upper left) detected realizations. A prediction of the feature’s current realization is given for both interlocutors. The annotation box marks the turn in which the system has aggregated enough evidence from the user’s utterances and changes its pronunciation from [E:] (its initial state) to [e:] (the user’s preferred variation). (Color figure online)
In the chat area, the interaction between the user and the system is shown in a chat-like representation. Each turn's utterance appears inside a chat bubble, with different colors and orientations for the user and the system. The turns are also numbered, to better track the dialogue progress and the analysis shown by the plots in the graph area. It is also possible to replay the utterance of a turn by clicking the "Play" button in its corresponding bubble.

In the interaction area, the user can interact with the system via written or spoken input. Text-based interactions progress through the dialogue (if applicable) and trigger any subsequent domain model, but do not affect the convergence-related models, since there is no audio input to process. Spoken input can be provided either by speaking into the microphone or via audio files with pre-recorded speech. The latter option is especially useful for simulating specific user input, or for reproducing a previous experiment, as done in Sect. 4.

In the graph area, each of the tracked features is visualized in a separate plot, and new data points are added whenever a new instance of the feature is detected. Hovering over a data point in a graph reveals additional information, such as the turn in which it was added, or the realized variant of the feature produced in that turn as predicted by its classifier. These dynamic, interactive plots make it possible to shed light on how the interlocutors influence each other, whether or not they are aware of it, throughout their exchanges. Figure 3 shows such a graph with several accumulated data points.
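The kind of per-feature plot shown in the graph area can be approximated offline with a few lines of matplotlib. The sketch below assumes (turn, F1, F2) tuples as input and is illustrative only, not the GUI's actual plotting code.

import matplotlib.pyplot as plt

def plot_feature(user_points, system_points, annotated_turn=None):
    """Each argument is a list of (turn, F1, F2) tuples for one feature."""
    fig, ax = plt.subplots()
    ax.scatter([p[2] for p in user_points], [p[1] for p in user_points],
               label="user", color="tab:blue")
    ax.scatter([p[2] for p in system_points], [p[1] for p in system_points],
               label="system", color="tab:orange")
    for turn, f1, f2 in user_points + system_points:
        ax.annotate(str(turn), (f2, f1), fontsize=8)   # turn numbers as labels
    if annotated_turn is not None:
        ax.set_title(f"Pronunciation change detected at turn {annotated_turn}")
    ax.set_xlabel("F2 (Hz)")
    ax.set_ylabel("F1 (Hz)")
    ax.invert_xaxis()
    ax.invert_yaxis()   # conventional vowel-chart orientation
    ax.legend()
    return fig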
4 Showcase: Examining Convergence Behaviors
To demonstrate a possible use of the system, we simulated the shadowing experiment detailed in [7] and used the system's analyses to examine the types of convergence behavior participants show with respect to the features examined in the experiment (see Table 2). This experiment is designed to trigger phonetic convergence by confronting the participants with stimuli in which certain phonetic features are realized in a manner different from their own realizations. The simulation was carried out by building a domain file encoding the experimental procedure, including the transitions between the experiment's phases as well as the flow within each phase. This automates the procedure and adapts it to the participant's pace. Participants were simulated by playing back their recorded speech from the original experiment in the same order. Using the system for this purpose results in an automated, reproducible execution, with additional insights such as classification of feature realizations and dynamic visualizations in the GUI. The classifiers were trained offline on the data points acquired from analyzing the stimuli. However, the system also supports incremental, online re-training whenever requested by the user, for example after every update of the convergence model. For the demonstration presented here, a sequential minimal optimization (SMO) [22] implementation of the support vector machine (SVM) classifier was used for training. Each turn's number and prediction are added as an interactive annotation to the dynamic graph of the relevant features, as shown in Fig. 3. Finally, using the system, the experiment is transformed into an automated dialogue scenario, which enhances its HCI nature.
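As a hedged example of such a realization classifier, the scikit-learn pipeline below trains an SVM on hypothetical F1/F2 measurements; scikit-learn's SVC is backed by libsvm's SMO-type solver and stands in here for the SMO implementation cited above [22]. The data values are invented for illustration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Offline training data: hypothetical formant measurements of the stimuli,
# labeled with the produced variant.
X = np.array([[460, 2200], [470, 2150], [580, 1900], [590, 1950]])  # [F1, F2] in Hz
y = np.array(["e:", "e:", "E:", "E:"])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

# Online use: predict the realized variant of a newly detected instance.
print(clf.predict([[475, 2180]]))   # expected: ['e:']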
Table 2. Examples of stimuli sentences, each containing one target feature.

Sentence                                                        Feature
War das Gerät sehr teuer? (Was the device very expensive?)      [E:] vs. [e:] in word-medial ä
Ich bin süchtig nach Schokolade. (I am addicted to chocolate.)  [Iç] vs. [Ik] in word-final -ig
Wir besuchen euch bald wieder. (We will visit you again soon.)  [n̩] vs. [@n] in word-final -en

4.1 Finding Behavioral Patterns
In this section, we focus on the validation of the feature [E:] vs. [e:] as a representative example of the system's phonetic adaptation capability. Although the classified realization is binary ([E:] or [e:]), the underlying representation used by the model is gradual. Both of these views on the feature can be seen in the graph area, as shown in Fig. 3. The degree of convergence was examined per utterance in the shadowing phase of the experiment. Three main groups emerged, each with a different
behavior: one group of participants showing little to no tendency to converge (changes in ≤10% of their utterances), a second group with varying degrees of convergence (10% to 90%), and a third group of participants who were very sensitive to the stimuli's variation (≥90%). We refer to these groups as Low, Mid, and High, respectively. The feature's classifier was determined on the fly, so that the prediction for each utterance was made based on the type of the stimulus to which the participant was listening. As Table 3 shows, the Low and High groups are both of substantial size, indicating that these two distinct behaviors exist in the data and can be spotted by the system. In addition, we validated the separation between these behaviors. To this end, we regarded the shadowing phase as an annotation task in which the annotators are the predictors of the user's and the system's realizations. Note that 100% similarity would mean complete convergence to every stimulus, which cannot be reasonably expected (cf. [7]). The Cohen's kappa (κ) values¹ of the Low group are expected to be the lowest, as a lesser degree of convergence was found among these participants. By the same logic, the High group is expected to have the highest agreement, and the Mid group to have values between the two. Indeed, this hypothesis holds: weak agreement was found in the Low group, strong agreement in the High group, and a value close to 0 (indicating no consistent behavior) for the Mid group.
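The grouping and the agreement check can be sketched as follows, assuming per-participant convergence ratios and per-utterance realization predictions. The thresholds follow the text, while cohen_kappa_score from scikit-learn stands in for the irr R package used in the original analysis; the example data are hypothetical.

from sklearn.metrics import cohen_kappa_score

def assign_group(converged_ratio):
    """converged_ratio: fraction of a participant's shadowing utterances
    whose realization changed toward the stimulus."""
    if converged_ratio <= 0.10:
        return "Low"
    if converged_ratio >= 0.90:
        return "High"
    return "Mid"

def group_agreement(user_labels, system_labels):
    """Treat the predicted user realizations and the stimulus/system
    realizations per utterance as two annotators and compute Cohen's kappa."""
    return cohen_kappa_score(user_labels, system_labels)

# Hypothetical usage for one participant:
user = ["e:", "e:", "E:", "e:", "E:"]      # predicted user realizations
system = ["e:", "e:", "E:", "E:", "E:"]    # stimulus realizations
print(assign_group(0.8), group_agreement(user, system))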
5 Conclusion and Future Work
We have introduced a system with an integrated spoken dialogue system (SDS) that can track and analyze mutual influence on the phonetic level during an interaction, based on an internal convergence model. This combines work done in the fields of phonetic convergence and adaptive SDSs, and contributes to the understanding of power relations between human and computer interlocutors. Many aspects of the system are customizable, which makes it flexible in terms of the scenarios it can support. The system can also run on a separate server, which makes it easier to scale its online use. To showcase its capabilities, we simulated a replication of a shadowing experiment that examined phonetic convergence with respect to certain segment-level phonetic features. Three main user behaviors were found with respect to the participants' tendency to change their pronunciation based on the system's stimulus input. This sheds light on possible relations and dynamics between a user and a system in HCI. Running the experiment in this way not only saved time by automating the annotation and phonetic analysis, but also offered additional insights such as visualization and on-the-fly classification. We believe this shows that phonetic convergence can be studied using our SDS, and that it is a step toward personalized, phonetically aware SDSs, which will enable more natural and efficient interaction.
¹ As calculated by the kappa2 function of the irr R package (v0.84), https://cran.r-project.org/package=irr.
Table 3. A summary of the measures for similarity and agreement between the predictor annotations of user and model productions in the shadowing phase.

Group   Similarity (%)   Agreement (κ)   Size (%)
Low
Mid
High
All