<jitter rate="10"/> <articulation>
As modifications to display happiness, the pitch contour is assigned the so-called "wave model" (a fluent up-and-down contour between stressed syllables, see [4] for details) and the duration of the voiceless fricatives is lengthened by 40%. At the same time, the phonation and articulation parameters are altered according to the emotion model defined for sadness, i.e. jitter is added, the vocal effort is set to "soft" and the articulation target values are set to "undershoot".
To generate test samples for evaluation in a systematic fashion, each of Darwin's four "basic emotions" (joy, sadness, fear and anger) was combined with all other emotions and used as primary as well as secondary emotional state. As a reference we added neutral versions, but did not combine neutral with the emotional states. This resulted in 17 conditions (4 emotions by 4, plus neutral). The target phrases were taken from the Berlin emotional database EmoDB [5]; we used two short and two longer ones. All target phrases were synthesized with a male and a female Mbrola German voice (de6 and de7). The resulting number of samples was thus 136 (17 × 4 × 2).
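The combinatorics of the stimulus inventory described above can be enumerated as in the sketch below. Only the counts and the voice names (de6, de7) come from the text; the phrase placeholders and all variable names are illustrative assumptions.

```python
# Hypothetical sketch of the stimulus inventory; phrase names are placeholders.
from itertools import permutations

emotions = ["joy", "sadness", "fear", "anger"]

conditions = list(permutations(emotions, 2))            # 12 ordered pairs of distinct emotions
conditions += [(e, e) for e in emotions]                # 4 "full" single emotions
conditions += [("neutral", "neutral")]                  # neutral reference
assert len(conditions) == 17

phrases = ["phrase1", "phrase2", "phrase3", "phrase4"]  # 2 short + 2 long EmoDB phrases
voices = ["de6", "de7"]                                 # male and female Mbrola German voices

stimuli = [(c, p, v) for c in conditions for p in phrases for v in voices]
print(len(stimuli))                                     # 17 * 4 * 2 = 136
```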
4 Perception Experiment
In a forced-choice listening experiment, 32 listeners (16 males, 16 females, 20–39 years old, mean = 27.26, standard deviation = 3.75) assigned all stimuli to one of the four emotions or "neutral". A second rating was asked for as an "alternative" categorization. The "neutral" category was introduced as a default in case of uncertainty. The evaluation was done with the Speechalyzer toolkit [7]. For playback of the stimuli in randomized order, AKG K-601 headphones were used. A single session took about 40 min.
A validation of the full emotions (256 ratings per category) confirmed the synthesis quality for basic emotions, as all five synthesized categories are labeled on average with 52.4% as intended (see Table 1).

Table 1. Confusion matrix for the single basic emotions only. Primary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Neutr.  Sadn.   F1
Anger      .496*   .156    .117    .211    .020    .536
Fear       .223    .367*   .180    .133    .098    .411
Joy        .066    .180    .383*   .320    .051    .435
Neutral    .043    .039    .082    .582*   .254    .488
Sadness    .023    .043    .000    .141    .793*   .716
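As a quick arithmetic check, the 52.4% average quoted above is the mean of the diagonal of Table 1 (the per-category recall); a minimal verification:

```python
# Average of the diagonal cells of Table 1 (per-category recall).
recalls = {"anger": 0.496, "fear": 0.367, "joy": 0.383, "neutral": 0.582, "sadness": 0.793}
print(sum(recalls.values()) / len(recalls))   # 0.5242 -> 52.4%
```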
The intended complex emotions were categorized with a primary label 3072 times. Excluding all full single emotions, and thus also all primary ratings for "neutral", resulted in 2244 answers. The complex emotions as intended with set 1 (prosody) are recognized most frequently; however, anger is equally often confused with fear (Table 2). A similar confusion matrix for the second intended emotion (voice quality, articulation), however, shows no identification by the listeners except for anger (Table 3). The alternative ratings are dominantly "neutral", indicating difficulties in assigning two separate emotions to the stimuli (Tables 4 and 5).
The remaining data without any "neutral" responses, i.e. actually assigned to the four emotions in question, account for only 38% of the 3072 responses. Still, systematic results are visible (Table 6): within the limits of the responses that actually contain a secondary emotion, combinations of anger and fear as well as fear and sadness are dominantly classified regardless of the assignment of emotions to the feature sets. Joy combined with fear is most often correctly rated when joy is synthesized with prosodic information. In sum, fear was the best performing emotion to be combined with others. Interestingly, all confusions had one emotion in common, whereas the other was dominantly replaced with fear.
5 Discussion
The pure emotions were all recognized above chance. Results for the complex emotions indicate that the prosodic parameters significantly elicit the intended emotion, whereas the second bundle (voice quality and articulation precision) reveals mixed results, even for the primary rating. In particular, the secondary rating was dominantly "neutral". Nevertheless, when analyzing the pairs of non-neutral ratings, the intended complex emotions including fear work especially
well. Even the confusion patterns for the other targets show systematic effects in favor of fear, always retaining one of the intended emotions independently of the feature bundle. Therefore, these results most likely originate in the quality of the material and the evaluation method at the current state of synthesizing complex emotions, and cannot be taken to indicate invalidity of the concept of complex emotions.

Table 2. Confusion matrix for the emotions synthesized with prosody (Emotion Set 1). Primary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Sadness  F1
Anger      .375*   .337    .239    .049     .3866
Fear       .173    .518*   .202    .108     .4977
Joy        .248    .206    .427*   .119     .4340
Sadness    .184    .085    .031    .700*    .6976
Table 3. Confusion matrix for the emotions synthesized with voice quality and articulation (Emotion Set 2). Primary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Sadness  F1
Anger      .343*   .222    .215    .220     .346
Fear       .336*   .176    .238    .250     .163
Joy        .168    .325*   .199    .308     .214
Sadness    .130    .483*   .247    .140     .141
Whereas the results are promising, the ultimate aim to validly synthesize two emotions simultaneously was not fully reached. Apparently, some emotions dominate the perception (fear), and the salience or quality of synthesis does not seem to be equally distributed over the two feature bundles.

Table 4. Confusion matrix for the emotions synthesized with prosody (Emotion Set 1). Secondary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Neutr.  Sadn.   F1
Anger      .123    .196    .066    .511*   .104    .176
Fear       .136    .202    .097    .392*   .173    .277
Joy        .090    .194    .100    .498*   .117    .177
Sadness    .116    .211    .035    .525*   .112    .115
Table 5. Confusion matrix for the emotions synthesized with voice quality and articulation (Emotion Set 2). Secondary rating in % divided by 100 (rows: synthesized emotion; columns: rated emotion; highest value per row marked with *).

Emotion    Anger   Fear    Joy     Neutr.  Sadn.   F1
Anger      .151    .201    .082    .435*   .131    .206
Fear       .104    .234    .067    .475*   .120    .283
Joy        .108    .189    .072    .521*   .110    .136
Sadness    .106    .178    .081    .481*   .153    .172
Table 6. Confusion matrix for the complex emotions, separated for prosodic and non-prosodic feature order. Primary and secondary ratings pooled, in % divided by 100 (rows: intended combination, the first emotion synthesized with prosody and the second with voice quality/articulation; columns: pooled dual-rating pairs; highest value per row marked with *).

Complex emotion   Anger:Fear  Anger:Joy  Anger:Sadn.  Fear:Joy  Fear:Sadn.  Joy:Sadn.
Anger-Fear        .461*       .113       .174         .148      .087        .017
Fear-Anger        .424*       .094       .079         .180      .180        .043
Anger-Joy         .418*       .154       .088         .143      .164        .033
Joy-Anger         .308*       .288       .144         .115      .077        .067
Anger-Sadness     .420*       .037       .074         .247      .198        .025
Sadness-Anger     .067        .053       .400         .000      .413*       .067
Fear-Joy          .195        .076       .042         .288      .373*       .025
Joy-Fear          .181        .108       .072         .349*     .205        .084
Fear-Sadness      .227        .034       .034         .227      .445*       .034
Sadness-Fear      .070        .020       .320         .020      .480*       .090
Joy-Sadness       .108        .054       .068         .243      .324*       .203
Sadness-Joy       .057        .014       .200         .000      .629*       .100
From a methodological point of view, hiding the true aim while assessing two emotions per stimulus seemed to be difficult. However, asking for only one emotion and analyzing the frequencies of replies would require comparable perceptual salience of each emotion involved. Fortunately, judging from conversations with the participants and the high proportion of neutral second ratings, the cover story of asking for a first and an alternative impression worked. As an alternative, openly asking for a mixture of emotions risks inducing effects of social desirability; this might still allow for testing the quality of synthesizing stereotypical emotion combinations, but not for testing the validity of the complex emotions. Therefore, a more sophisticated evaluation paradigm involving social situations in which complex emotions do occur might be more meaningful.
6 Conclusions and Outlook
We described an approach to simultaneously simulate primary and secondary emotional expression in synthesized speech. The approach is based on the combination of different parameter sets within the open-source system "Emofilt", which utilizes the diphone synthesizer "Mbrola". The technique was evaluated in a perception experiment, which showed only partial success. The ultimate aim to validly synthesize two emotions simultaneously was not fully reached, but, as the results are promising, the synthesis quality, especially for voice quality and articulation, needs to be optimized in order to establish comparable strength and naturalness of the emotions over both feature bundles. Especially the simulation of articulation precision, which is done by replacing centralized phonemes with decentralized ones and vice versa [4], could be enhanced by using a different synthesis technique. Data-based synthesis (like diphone synthesis or non-uniform unit-selection synthesis) is not well suited for manipulations of articulation precision or voice quality; in this respect the simulation rules based on prosodic manipulation (set 1) were of course more effective. As unrestricted text-to-speech synthesis is not essential while this is still predominantly a research topic, one possibility would be to use articulatory synthesis, where the parameter sets can be modeled more elaborately by rules. After such optimizations have been tested for quality, an improved evaluation methodology should be applied to study the validity of complex emotions synthesized with "Emofilt".
The approach was successful for emotions that are neighbors in the emotional space spanned by the PAD dimensions pleasure, arousal and dominance. For example, the combinations of sadness and anger as well as fear and sadness share two of the three dimensions and were recognized by the majority of the judges. For future work, one possibility would be to try combinations of emotions that listeners can envisage more easily than a systematic variation, for example by embedding the test sentences into situations that are appropriate for the targeted emotion mix. It would also be interesting to investigate the acoustic manifestation of mixed emotions by analyzing natural data, for example the Vera am Mittag corpus [10]. As this corpus consists of real-life emotional expression occurring in a TV show, mixed emotions are very likely to occur. A set of clear representations would have to be identified in a new labeling process and then analysed for acoustic properties. The outcomes could then be synthesized to validate the findings in a more controlled environment.
References 1. Barra-Chicote, R., Yamagishi, J., King, S., Monero, J.M., Macias-Guarasa, J.: Analysis of statistical parametric and unit-selection speech synthesis systems applied to emotional speech. Speech Commun. 52(5), 394–404 (2010) 2. Berrios, R., Totterdell, P., Kellett, S.: Eliciting mixed emotions: a meta-analysis comparing models, types, and measures. Front. Psychol. 6, 428 (2015) 3. Burkhardt, F.: Simulation emotionaler Sprechweise mit Sprachsynthesesystemen. Shaker (2000) 4. Burkhardt, F.: Emofilt: the simulation of emotional speech by prosody transformation. In: Proceedings of Interspeech. Lisbon (2005) 5. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Proceedings of Interspeech. Lisbon (2005) 6. Burkhardt, F.: An affective spoken story teller. In: Proceedings of Interspeech. Florence (2011) 7. Burkhardt, F.: Fast labeling and transcription with the speechalyzer toolkit. In: Proceedings of LREC (Language Resources Evaluation Conference), Istanbul (2012) 8. Du, S., Tao, Y., Martinez, A.: Compound facial expressions of emotion. Proc. Natl. Acad. Sci. 111(15), E1454–62 (2014) 9. Dutoit, T., Pagel, V., Pierret, N., Bataille, F., Van der Vreken, O.: The MBROLA project: towards a set of high-quality speech synthesizers free of use for noncommercial purposes. In: Proceedings of ICSLP 1996, Philadelphia, vol. 3, pp. 1393–1396 (1996) 10. Grimm, M., Kroschel, K., Narayanan, S.: The Vera am Mittag German audio-visual emotional speech database. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hannover (2008) 11. Latorre, J., et al.: Speech factorization for HMM-TTS based on cluster adaptive training. In: Proceedings of Interspeech. Portland (2012) 12. Lee, Y., Rabiee, A., Lee, S.: Emotional end-to-end neural speech synthesizer. CoRR (2017) 13. Martin, J.C., Niewiadomski, R., Devillers, L., Buisine, S., Pelachaud, C.: Multimodal complex emotions: gesture expressivity and blended facial expressions. Int. J. Humanoid Rob. 3, 269–292 (2006) 14. Murray, I.R., Arnott, J.L.: Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. JASA 93(2), 1097–1107 (1993) 15. Schr¨ oder, M.: Emotional speech synthesis - a review. In: Proceedings of Eurospeech 2001, Aalborg, pp. 561–564 (2001) 16. Schr¨ oder, M., Trouvain, J.: The German text-to-speech synthesis system mary: a tool for research, development and teaching. Int. J. Speech Technol. 6, 365–377 (2003) 17. Tachibana, M., Yamagishi, J., Masuko, T., Kobayashi, T.: Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Trans. Inf. Syst. 88(11), 2484–2491 (2005) 18. Williams, P., Aaker, J.: Can mixed emotions peacefully coexist? J. Consum. Res. 28(4), 636–649 (2002)
An Approach to Automatic Summarization of Television Programs
Marco Canora, Fernando García-Granada, Emilio Sanchis, and Encarna Segarra
Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain
[email protected], {fgarcia,esanchis,esegarra}@dsic.upv.es
Abstract. In this paper we present an approach to document summarization based on unsupervised techniques. We study the adequacy of these techniques for documents in which many topics of different duration are present, in our case the transcriptions of Spanish TV programs. The paper compares a classical Latent Semantic Analysis approach to a new proposal based on Latent Dirichlet Allocation. We also study the application of the summarization process to the different segments obtained in a previous topic segmentation step. The topic segmentation is performed by considering distances between paragraphs, which are represented by means of continuous vectors obtained from the words contained in them. Experiments on TV programs of political and miscellaneous news have been performed.
Keywords: Document summarization · Latent Semantic Analysis · Latent Dirichlet Allocation

1 Introduction
Multimedia content summarization has become an important issue in recent years. Due to the great amount of information available on the web, it is necessary to have tools that help users digest these contents in an easy way. For this reason, summarization techniques are a current goal in Natural Language research [9,14]. Traditionally, summarization methods are classified into two categories: extractive and abstractive. Extractive approaches consist of detecting the most salient sentences, and the generated summary is composed of those sentences, while abstractive approaches try to be more similar to human summaries and generate new sentences that may not be in the original document. Although these latter approaches are, logically, a more ambitious challenge, recent works have shown promising expectations [3,11]. In the framework of extractive approaches, most systems are based on unsupervised learning models. This is the case of Latent Semantic Analysis (LSA) [7] or graph-based methods [4]. Other systems are based on supervised methods such as Recurrent Neural Networks [3], Conditional Random Fields (CRFs) [13], or Support Vector Machines (SVM) [5].
The organization of evaluation competitions has been an important help for the development of this area. This is the case of the DUC (https://duc.nist.gov/) and TAC (https://tac.nist.gov/) conferences, which have become a forum to compare the different approaches. To this end, evaluation corpora have been developed that can be used not only for test purposes but also for training models. Some of the most popular corpora in summarization tasks are the corpus used in DUC and the CNN/DailyMail corpus; the latter has been widely used for learning models in neural network approaches [3].
Other authors have explored summarization considering audio documents as input [6]. This task has the additional problem of dealing with different kinds of errors, such as speech recognition errors and errors in the punctuation of sentences. Moreover, some expressions that appear due to the characteristics of spontaneous speech must be specifically processed, since they might not be relevant for the summary.
In this work, we present an extractive approach to document summarization based on unsupervised techniques, in particular Latent Dirichlet Allocation (LDA) [2]. This approach can be considered topic-based, because topics can be automatically detected and used to determine the most salient sentences according to the topics that appear in the document. Another issue addressed in this work is the summarization of TV programs, in particular a news magazine. Some characteristics of this task pose specific challenges for summarization. Apart from the speech recognition problems, which are not considered in this work, the most interesting problem is that this kind of program has a very variable structure, and usually many topics of different duration are present. We have studied two strategies of summarization: in the first one, the transcription of the program is the input to the summarization system; in the second one, the program is first segmented, and the final summary is obtained from the concatenation of the summaries of each segment. We have performed experiments on Spanish TV programs in order to study the behavior of the proposed techniques.
The paper is organized as follows. In Sect. 2, the different methodologies developed are described. In Sect. 3, a description of the system architecture is presented. In Sect. 4, we show the characteristics of the corpus. In Sect. 5, we present the experimental results, and in Sect. 6, the conclusions and future works are presented.
2 System Description
Given a document, considered as a set of sentences, the objective of an extractive summarization technique is to assign to each sentence a weight that represents its relevance. From this ranked set of sentences the system selects the top ones in order to build the summary.
2.1 Latent Semantic Analysis
Many unsupervised summarization systems are based on LSA. This technique makes it possible to extract representative sentences for the topics automatically detected in the documents. This is done by applying the singular value decomposition to the word-by-sentence matrix: given the word-sentence matrix C, the singular value decomposition generates the matrices U, Σ and V^T,

    C = U Σ V^T,

where V^T represents the association of the underlying topics to the sentences. From this decomposition there are different ways of assigning weights to sentences and then selecting the ones to appear in the summary: some are based on the most salient sentence for each topic, others on a combination of the results of the matrix decomposition. We have chosen the Cross method, which permits extracting more than one sentence associated with the most important topics [12].
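As an illustration of this family of methods, the sketch below scores sentences from the SVD of a word-by-sentence count matrix. It is a generic approximation under assumed data, not the exact Cross method of [12]; the number of topics k and the toy matrix are arbitrary choices.

```python
# Minimal LSA-style sentence scoring sketch (not the actual Cross method).
import numpy as np

def lsa_sentence_scores(C: np.ndarray, k: int = 3) -> np.ndarray:
    """C has shape (n_words, n_sentences); returns one salience score per sentence."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Vt_k = Vt[:k, :]                              # topic-by-sentence rows for the k strongest topics
    return (s[:k, None] * np.abs(Vt_k)).sum(axis=0)   # weight topics by singular values, sum per sentence

C = np.random.rand(50, 10)                        # toy data: 50 word types, 10 sentences
scores = lsa_sentence_scores(C)
summary_idx = np.argsort(scores)[::-1][:2]        # indices of the 2 highest-scoring sentences
```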
2.2 Latent Dirichlet Allocation
Another way of discovering hidden topics in documents is the LDA approach. This methodology has been successfully used for topic identification, and can also be used for summarization purposes. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes a generative process for each document in a corpus, given the a-priori parameters α and β that characterize the probabilistic distributions: for each word in a document, a topic is chosen from the multinomial distribution of topics, and then a word is chosen from the multinomial distribution of words conditioned on the selected topic. In order to use LDA, it is necessary to compute the posterior distribution of the hidden variables given a document; one of the most popular approaches to do this is Gibbs sampling. Once the process is done for a fixed number of topics, two matrices are obtained: one represents the probability that a topic appears in a document, and the other represents the probability that a word belongs to a topic (the word-topic matrix). Once these matrices have been obtained, we use the word-topic matrix to assign a weight to each word in a sentence. From this information we obtain a sentence-topic matrix that is the input to an adaptation of the Cross method used in the LSA approach.
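A comparable sketch with scikit-learn shows how a word-topic matrix obtained from LDA can be turned into rough sentence weights. The weighting here is a simple stand-in for the authors' adaptation of the Cross method, and the toy sentences, the number of topics and all variable names are assumptions of this illustration.

```python
# Hedged sketch: LDA word-topic matrix -> sentence weights (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

sentences = ["el presidente visita valencia",
             "nueva exposicion de arte",
             "la exposicion abre manana"]

vec = CountVectorizer()
X = vec.fit_transform(sentences)                       # sentence-by-word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

word_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # approx. P(word | topic)
sent_topic = np.asarray(X @ word_topic.T)              # rough sentence-by-topic weights
scores = sent_topic.max(axis=1)                        # salience of each sentence
```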
2.3 Document Segmentation
Sometimes, as in our case, the documents to be summarized are long and heterogeneous, that is, they are composed of different sections, each one focused on a different subject. For this reason it can be convenient to split the document into different pieces, which is known as topic segmentation.
The approach that we have developed consists of obtaining vector representations of two consecutive paragraphs and defining a distance between these vectors to decide whether they belong to the same topic or to different topics. An overlapping sliding window of paragraphs across the document then provides the distances between pairs of consecutive paragraphs: at the end of each sentence we calculate the distance between the previous n sentences and the following n sentences. The length of the sliding window is determined experimentally. In order to represent the paragraphs, a semantics-based approach was used, in particular a Word2vec representation [10]. To do this, it was necessary to learn the Word2vec vectors from a large corpus; this was done from Wikipedia articles in Spanish. Once the word representations were obtained, each paragraph was represented by the sum of the vectors of the words contained in it. The measure used to determine the distance between consecutive paragraphs was the cosine distance.
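The sliding-window comparison can be sketched as follows. The window size n, the vector dimension and the pretrained w2v lookup (e.g. word vectors trained on Spanish Wikipedia, as a dict or a gensim KeyedVectors object) are assumptions of this illustration, not project code.

```python
# Sketch of boundary scoring: sum word vectors per window, compare windows with cosine distance.
import numpy as np

def window_vector(sentences, w2v, dim=300):
    v = np.zeros(dim)
    for sent in sentences:
        for word in sent.split():
            if word in w2v:
                v += w2v[word]
    return v

def boundary_scores(sentences, w2v, n=3):
    """Cosine distance between the n sentences before and after each candidate boundary."""
    scores = []
    for i in range(n, len(sentences) - n + 1):
        left = window_vector(sentences[i - n:i], w2v)
        right = window_vector(sentences[i:i + n], w2v)
        denom = np.linalg.norm(left) * np.linalg.norm(right) or 1.0   # guard against zero vectors
        scores.append(1.0 - float(left @ right) / denom)
    return scores
```

High-scoring positions are candidate topic boundaries.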
3 System Architecture
We have explored different approaches to the problem of summarization. Figure 1 shows the architecture of the first system: the documents are the input to the LSA or LDA process, and the matrix obtained is the input to the Cross method.
Fig. 1. Architecture of the system.
Figure 2 shows the architecture of the summarization system when a previous phase of topic segmentation is performed: first, the documents are segmented and each segment is summarized; then, these topic-dependent summaries are concatenated in order to generate the final summary.
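Schematically, this decoupled pipeline can be expressed as in the sketch below; segment_topics and summarize stand for the segmentation and the LSA/LDA summarizer described in Sect. 2 and are assumed callables, not actual project code.

```python
# Schematic sketch of the segment-then-summarize pipeline of Fig. 2.
def summarize_with_segmentation(sentences, segment_topics, summarize, ratio=0.2):
    """Summarize each detected topic segment, then concatenate the partial summaries."""
    summary = []
    for segment in segment_topics(sentences):      # list of lists of sentences
        n = max(1, round(len(segment) * ratio))    # keep roughly 20% of each segment
        summary.extend(summarize(segment, n))
    return summary
```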
4 Corpus Description
The corpus consists of seven Spanish TV news programs including some miscellaneous topics, such as music, gastronomy, culture, etc. We used the correct transcriptions of the speech, in particular the screenplay of the program presenter. It should be noted that the structure of these programs is very heterogeneous: sometimes a sequence of short news items of different topics, of one or two sentences each, is followed by a long sequence of sentences related to a single topic (for example a musical group that presents a new disc, even including interviews with the musicians). Some characteristics of the corpus are shown in Table 1. In order to evaluate the results, a reference summary amounting to 20% of the original document was manually built by an expert for each document.

Fig. 2. Architecture of the system with a previous topic segmentation phase.

Table 1. Corpus characteristics.

Total number of words                         27,881
Average number of words per TV program         3,983
Number of words of the shortest TV program     2,924
Number of words of the longest TV program      4,980

5 Experiments
Two series of experiments were carried out. The first one consisted of the application of both methodologies, LSA and LDA, to the set of documents; the second one was the application of the same methodologies with the previous topic segmentation process. We have used different ROUGE [8] measures to evaluate the summaries. The ROUGE metrics include: ROUGE-n, which measures the overlap of n-grams between the system and reference summaries; ROUGE-L, based on the Longest Common Subsequence (LCS); ROUGE-W, a weighted LCS-based statistic; ROUGE-S, a skip-bigram based co-occurrence statistic; and finally ROUGE-SU, a skip-bigram plus unigram-based
co-occurrence statistic. The most widely used in the literature are ROUGE-1, ROUGE-2 and ROUGE-L.
The results of applying LDA and LSA directly to the transcriptions of the programs are shown in Tables 2 and 3, respectively. The results show that both methods behave well and that there is no relevant difference between them. This can be explained by the fact that both approaches are based on the underlying topics of the documents, although each of them has its particular way of modeling the semantics of the document. Tables 4 and 5 show the results when a previous segmentation was done; the pk value [1] of the segmentation was 0.59. It should be noted that the systems with a previous segmentation do not outperform the direct application of the proposed methodologies to the whole document. This can be explained by the fact that the topic segmentation approach is based on a decoupled architecture, which is very sensitive to errors in the first phase of the process: the errors are transmitted to the following phases, in our case the summarization.

Table 2. Evaluation using LDA.

              Recall    Precision  F1
ROUGE-1       0.57134   0.59537    0.58298
ROUGE-2       0.28718   0.29915    0.29299
ROUGE-3       0.22941   0.23884    0.23399
ROUGE-4       0.21471   0.22352    0.21899
ROUGE-L       0.53478   0.55706    0.54558
ROUGE-W-1.2   0.13903   0.27932    0.18561
ROUGE-S*      0.29909   0.32546    0.31145
ROUGE-SU*     0.29976   0.32615    0.31213
Table 3. Evaluation using LSA.

              Recall    Precision  F1
ROUGE-1       0.58019   0.60525    0.59232
ROUGE-2       0.27962   0.29257    0.28588
ROUGE-3       0.20853   0.21844    0.21333
ROUGE-4       0.18838   0.19743    0.19275
ROUGE-L       0.52826   0.55124    0.53938
ROUGE-W-1.2   0.13183   0.26544    0.17612
ROUGE-S*      0.30823   0.33603    0.32124
ROUGE-SU*     0.30890   0.33672    0.32192
Table 4. Evaluation using LDA when a previous topic segmentation is done.

              Recall    Precision  F1
ROUGE-1       0.51899   0.54040    0.52937
ROUGE-2       0.22402   0.23387    0.22879
ROUGE-3       0.16544   0.17291    0.16905
ROUGE-4       0.15117   0.15808    0.15452
ROUGE-L       0.48046   0.50050    0.49017
ROUGE-W-1.2   0.12027   0.24191    0.16062
ROUGE-S*      0.25231   0.27433    0.26264
ROUGE-SU*     0.25297   0.27501    0.26331
Table 5. Evaluation using LSA when a previous topic segmentation is done.

              Recall    Precision  F1
ROUGE-1       0.51915   0.54059    0.52954
ROUGE-2       0.22379   0.23363    0.22856
ROUGE-3       0.16549   0.17296    0.16911
ROUGE-4       0.15133   0.15825    0.15468
ROUGE-L       0.48154   0.50137    0.49114
ROUGE-W-1.2   0.11991   0.24106    0.16012
ROUGE-S*      0.25372   0.27567    0.26401
ROUGE-SU*     0.25437   0.27635    0.26468

6 Conclusions
In this paper we have presented an approach to the summarization of Spanish TV programs. It is based on unsupervised methods and is especially oriented to documents with a heterogeneous structure, that is, documents that contain many topics with very different durations. Two approaches based on underlying topic detection have been explored: the first one applies the methods directly to the document, while the second one includes a previous phase of topic segmentation. The results show that both approaches provide good results and have a similar behavior. As future work, we will try to improve the segmentation-based approach by developing mechanisms to transmit more than one segmentation hypothesis to the summarization phase; this way, the errors generated by the first phase could be recovered during the summarization process. It could also be interesting to develop another way of combining the summaries of the detected segments, instead of a straightforward concatenation.
Acknowledgments. This work has been partially supported by the Spanish MINECO and FEDER funds under project AMIC: Affective Multimedia Analytics with Inclusive and Natural Communication (TIN2017-85854-C4-2-R).
References 1. Beeferman, D., Berger, A., Lafferty, J.: Statistical models for text segmentation. Mach. Learn. 34(1), 177–210 (1999) 2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003). http://dl.acm.org/citation.cfm?id=944919.944937 3. Cheng, J., Lapata, M.: Neural summarization by extracting sentences and words. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, 7–12 August 2016, Berlin, Volume 1: Long Papers (2016) 4. Erkan, G., Radev, D.R.: Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res. 22(1), 457–479 (2004) 5. Fuentes, M., Alfonseca, E., Rodr´ıguez, H.: Support vector machines for queryfocused summarization trained and evaluated on pyramid data. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL 2007, pp. 57–60. Association for Computational Linguistics, Stroudsburg (2007). http://dl.acm.org/citation.cfm?id=1557769.1557788 6. Furui, S., Kikuchi, T., Shinnaka, Y., Hori, C.: Speech-to-text and speech-to-speech summarization of spontaneous speech. IEEE Trans. Speech Audio Process. 12(4), 401–408 (2004) 7. Gong, Y., Liu, X.: Generic text summarization using relevance measure and latent semantic analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2001, pp. 19–25. ACM, New York (2001). https://doi.org/10.1145/383952.383955 8. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: MarieFrancine Moens, S.S. (ed.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona (2004) 9. Lloret, E., Palomar, M.: Text summarisation in progress: a literature review. Artif. Intell. Rev. 37(1), 1–41 (2012) 10. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013) 11. Nallapati, R., Zhai, F., Zhou, B.: Summarunner: a recurrent neural network based sequence model for extractive summarization of documents. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, , San Francisco, 4–9 February 2017, pp. 3075–3081 (2017) 12. Ozsoy, M.G., Cicekli, I., Alpaslan, F.N.: Text summarization of turkish texts using latent semantic analysis. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, pp. 869–876. Association for Computational Linguistics, Stroudsburg (2010). http://dl.acm.org/citation.cfm? id=1873781.1873879 13. Shen, D., Sun, J.T., Li, H., Yang, Q., Chen, Z.: Document summarization using conditional random fields. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence, IJCAI 2007, pp. 2862–2867 (2007) 14. Tur, G., De Mori, R.: Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. Wiley, New York (2011)
The Prosody of Discourse Markers alors and et in French: A Corpus-Based Study on Multiple Speaking Styles
George Christodoulides
Language Sciences and Metrology Unit, Université de Mons, Place du Parc 18, 7000 Mons, Belgium
[email protected]
Abstract. In this study, we investigate the prosodic characteristics of two French discourse markers (DMs), alors and et. Our study is based on an 8-h corpus covering 8 different speaking styles, with an average of 10 speakers per communicative situation. The tokens were classified depending on whether they are being used as discourse markers (DMs) or not; additionally, in the case of et used as a conjunction, the type of the co-ordinated syntactic elements was identified. An automated prosodic analysis of all occurrences was performed. Results show that the use of et as a DM was more prevalent in non-planned speech; silent pauses preceded occurrences of alors and et, both as DMs and as non-DMs; the difference in silent pause duration between the DM uses and the non-DM uses was not statistically significant for alors and was statistically significant for et; DMs did not systematically constitute a separate prosodic unit; and a strong prosodic boundary differentiates between the use of et as a DM or as a co-ordinating conjunction between verb phrases and subordinate clauses, and its other non-DM uses.
Keywords: Prosody · Discourse markers · Corpus linguistics · French

1 Introduction
Spoken language comprehension entails multiple tasks for the listener, such as segmenting the incoming stream of speech, lexical access, syntactic parsing, integration of information into some form of cognitive representation, and understanding of discourse relations. Prosody plays an important role in all these steps, by guiding the listener's comprehension (for a review, see [1,5,7]). The relationship between prosody and information structure, whether specific prosodic structures cue specific discourse relations, and whether prosody can facilitate the processing of discourse relations are research questions whose importance is increasingly recognised. Fraser defines discourse markers as "a class of lexical expressions drawn primarily from the syntactic classes of conjunctions, adverbs, and prepositional phrases [that] with certain exceptions, signal a relationship between the
interpretation of the segment they introduce, S2, and the prior segment, S1. They have a core meaning, which is procedural, not conceptual, and their more specific interpretation is 'negotiated' by the context, both linguistic and conceptual" [8]. Discourse markers aid in the segmentation of speech (similarly to punctuation marks in written language), and Schiffrin defines them as "sequentially dependent elements which bracket units of talk" [15]. In this study, we investigate the prosodic characteristics associated with the use of two discourse markers in French: alors (then) and et (and). Are there specific prosodic features that can distinguish between the use of these words as a discourse marker, and their use as an adverb or a conjunction (respectively)? When used as a conjunction, the token et may link (co-ordinate) two segments at different syntactic levels (e.g. two noun phrases, two adjectives). When used as a discourse marker, et may convey several discourse relations; this is also the case for alors [16]. In this study, we will investigate whether there are prosodic characteristics that distinguish between these uses, on the basis of the C-PhonoGenre corpus [10], an 8-h corpus covering 8 different speaking styles.
2 Related Work
Studies have attempted to investigate the phonetic and prosodic properties of discourse markers in speech, using both experimental and corpus-based approaches. For example, [11] confirm the importance of intonation in interpreting the Swedish DM men (but/and/so), and in choosing between its sentential interpretation and its interpretation as a DM. They show that when the token men is used as a discourse marker, it has a positive f0 reset, with a mean value of 13.8 ST when preceded by a glottalisation and of 5.7 ST without glottalisation, whereas in the case of sentential tokens the mean value of the f0 reset was 2.2 ST. In English, it has been claimed that DMs constitute a separate prosodic unit surrounded by brief pauses, and that this configuration helps distinguish between DMs and other uses of the same token. However, [12] show that DMs only form a separate intonation unit when opening/closing a conversation or when marking transitions from one topic to another. [12] postulate that the intonation of DMs depends on the speaker's perception of how important a particular marker is, and that therefore the relationship between the function of a DM, its prosodic characteristics and its position in the utterance is arbitrary. It has to be noted that studies on the subject are scarce, and therefore it is not yet possible to draw clear conclusions (also given the large number of different discourse markers, and the fact that few languages have been studied). The present study should be read in conjunction with [6], a speech elicitation experiment on the use of the DMs alors and et in French. In this experiment, twenty adult native speakers of French were asked to prepare and to read aloud 64 sequences consisting of a first segment, the discourse marker alors or et, and a second segment; all first segments were extracted from a speech corpus. The sequences were constructed in order to convey one of six predefined discourse relations. The prosodic characteristics of the resulting recorded utterances were analysed, and results suggest that the silent pause duration before the
DM, as well as the absolute duration of the DM itself are used by the speaker to differentiate between the core meaning of the DM and its less predictable meanings; and that DMs did not systematically constitute a separate prosodic unit. Our study will try to re-evaluate these findings by analysing the occurrences of the tokens alors and et in a corpus that better represents natural and contextualised language use.
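For reference, the semitone (ST) values quoted above express f0 resets as logarithmic ratios between two pitch values in Hz; the conversion below is the standard one (the example frequencies are arbitrary).

```python
# Standard conversion of an f0 ratio into semitones.
import math

def f0_reset_st(f0_after: float, f0_before: float) -> float:
    return 12.0 * math.log2(f0_after / f0_before)

f0_reset_st(220.0, 200.0)   # ~1.65 ST upward reset
```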
3 Corpus and Methodology
3.1 The C-PhonoGenre Corpus
The corpus used in this study is C-PhonoGenre [10], which was compiled to study situation-dependent speaking styles in French and the associated prosodic variation. It contains data from 8 speaking styles: instructional speech [DIDA]; spontaneous narration [NARR]; speeches during "Question Time" at the French parliament [PARL]; religious sermons [RELG]; radio press reviews [RPRW]; three kinds of sports commentary [SPOR]: rugby, basketball and football; presidential New Year's wishes [WISH]; and weather forecasts [MET]. The average sample duration per speaker is 5:30 min. Each speaking situation was further described by features on four dimensions (Audience, Media, Preparation, Interactivity), each taking the value 0, 1 or 2. The corpus composition is presented in Table 1.

Table 1. Composition of the C-PhonoGenre corpus.

Genre  Sub-genre(s)                         Nb   Dur (min)  Syll     Tokens
DIDA   Radio, TV, Lecture                   17   100        26304    18717
NARR   Narration                            10    44        11396     9546
PARL   Question, Answer                     10    20         5710     3613
RELG   Mass on the Internet, Sermon on TV    7    54         8726     6141
RPRW   Radio press review                   15    95        26359    17531
SPOR   Basket, Rugby/football                5    35         7601     5305
MET    Weather forecast                     10     9         2861     1947
WISH   Pres. New Year                       15    98        18614    12578
Total                                       89   455       107571    75378
For example, Media = 1 indicates speech directed to an individual or a small group, yet in front of a microphone or camera (indirect audience), and Preparation = 1 indicates semi-prepared speech, situated between spontaneous and read speech. In the case of parliamentary debates, a question is prepared, while the answer is semi-prepared. Interactivity indicates whether the main speaker may be interrupted. The values for each dimension and each speaking style in the C-PhonoGenre corpus are also indicated in Table 1.
Annotation Methodology and Feature Extraction
The C-PhonoGenre corpus has been manually transcribed orthographically, and a phonetic transcription and segmentation were obtained using EasyAlign [9]; the alignment was manually corrected. A single annotator added speech delivery information: (i) disfluencies, articulation and phonological phenomena (schwa, vowel lengthening whether or not associated with hesitation, creaky voice, liaison and elision); (ii) symbols to distinguish between complete silence, audible and less audible breaths, and mouth noises; (iii) indications of paralinguistic phenomena (laughter, coughing) and external sounds; (iv) overlapping segments and syntactic interruptions.
The C-PhonoGenre corpus has been processed using the annotation pipeline for French in Praaline [2]. The DisMo annotator [3] was applied to the entire corpus, providing part-of-speech and disfluency annotations. Pitch stylisation was performed using Prosogram [13]. An automatic annotation of prosodic prominence and prosodic boundaries was performed using Promise [4]. Features extracted using these plug-ins are stored in an SQL database, and include durations (of pauses, segments, syllables, etc.), pitch information (e.g. intonation contour descriptors), and symbolic annotations (e.g. prominences and boundaries). The database from Praaline was linked to the R statistical software [14] for analysis.
Finally, all occurrences of the tokens alors and et were identified using Praaline's concordancer, and they were manually annotated depending on whether the token is being used as a discourse marker (cf. the definition given in the Introduction). Additionally, in the case of et used as a conjunction, we annotated the type of the co-ordinated syntactic elements as follows (Table 2).

Table 2. Annotation scheme for et when used as a conjunction and not a DM.

Code      Co-ordinated elements    Example
np_np     Noun phrase              ses idées et ses valeurs
pp_pp     Prepositional phrase     dans l'hôpital et dans la médecine
adj_adj   Adjective/Complement     fort et cohérent
vp_vp     Verb phrase              consommons et rejetons
sub_sub   Subordinate clauses      qui se diront et qui se souviendront
num       Number                   vingt et un
other     Other cases
98
4 4.1
G. Christodoulides
Results and Discussion Discourse Markers and Speaking Style
In the following we will present the main results of the analysis of the corpus. There were 1944 occurrences of et and 177 occurrences of alors in all samples. In the case of alors, it was used as a discourse marker in 138 (77.9%) of the cases; in the conjunction alors que (while) in 35 of the cases and as an adverb in 4 cases. The distribution of the different uses of et, normalised by the number of tokens, by speaking style is given in Fig. 1. Genre Total tokens Conjunction np_np pp_pp vp_vp adj_adj sub_sub locution num other Discourse Marker Total
PARL DIDA RELG MET NARR RPRW SPOR WISH Total 3613 18717 6141 1947 9546 17531 5305 12578 75378 1.63% 1.06% 1.87% 1.64% 0.68% 1.19% 0.55% 2.50% 1.35% 0.69% 0.41% 0.47% 0.72% 0.08% 0.55% 0.25% 0.87% 0.49% 0.42% 0.24% 0.47% 0.41% 0.10% 0.25% 0.09% 0.79% 0.34% 0.14% 0.10% 0.67% 0.21% 0.14% 0.15% 0.04% 0.25% 0.19% 0.11% 0.07% 0.11% 0.21% 0.04% 0.11% 0.00% 0.29% 0.12% 0.08% 0.12% 0.10% 0.00% 0.13% 0.07% 0.09% 0.20% 0.11% 0.03% 0.06% 0.03% 0.05% 0.15% 0.03% 0.00% 0.04% 0.05% 0.03% 0.05% 0.02% 0.05% 0.04% 0.02% 0.08% 0.04% 0.04% 0.14% 0.01% 0.00% 0.00% 0.00% 0.03% 0.00% 0.03% 0.02% 1.00% 1.46% 0.70% 0.67% 2.46% 0.79% 2.21% 0.53% 1.22% 2.63% 2.52% 2.57% 2.31% 3.14% 1.98% 2.75% 3.03% 2.58%
Fig. 1. Distribution of different uses of et, by speaking style (normalised by the number of tokens).
We observe that in communicative situations where we have spontaneous, non-planned speech (e.g. NARR, SPOR) the majority of the occurrences of et were discourse markers, while in the more planned speaking styles (e.g. WISH, PARL, RELG), et is used primarily as a conjunction. 4.2
Temporal and Intonational Properties
We then examined the prosodic characteristics of the different uses of alors and et in our corpus. Figure 2 shows the distribution of the length of silent pauses before DM and non-DM uses of the two tokens. We observe that DM are often preceded by silent pauses; we observe that this is also the case for occurrences of et used as a conjunction between verb phrases and subordinate clauses. Furthermore, both DM and non-DM uses of the two tokens were almost never followed by a silent pause. Articulation rate did not significantly vary depending on the DM or non-DM use of the two tokens. A pitch reset is a prosodic signal for segmentation between the end of a discourse segment and a discourse marker introducing the next discourse segment. Figure 3a shows the pitch movement between the last syllable of the segment between the token alors or et, by its use (as a discourse marker or not). We observe that DM uses of alors tend to have a flat contour, but there is no other
The Prosody of Discourse Makers alors and et in French alors
et 1.00
Silent Pause Before (s)
1.00
Silent Pause Before (s)
99
0.75
0.50
0.25
● ● ●
●
0.75
●
● ●
● ● ●
●
●
● ● ● ● ● ● ● ●
● ● ●
0.50
● ● ●
● ● ● ● ● ● ●
0.25 ●
Sub−category
r
M D
C
O N
ot he
b
nu m
su
b_
O N
C
j
p O N
su
vp
_v
ad O N
C
C
C
O N
ad
j_
_p pp
O N C
O N
N
C
on
D
−D
M
M
np
_n
p
0.00
p
0.00
CON np_np
CON adj_adj
CON sub_sub
CON other
CON pp_pp
CON vp_vp
CON num
DM
Non−DM
Fig. 2. Pause duration before the token, for DM and non-DM uses of alors (left) and et (right).
significant use of prosodic cues to differentiate between DM and non-DM uses of alors and et. With respect to the duration of the two tokens, we do not observe a significant difference between DM and non-DM uses, as can be seen on Fig. 3b. 4.3
Prosodic Prominence and Boundaries
We have also examined the percentage of prosodically prominent syllables, and syllables carrying a prosodic boundary, immediately preceding the tokens alors and et. The results for prosodic prominence are shown in Fig. 4, and for prosodic boundaries in Fig. 5. We can observe that uses of alors as a DM are preceded by a strong prosodic boundary in 54% of the occurrences, compared to 38% of alors
et
4
alors
4
0.5
et 0.5
●
3
2
1
0
0.4
3
2
1
DM T2
Non−DM T1
Non−DM T2
Category
●
●
● ● ● ●
● ●
DM T2
Non−DM T1
Non−DM T2
DM
(a) Pitch movement between the end of S1 and the DM alors (left) or et (right).
0.3
● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
0.2
0.1
0.0 DM T1
Non−DM
0.2
0.4
●
0.1
0 DM T1
0.3
● ● ● ● ● ●
Syllable duration (s)
●
Syllable duration (s)
Inter−syllabic movement (ST)
Inter−syllabic movement (ST)
● ●
0.0 DM T1
DM T2
Non−DM T1
Non−DM T2
Category
DM T1
Non−DM
DM T2
Non−DM T1
Non−DM T2
DM
(b) Duration of the DM, for disyllabic alors (left) and monosyllabic et (right).
Fig. 3. Duration and Pitch reset for DM and non-DM uses of the tokens. T1 and T2 are the first and second syllables of the target DM respectively.
100
G. Christodoulides alors
et 1.00
% prominent syllables
% prominent syllables
1.00
0.75
0.50
0.25
0.75
0.50
0.25
CON adj_adj
CON sub_sub
CON other
CON pp_pp
CON vp_vp
CON num
DM
DM
CON num
CON other
CON vp_vp
CON np_np
CON sub_sub
CON adj_adj
CON np_np
Non−DM
DM
Sub−category
CON pp_pp
0.00 0.00
Non−DM
Fig. 4. Prominent syllables (percentage) at the last syllable before the token alors (left) or et (right).
the occurrences (there is no significant difference for prominence though). We also observe that uses of et as a discourse marker are also preceded by a strong prosodic boundary in 48% of the cases. This finding would not be enough to distinguish between DM and non-DM uses of et, as a strong prosodic boundary is present in 46% of its uses as a conjunction between verb phrases and 40% of its uses as a conjunction between two subordinate clauses. alors
et 1.00
% boundary syllables
% boundary syllables
1.00
0.75
0.50
0.25
0.75
0.50
0.25
CON np_np
CON adj_adj
CON sub_sub
CON other
CON pp_pp
CON vp_vp
CON num
DM
Non−DM
Boundary
DM
CON num
CON other
CON sub_sub
CON vp_vp
CON adj_adj
Non−DM
DM
Sub−category
CON pp_pp
CON np_np
0.00 0.00
B2
Fig. 5. Prosodic boundaries (percentage) at the last syllable before the token alors (left) or et (right). B3 = major prosodic boundary and B2 = medium prosodic boundary.
5
Conclusion and Perspectives
In this study, we investigated the prosodic characteristics of alors and et, two words that are often used as discourse markers in French. We conducted a corpusbased study, based on an 8-h corpus covering 8 different speaking styles, and the results can be summarised as follows: – The use of et as a discourse marker was more prevalent in non-planned speech.
The Prosody of Discourse Makers alors and et in French
101
– Silent pauses preceded occurrences of alors and et, both as DMs and as nonDMs. The Mann-Whitney U non-parametric test shows that the difference between the preceding pause length in the DM uses vs in the non-DM uses was not statistically significant for alors and was statistically significant for et. In this respect, our corpus study only partly confirms the results of the speech elicitation experiment in [6]. – DMs did not systematically constitute a separate prosodic unit, and both DM and non-DM uses of the two tokens were almost never followed by a silent pause. However, in the case of et, a strong prosodic boundary differentiates its use as a discourse marker or as a co-ordinating conjunction between verb phrases and subordinate clauses, and its other non-DM uses. – There were no statistically significant differences in the articulation rate and in token duration, between the DM and non-DM use of alors and et. We plan to expand this study in two directions. First, an annotation of discourse relations expressed by the 138 uses of alors and the 922 uses of et as a discourse marker, in order to further investigate whether specific prosodic cues are linked to specific prosodic relations. Secondly, we plan to replicate this corpus study on a corpus with longer recordings, so that we can test the effects of individual variation (by examining more occurrences of each token produced by the same speaker). An application of the results of the present study is also envisaged. While prosodic cues seem not to be sufficient to distinguish between DM and non-DM uses of et, we would like to test whether the prosodic information identified as pertinent by the present study (i.e. preceding silent pause length and preceding prosodic boundary) can be used to improve the accuracy of statistical parsing of transcriptions. The prosody associated with the expression of discourse relations, or with the use of certain discourse markers, is highly variable. If such an association does indeed exist, for some specific discourse markers, or in some specific cases of discourse relations (e.g. for the purposes of disambiguation), studies on very large corpora will be needed before we are able to extract meaningful patterns from the data. This is because the prosody of an utterance is influenced by multiple factors, including several factors that are totally unrelated to discourse structure, and because the observed individual variation in the prosodic realisation of discourse relations is fairly high. While experimental studies may indicate relevant acoustic correlates, they are not enough and should be reviewed in light of corpus data, to avoid conclusions based on spurious correlations. More studies, on larger corpora and controlling for individual variation, are needed.
References 1. Carlson, K.: How prosody influences sentence comprehension. Lang. Linguist. Compass 3(5), 1188–1200 (2009) 2. Christodoulides, G.: Praaline: integrating tools for speech corpus research. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 26–31, Reykjavik, Iceland, pp. 31–34 (2014). http://www. praaline.org
3. Christodoulides, G., Avanzi, M., Goldman, J.P.: DisMo: a morphosyntactic, disfluency and multi-word unit annotator. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 26–31, Reykjavik, Iceland, pp. 3902–3907 (2014) 4. Christodoulides, G., Avanzi, M., Simon, A.C.: Automatic labelling of prosodic prominence, phrasing and disfluencies in French speech by simulating the perception of na¨ıve and expert listeners. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association, Interspeech 2017, 20–24 August 2017, Stockholm, pp. 3936–3940 (2017) 5. Cutler, A., Dahan, D., van Donselaar, W.: Prosody in the comprehension of spoken language: a literature review. Lang. Speech 40(2), 141–201 (1997) 6. Didirkov´ a, I., Christodoulides, G., Simon, A.C.: The prosody of discourse markers alors and et in French. A speech production study. In: Proceedings of Speech Prosody 2018, Poznan (2018) 7. F´ery, C.: Intonation and Prosodic Structure. Key Topics in Phonology. Cambridge University Press, Cambridge (2017) 8. Fraser, B.: What are discourse markers? J. Pragmat. 31, 931–952 (1999) 9. Goldman, J.P.: EasyAlign: an automatic phonetic alignment tool under Praat. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association, Interspeech 2011, 27–31 August 2011, Florence, pp. 3233–3236 (2011) 10. Goldman, J.P., Prsir, T., Christodoulides, G., Auchlin, A.: Speaking style prosodic variation: an 8-hour 9-style corpus study. In: Campbell, N., Gibbons, Hirst, D. (eds.) Proceedings of Speech Prosody 2014, pp. 105–109 (2014) 11. Horne, M., Hansson, P., Bruce, G., Frid, J., Filipsson, M.: Discourse markers and the segmentation of spontaneous speech: the case of Swedish men ‘but/and/so’. Working Papers, Lund University, Department of Linguistics, vol. 47, pp. 123–139 (1999) 12. Komar, S.: The interface between intonation and function of discourse markers in English. Engl. Lang. Overseas Perspect. Enq. (ELOPE) 4(1–2), 43 (2007). https:// doi.org/10.4312/elope.4.1-2.43-55 13. Mertens, P.: The Prosogram: semi-automatic transcription of prosody based on a tonal perception model. In: Proceedings of Speech Prosody 2004, 23–26 March 2004, Nara, pp. 549–552 (2004) 14. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2017). https://www.R-project.org/ 15. Schiffrin, D.: Discourse Markers. Cambridge University Press, Cambridge (1987) 16. Uygur-Distexhe, D.: Right peripheral discourse markers in SMS: the case of alors, donc and quoi. Papers from the Lancaster University Postgraduate Conference in Linguistics and Language Teaching (2010)
Choosing a Dialogue System's Modality in Order to Minimize User's Workload
Adam Chýlek, Luboš Šmídl, and Jakub Nedvěd
NTIS - New Technologies for Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
{chylek,smidl,nedvedj}@ntis.zcu.cz
Abstract. The communication during human-machine interaction often happens only as a secondary task that distracts the user's focus from a primary task. In our study, the primary task was driving a vehicle and the secondary task was an interaction with a dialogue system on a tablet device using touch and speech. In this paper we present the design and the analysis of a study that can be used to create an optimal strategy for a dialogue manager that takes several metrics into consideration. These include the type of the information we require from the user, the expected cognitive load on the user, the expected duration of a user's response and the expected error rate.
Keywords: Dialogue system · Choice of modality · Lane change test

1 Introduction
Multimodal dialogue systems start to play a role in cooperative robotics in industry and in interactive systems in our day-to-day lives. They also present a distraction from some tasks, such as checking your surroundings when walking, operating industrial machines or driving a car. We will focus our attention on secondary tasks that require touch or speech as their input modality. Our motivational use case is filling an electronic journey log while driving (e.g. logging the arrival at a destination or the offloading of a cargo). The electronic logging happens via a device with a touchscreen or using an automatic speech recognition system. The driver's main focus here should, of course, be on the driving, but we also want to make sure that the log is filled in a timely manner. This leaves us with the problem of correctly choosing the types of input that we want from the user and the correct modality that won't distract the user too much and that also won't cause too many problems with the actual dialogue (like error corrections, recognition timeouts, etc.). Since driving is the primary task in our use case, we have used the ISO 26022 standard [4] for the assessment of the impact of secondary tasks on a driver of a motor vehicle. This standard provides a lane change task in a simulated
environment, so we can safely test several workload-heavy tasks and later analyse the recorded data. The results of the analysis will allow us to create a situation-aware dialogue system that uses the right modality for the given situation.
2 Related Work
The lane change test (also often called the lane change task) is commonly used to assess the effect of visual-manual interaction on the primary task of driving [1,6,8]. Similarly to us, the authors of [6] evaluated several different styles of visual presentation on handheld devices, but a speech interface was not tested. In [10] a spoken interaction was compared to a visual interaction using questionnaires. The spoken interaction was preferred by the subjects and their perceived cognitive load was lower in that case. We can extend these findings by analysing in which situations the visual interaction would be beneficial, and by backing the findings with changes in performance on simulated tasks. Other researchers have focused on incremental dialogue processing [5], which allows the dialogue system to continually monitor the state of the environment and adjust the interactions with the user accordingly. The point of this study is not to decide whether the primary task is influenced by either of the modalities, as it has already been shown, e.g. in [2,3], that both modalities do in fact have an impact on driving. Our goal is to have the basis for a strategy that could minimize the impact on cognitive load individually for different types of input information and different requirements from the dialogue manager (the duration of the response, an expected error rate). Related to our concept of a dialogue is also a multimodal system that requires fusion of both modalities (as opposed to our use of only a single modality at a time). The analysis of modality choices with increasing load was done in [7]. The authors concluded that with increasing task difficulty, users started to prefer multimodal interactions over unimodal ones.
3 Experiment Design
Our experiment was designed to resemble the motivational example: a simulation of a car and a simulation of a dialogue system on a touchscreen device.
3.1 Hardware and Software Setup
The hardware part of the setup consisted of a PC, a 26″ LCD display with speakers, a gaming steering wheel with pedals and an Android tablet for the dialogue system (Fig. 1b). On the tablet, there was an offline automatic speech recognition (ASR, based on [9]) system that processed the spoken input on the device itself and an offline text-to-speech (TTS, [11]) system. The tablet presented a graphical user interface (GUI, Fig. 2) to the user for the touch interactions.
Fig. 1. Setup of the experiment: (a) software, (b) hardware.
The PC was running a simulation program called LCTSim (downloadable from https://isotc.iso.org/livelink/livelink?func=ll&objId=11560806) (Fig. 1a) that had been set up according to the ISO standard (the lane change test). The position of the vehicle on the track, as well as the steering wheel angle and speed, were recorded from the simulator at approximately 200 records per second. The events from the subsystem that handled the secondary task were recorded separately and were later merged with the simulation's log. The following types of events were used: a task was displayed, the user answered, the answer was correct or incorrect, the task timed out.
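To make the merging step concrete, the following is a minimal sketch (not the authors' code) of joining the two logs by timestamp; the record layout and field names ("t", "offset_m", "event") are hypothetical.

```python
import bisect

def merge_logs(sim_records, task_events):
    """sim_records: time-sorted simulator samples, e.g. {"t": 12.005, "offset_m": 0.31}.
    task_events: time-sorted secondary-task events, e.g. {"t": 12.4, "event": "task_shown"}.
    Attaches each event to the first simulator sample at or after its timestamp."""
    times = [r["t"] for r in sim_records]
    merged = [dict(r, event=None) for r in sim_records]
    for ev in task_events:
        i = bisect.bisect_left(times, ev["t"])
        if i < len(merged):
            merged[i]["event"] = ev["event"]
    return merged
```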
3.2 Primary Task
We have chosen a lane change test that conforms to the standard ISO 26022. This test consists of a 3 km three-lane straight road with equally spaced road signs. These road signs appear every 150 meters and indicate to which lane the participant should change. At most 18 lane changes were possible and the subject was expected to finish the scenario and the primary task before the track's end. The simulator limited the speed to 60 km/h and the participants were instructed not to slow down.
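As a quick sanity check on the timing of the primary task (our own arithmetic from the figures above, not stated in the paper):

```python
track_length_m = 3000          # 3 km straight road
sign_spacing_m = 150           # a road sign every 150 m
speed_ms = 60 / 3.6            # 60 km/h is roughly 16.7 m/s

print(track_length_m / speed_ms)   # about 180 s for one pass of the track
print(sign_spacing_m / speed_ms)   # about 9 s between consecutive signs
# so the 20 s limit per secondary task (Sect. 3.4) spans roughly two sign intervals
```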
3.3 Secondary Task
The secondary task was designed to represent an interaction with a dialogue system. It consisted of inputting several pieces of information one at a time using the available modalities. We have prepared the following templates for the GUI to test common types of input elements (Fig. 2): a short list that fits on a screen, a long list with a search field, a text input, a date input as a spinner, a time input as a spinner, a grid of images and a dialogue window with buttons. Several tasks were created based on these templates. These tasks allowed filling the information using either this GUI or an ASR and will be listed in Sect. 3.4. For each task, the user would see the objective on the screen as well as hear the same text synthesized using TTS. In order to mimic real-world conditions,
the users did not wear a microphone or headphones. The ASR used the tablet's built-in microphone and the TTS used the tablet's speaker. Another speaker was connected to the PC, and the simulation program emulated the sound of an engine. The entirety of the test (instructions, tasks, the TTS and the ASR) was in Czech.
3.4 Scenarios
The experiment was divided into 5 sessions. The participants first had to perform a training session. They drove on a track without any secondary tasks to get comfortable with the controls of the vehicle, with the appearance of the signs and with the primary task of changing lanes when instructed by the sign. They were instructed that changing the lanes as quickly and accurately as possible had the highest priority during the rest of the sessions. The subjects could start the next session at their own discretion. The order of the lane changes during the second session was different from the previous one. This session was also without any secondary task. This way we could obtain a reference drive (we recorded the participant's reactions to the signs without any workload from a secondary task). The rest of the sessions (3 to 5) continued on the same track (with the same order of lane changes) but now with secondary tasks that had the following restrictions: During the third session, the participants were forced to use only the GUI to fulfil the objective. After the last task was completed, the same track was loaded again from the start and a set of tasks for the fourth session started. This time the participant had to complete the tasks using only speech. The ASR had a constrained language model in order to recognize only the options that were presented (e.g. colours for the 1st task, numbers for the 2nd, etc.). After completing this set of tasks, the same track was loaded for the last time. The choice of the modality for the last set of tasks was up to the users. To complete the task they could use the GUI or the ASR without any restrictions. This also meant that when the ASR failed to recognize their commands they could use the GUI to complete the task and vice versa. The participant had 20 s to perform the given task. Otherwise, the next task was shown. If an incorrect input was made, the subject was notified and could try again until the task succeeded or timed out. The tasks were always shown in the same order but with different values to be filled each time during the test (to mitigate habituation). Throughout the paper, we will refer to them using their order of appearance. The tasks' objectives were as follows:
1. choose a colour from a grid
2. input a number into a text field
3. choose from two buttons
4. input a date using a text field
5. choose a picture from thumbnails in a grid
6. input a time using a text field
7. choose from a short list of items
8. choose from three buttons
9. choose from a long list of items with an active search field
10. input a date using the system's date input method (a spinner)
11. choose from a short list of items
12. input a time using the system's native time input method
These tasks were designed not only to test all the basic input types on a smart device but also to test whether the amount of information that is shown or that needs to be typed has any effect on the results. This is why a text field, an image grid, a list of items and buttons are included multiple times. Concretely, the 1st task was designed as an easier image-selection version of task 5. The text input in the 2nd task is an easier version of tasks 4 and 6. Task 3 is simpler than the 8th task. The lists of items in the 7th and the 11th task contained fewer items than in task 9. Also, the native date and time input methods (tasks 10 and 12) were supposed to be easier than typing into a text field (tasks 4 and 6).
Fig. 2. Example of different input types used for secondary task.
3.5 Participants
There were 20 participants between 21 and 62 years of age (mean age 32.7, standard deviation of 9.7 years). All participants were native Czech speakers familiar with driving a car and using a touchscreen device.
4 Results
For the purpose of our analysis, we chose as a reference a drive through the track without secondary tasks. It is also possible to create a theoretical “ideal” drive based on the position of the signs and a fixed distance needed for a lane change. The results using these references differed only slightly and after manual
comparison of the results of several sessions, we concluded that the ideal reference corresponded to reality less closely than the chosen reference drive. In the following paragraphs, we will have to distinguish two types of positions on the track. We define the position between the lanes as an offset from the centre of the middle lane ("offset" for short) and the position along the track as a "distance". Several metrics will be evaluated to measure the impact of the secondary tasks on the performance of the primary task. These metrics can later be used by a dialogue manager to create a strategy based on the expected impact. The mean of the differences between the offset of the reference drive and the drive with a secondary task (referred to simply as the "mean difference") is one of the metrics we assess. The duration of the task (until it was successfully finished or until it timed out) was chosen as another metric, and finally the error rate of the answers is the last metric. We have included the overall results regarding mean duration and mean difference for each modality in Table 1. We can see that if a simple strategy is needed, we can leave the choice of the modality to the user, as it offers the best overall performance. But this would require the dialogue that uses this strategy to have a similar composition to our scenarios. Because that would often not be the case, we will take a closer look at each individual type of task in the following sections.

Table 1. The overall statistics for each type of scenario. Mean difference from a reference pass of each participant and mean duration of a scenario (from the start until all the tasks of the scenario have been finished).

Modality             Touch    Voice    User's choice
Mean difference [m]  1.05     0.76     0.73
Mean duration [s]    132.3    127.98   123.3
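A minimal sketch of how the "mean difference" metric could be computed; the use of absolute differences and of linear interpolation onto the reference drive's distance axis is our assumption, as the paper does not spell these details out.

```python
import numpy as np

def mean_offset_difference(dist_ref, offset_ref, dist_task, offset_task):
    """Offsets in metres from the centre of the middle lane, distances in metres along the track."""
    # resample the task drive onto the distance axis of the reference drive
    offset_task_resampled = np.interp(dist_ref, dist_task, offset_task)
    return float(np.mean(np.abs(offset_task_resampled - offset_ref)))
```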
4.1 Comparing Mean Offset Differences
Although the overall results can be interesting on their own, we wanted to analyse each kind of input separately. We compared the mean difference for each given task across all the subjects. These results can be seen in Table 2. The task numbers refer to the order in which the tasks were shown to the user, as defined in Sect. 3.4. A smaller difference is better. From these results, we can see that using only touch for the interaction resulted in poor performance for tasks 3 to 12 (tasks 5 to 10 are significantly the worst with p < 0.05). This metric clearly does not favour using touch, with one exception: the 1st task. On one hand, using touch for the first task of selecting a colour was significantly better (p = 0.1) than using speech. On the other hand, choosing a more complex image from a grid (task 5) proved to be
Table 2. Mean offset from the reference drive (in meters) for each task based on modality. A standard deviation is in brackets; the best performing modality is the one with the lowest value, and ∗ marks a significant difference from the next best performing modality (p < 0.05).

Task            1      2      3      4      5      6      7      8      9      10     11     12
Voice           0.51   0.92   0.76   0.73   0.98   0.94   0.82   0.61∗  0.51   0.75   0.89   0.71
                (0.36) (0.83) (0.68) (0.50) (0.97) (0.95) (0.50) (0.52) (0.34) (0.51) (1.06) (0.63)
Touch           0.37   0.82   0.85   0.92   1.30   1.29   1.53   1.27   1.10   1.07   1.11   0.94
                (0.30) (0.57) (0.65) (0.45) (1.17) (1.00) (1.15) (1.54) (0.82) (0.90) (1.33) (0.42)
User's choice   0.40   0.71   0.73   0.90   0.77∗  0.67∗  0.89   0.78   0.61   0.59   0.95   0.82
                (0.25) (0.39) (0.67) (0.65) (0.65) (0.69) (0.99) (0.31) (0.32) (0.29) (0.95) (0.49)
more demanding. For real human-machine interaction we could argue that the use case would more often resemble the more complex fifth task than the first one. From this, we can say that forcing the user to use a touch interface does not look like a viable strategy for any of the input types. Leaving the choice of the modality up to the user proved to be marginally beneficial in 3 tasks and significantly better in 2 tasks. It was also never the worst performing setup. These input types had a common theme: short or simple methods of input. One might think that the user would willingly choose a modality that causes fewer problems during the primary task. But we can argue that some of the users must have chosen a modality that was not optimal; otherwise, the results for the user's choice of a modality would be similar to one of the forced modalities. This was clearly not the case, since the spoken input was marginally better in 4 cases and even significantly better than the rest in 1 case (most of these were input types that would require a lot of typing or visual searching). We can conclude that users may choose a modality that does not always result in the least amount of cognitive load. Comparing performance based on the amount of information presented (as discussed in Sect. 3.4) for inputs of the same type, we can see that presenting less information results in better performance. The same goes for typing, as inputs that required more typing increased the mean difference.
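The per-task significance markers are reported at p < 0.05, but the statistical test itself is not named in the paper; the following sketch assumes per-participant values and Welch's t-test purely for illustration.

```python
from statistics import mean
from scipy import stats

def significantly_better(best_vals, next_best_vals, alpha=0.05):
    """best_vals / next_best_vals: per-participant metric values for one task and modality."""
    _, p = stats.ttest_ind(best_vals, next_best_vals, equal_var=False)  # Welch's t-test
    return p < alpha and mean(best_vals) < mean(next_best_vals)
```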
4.2 Comparing Task Duration
We will now focus on another important aspect of an input in the secondary task - its duration. The results for each run are in Table 3 (shorter duration is better). Here we can see an interesting difference from the previous metric: using the touch interface is significantly faster in 5 cases, marginally in 1. These types of input where touch was faster had in common that they did not require many touch events (like typing or tapping a spinner). If the choice of a modality is left up to the user, it is with the exception of task 11 better than the worst performing modality. Using speech is significantly faster only in 1 task (filling a date into a text field), marginally in 2 tasks.
The worse performance of speech can be attributed partly to the lag of the ASR system, which has to process the input, and partly to the participants occasionally having to repeat the input several times because of ASR errors, as we will show later. We can again compare the tasks with elements of the same type that contain less information versus the ones with more information (e.g. the short list in task 7 versus the long list in task 9). The tasks with less information are completed faster when using touch. With speech, these differences are less pronounced.

Table 3. Mean duration (in seconds) of each task based on modality. A standard deviation is in brackets; the best performing modality is the one with the lowest value, and ∗ marks a significant difference from the next best performing modality (p < 0.05).

Task            1     2     3     4     5     6     7     8     9     10    11    12
Voice           7.2   9.8   8.3   12.2  8.9   10.0  7.6   8.45  8.6   13.3  7.8   13.0
                (3.4) (3.3) (2.7) (3.2) (3.0) (2.9) (0.8) (0.4) (0.8) (7.4) (0.3) (5.4)
Touch           3.0   8.4   4.0∗  16.0  7.2∗  13.3  6.1∗  4.7∗  10.6  13.8  5.1∗  18.5
                (0.7) (4.1) (1.1) (3.9) (4.8) (8.6) (1.9) (1.8) (3.6) (7.4) (1.6) (5.9)
User's choice   3.3   6.4   7.3   11.5  8.3   11.0  7.2   7.8   9.2   12.7  8.6   14.9
                (1.1) (1.3) (1.0) (2.0) (1.2) (2.6) (0.9) (1.0) (2.5) (4.6) (2.8) (6.4)
Table 4. Which modalities did the subject choose and what error rate the modality caused.

Task                  1     2     3     4     5     6     7     8     9     10    11    12
Voice input [%]       25    95    70    95    100   100   65    55    95    85    75    90
Touch input [%]       75    5     30    0     0     0     35    45    0     15    20    5
Input timed out [%]   0     0     0     5     0     0     0     0     5     0     5     5
Voice error rate [%]  64.3  13.6  12.5  17.4  4.8   20.0  7.1   21.4  26.9  43.3  28.6  55.0
Touch error rate [%]  16.7  0     0     0     0     0     0     0     0     0     0     0

4.3 Comparing Modality Choices and Error Rates
During the last session, the user was free to choose the modality. In this section, we will analyse which modality the subject preferred for which task. The detailed results are in Table 4. We can clearly see that using speech as the input method was preferred in most of the tasks, with the first task being the only exception. Interestingly, this theoretically very simple task of choosing a colour resulted in the highest error rate in both modalities. In contrast to this, the similar 5th task (choosing an image) had the lowest error rate. Reasons for this phenomenon could not be found. From the data, it is clear that touch input, although less prone to errors,
is not preferred by the users and they are willing to try and repeat the input several times using speech. This knowledge is important for a dialogue strategy where we expect the recovery from recognition errors to be difficult. Forcing the use of a touch interface instead of speech in these situations will result in lower error rates.

Table 5. Error rates when the user was forced to use one of the modalities.

Task                  1      2      3      4      5      6      7     8     9     10     11    12
Voice error rate [%]  25.93  40.00  20.83  25.00  16.67  9.09   4.76  0     4.76  40.00  4.76  35.48
Touch error rate [%]  0      9.52   4.76   14.29  0      16.67  9.09  0     0     17.39  0     10.53
Voice timed out [%]   0      10     5      10     0      0      0     0     0     5      0     0
Touch timed out [%]   0      5      0      70     0      25     0     0     10    5      0     15
The last metric we analysed was the error rate of the forced modalities. The detailed results are included in Table 5. The voice input was expected to produce errors under the conditions of the test. The 4th task (typing a date) involved a lot of interaction with a virtual keyboard and most of the users were unable to finish the task in time. From the perspective of a dialogue strategy, these data provide valuable insight into the expected error rate of a touch interface. Whenever the user is forced to use a keyboard we should expect increased error rates or longer response times. Choosing from a grid of images or buttons should be preferred.
5 Conclusion
The acquired data and the presented analysis allow us to create a strategy for a dialogue manager that either forces the user to use a certain modality or gives the user a free choice of modality. Such a strategy can be based on several factors that can be used to infer the expected impact on the primary task. For our purposes, this impact was measured as the mean offset from a reference drive without any secondary task, the error rate on the secondary task and the time needed to accomplish a task. The factors that the manager may take into account are the type of the input (e.g. a choice from a list, a date), the amount of presented data (e.g. a choice from two versus twenty images), the requirements on an expected error rate or a limit on the expected duration of the input. The strategy does not have to be based solely on the results of this study; for example, it can be further improved on the fly based on the interaction with the user. If a simple strategy is required, the best overall performance was achieved when the user had a choice of modality. In the near future, a dialogue manager that uses the data from the experiment as the basis of its strategy will be created and evaluated. This will also allow us to analyse whether the knowledge acquired using driving as a primary task is transferable to other primary tasks (e.g. operating a robotic hand).
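To illustrate the kind of strategy meant here, a small sketch follows; the mapping and numbers are only one possible reading of Tables 3 and 5 (forced-modality durations and error rates), not the authors' implementation, and the input-type names are hypothetical.

```python
# expected (duration [s], error rate) per input type and modality,
# taken from Tables 3 and 5 above for illustration only
EXPECTED = {
    "image_grid": {"touch": (3.0, 0.00), "voice": (7.2, 0.26)},
    "buttons":    {"touch": (4.0, 0.05), "voice": (8.3, 0.21)},
    "short_list": {"touch": (6.1, 0.09), "voice": (7.6, 0.05)},
    "long_list":  {"touch": (10.6, 0.00), "voice": (8.6, 0.05)},
    "date_text":  {"touch": (16.0, 0.14), "voice": (12.2, 0.25)},
}

def choose_modality(input_type, max_duration=None, max_error_rate=None):
    options = EXPECTED.get(input_type)
    if options is None:
        return "user_choice"                       # fall back to letting the user decide
    feasible = {m: (d, e) for m, (d, e) in options.items()
                if (max_duration is None or d <= max_duration)
                and (max_error_rate is None or e <= max_error_rate)}
    if not feasible:
        return "user_choice"
    return min(feasible, key=lambda m: feasible[m][0])   # fastest feasible modality
```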
Acknowledgments. This work was supported by the European Regional Development Fund under the project Robotics for Industry 4.0 (reg. no. CZ.02.1.01/0.0/0.0/ 15 003/0000470).
References 1. Benedetto, S., Pedrotti, M., Minin, L., Baccino, T., Re, A., Montanari, R.: Driver workload and eye blink duration. Transp. Res. Part F Traffic Psychol. Behav. 14(3), 199–208 (2011). https://doi.org/10.1016/j.trf.2010.12.001 2. Curin, J., et al.: Dictating and editing short texts while driving. In: Proceedings of the 3rd International Conference on Automotive User Interfaces and Interactive Vehicular Applications - AutomotiveUI 2011 p. 13 (2011). http://dl.acm.org/ citation.cfm?doid=2381416.2381418 3. He, J., et al.: Texting while driving: is speech-based text entry less risky than handheld text entry? Accid. Anal. Prev. 72, 287–295 (2014). https://doi.org/10. 1016/j.aap.2014.07.014 4. Road vehicles – Ergonomic aspects of transport information and control systems – Simulated lane change test to assess in-vehicle secondary task demand. Standard, International Organization for Standardization, Geneva, CH, September 2010 5. Kousidis, S., Kennington, C., Baumann, T., Buschmeier, H., Kopp, S., Schlangen, D.: A multimodal in-car dialogue system that tracks the driver’s attention. In: Proceedings of the 16th International Conference on Multimodal Interaction - ICMI 2014, pp. 26–33 (2014). http://dl.acm.org/citation.cfm?doid=2663204.2663244 6. Louveton, N., McCall, R., Koenig, V., Avanesov, T., Engel, T.: Driving while using a smartphone-based mobility application: evaluating the impact of three multi-choice user interfaces on visual-manual distraction. Appl. Ergon. 54, 196– 204 (2016). https://doi.org/10.1016/j.apergo.2015.11.012 7. Oviatt, S., Coulston, R., Lunsford, R.: When do we interact multimodally? Cognitive load and multimodal communication patterns. In: International Conference on Multimodal Interfaces, pp. 129–136 (2004). http://dl.acm.org/citation.cfm? id=1027957 8. Pitts, M.J., Skrypchuk, L., Wellings, T., Attridge, A., Williams, M.A.: Evaluating user response to in-car haptic feedback touchscreens using the lane change test. Adv. Hum. Comput. Interact. 2012 (2012). https://doi.org/10.1155/2012/598739 9. Praˇza ´k, A., Psutka, J.V., Hoidekr, J., Kanis, J., M¨ uller, L., Psutka, J.: Automatic online subtitling of the Czech parliament meetings. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 501–508. Springer, Heidelberg (2006). https://doi.org/10.1007/11846406 63 10. Silvervarg, A., et al.: Perceived usability and cognitive demand of secondary tasks in spoken versus visual-manual automotive interaction. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1171–1175 (2016). https://doi.org/10.21437/Interspeech. 2016-99 11. Tihelka, D., Stanislav, P.: ARTIC for assistive technologies: transformation to resource-limited hardware. In: WCECS 2011, vol. I, pp. 581–584 (2011)
A Free Synthetic Corpus for Speaker Diarization Research
Erik Edwards1(B), Michael Brenndoerfer2, Amanda Robinson1, Najmeh Sadoughi1, Greg P. Finley1, Maxim Korenevsky1, Nico Axtmann3, Mark Miller1, and David Suendermann-Oeft1
1 EMR.AI Inc., San Francisco, CA, USA
[email protected]
2 University of California Berkeley, Berkeley, CA, USA
3 DHBW, Karlsruhe, Germany
Abstract. A synthetic corpus of dialogs was constructed from the LibriSpeech corpus, and is made freely available for diarization research. It includes over 90 h of training data, and over 9 h each of development and test data. Both 2-person and 3-person dialogs, with and without overlap, are included. Timing information is provided in several formats, and includes not only speaker segmentations, but also phoneme segmentations. As such, it is a useful starting point for general, particularly early-stage, diarization system development.
Keywords: Speaker diarization · Speech activity detection · Open-source corpora
1 Introduction
1.1 Background and Motivation
Speaker diarization is the task of segmenting an audio file with multiple speakers into speaker turns, also known as "speaker indexing" or the "who spoke when" question. This task was first considered for air-traffic control recordings [13,30,34,38], and has since been applied to a variety of applications [1,2,25], most often to 2-person telephone conversations [8,24,36], broadcast radio and television [12,33], and many-person (e.g. 4–10+) meetings [4,43]. Our own application is doctor-patient dialogs [9], usually consisting of 2 speakers, but occasionally 3 speakers, and only very rarely 4+ speakers. We were not able to identify a suitable training corpus for diarization system development, which is understandable given that medical dialogs contain sensitive personal information. A recently-released diarization challenge set (for the "DIHARD" challenge) included some clinical interviews with doctors and autistic children, but it was required to delete the data following the challenge. Also, the speech of children may not be considered to be a typical case study for general system development. Other data sets are proprietary and seem particular to a given recording channel
and/or background noise condition (e.g. air-traffic control). These do not seem ideal for our application or for general system development, where one might prefer to obtain clean speech and then corrupt it with background noise suitable to the application [21,46]. We decided therefore to make our own synthetic corpus of dialogs, which we make freely available for general use, particularly for early-stage and general diarization system development. Of course, this is not intended to replace real-world data, and each applied worker must also obtain data from their own domain. The earliest approaches to diarization used a “bottom-up” approach of clustering feature vectors by similarity [13,30]. These are also called “unsupervised” in the sense that they require no labeled training data [34]. Although these approaches have remained heavily used in the literature [2,4], later systems began to introduce “top-down” or “supervised” approaches [12,38,43]. These require a fair amount of labeled training data in addition to test data. In fact, the first such top-down study [38] was also the first to introduce synthetic dialog data for training purposes. Recent diarization approaches utilize neural networks [18,20,23,41,45], and these can likewise require a large amount of training data. However, the manual segmentation of dialog data is remarkably difficult and time-consuming (as we have attempted ourselves), and therefore prohibitive for most groups undertaking to get started with system development. Moreover, to avoid over-tuning to the test set during system development and architecture search, it is strongly preferable to have separate development and test data sets. A final motivation for our synthetic corpus is that we desired to study the issue of “phoneme specificity” or “phone adaptive training” in speaker diarization [5,7,17,31,35,42,44,47]. This refers to the fact that phoneme acoustic differences confound the detection of speaker acoustic differences. That is, for example, the fricatives of two speakers may be more similar than the fricatives and vowels of the same speaker. In order to address this issue, one generally requires a corpus wherein the phone identities and segmentations are available. We introduce such a corpus here, by using methodology from automatic speech recognition (ASR) to obtain forced alignments of phoneme labels. 1.2
Brief Review of Diarization Data Sets
The first diarization data studied was air-traffic control recordings [13,30,34,38], and an early study of a 5-person meeting quickly followed [43]. The 1997 DARPA Speech Recognition Workshop introduced the ARPA Hub4 task, to transcribe radio and television broadcasts [12,33]. This was the first in a series of diarization and related tasks from ARPA (Advanced Research Project Agency) and NIST (National Institute of Standards and Technology), and over 100 publications have been dedicated to the diarization of such broadcasts. We have not been able to locate the past NIST data sets, and recent ones appear to be accessible only with an LDC (Linguistic Data Consortium) account. Also, they can contain music or other background noises, and they do not generally include a large training set or phonemic information. The second major domain of diarization
research (also over 100 publications) has been multi-person meetings, particularly following the introduction of widely-used corpora of meeting data, namely the ISL Meeting Corpus [6], the ICSI Meeting corpus [19], the AMI corpus [15], and various meeting data sets from NIST, e.g. [11]. Although these are excellent for their domain of application, they involve many speakers (at least 3 speakers, and 4+ speakers in the great majority) and again a particular audiochannel/background-noise scenario. This may not be suitable for early-stage or general diarization system development, or for research focused on 2–3 speakers. Of these, only the AMI corpus (involving 4+ speakers of British or European English) is freely available with a liberal usage license. The third major domain of diarization research has concerned 2-person telephone conversations, of which the stand-out data set has been the Switchboard corpus [14]. This is by far the closest data set to our intended application, but it also has a few drawbacks: It is only available via LDC account, it is sampled at 8 kHz, it seems particular to the given audio channel, and exact overlapped-speech information may not be obtainable. Therefore, it was deemed that, for general, open-source use, particularly outside of the three major application domains, a free synthetic diarization corpus would be necessary, and likely useful to others as well. We therefore focused on finding previous synthetic diarization corpora. As mentioned above, the first to introduce synthetic dialog data [38] was also the first top-down study, where availability of training data becomes critical. Another early top-down study [39] likewise used a simulated dialog corpus, for which they cited a CD-ROM. Neither of these early synthetic corpora are currently available to our knowledge. Almost no mention of synthetic data was made in the years following the 1997 NIST set. We find exactly 2 artificial conversations made from TIMIT data [8,22,40], a small synthetic test set from TIMIT data [10], and one large synthetic set made from TIMIT [26]. The later was only described in a few sentences, but appears quite similar in motivation to ours (e.g., conversations of 2–6 speakers). Unfortunately, none of these TIMIT-based sets are available to our knowledge. A set of synthetic Spanish conversations was found [3], but we do not consider non-English sets here. Therefore, we have developed our own synthetic corpus as a basic starting point for diarization research, derived from the freely available and open-source LibriSpeech corpus [28]. This synthetic diarization corpus is freely available for download at: https://github.com/EMRAI/emrai-synthetic-diarization-corpus.
2 Synthetic Diarization Corpus
The LibriSpeech corpus consists of sections of English audio books recorded at 16 kHz sample rate [28], usually with clear articulation and high-quality audio. It was expected therefore that forced alignment could produce highly accurate (albeit not perfect) phonemic segmentations. The open-source and widely-used Kaldi speech recognition toolkit [29] includes a recipe for ASR training and alignment of the LibriSpeech corpus. The use of this ASR set is also advantageous
because some analyses from the ASR pipeline can be used in diarization. For example, if a universal background model (UBM) or i-vector extractor is trained on the LibriSpeech ASR corpus, it could be used on the synthetic diarization data as well. In brief outline, we have constructed our synthetic corpus as follows (further details will be available from the download page of the corpus). For training data, we use the "train clean 100" subset of the LibriSpeech corpus with 100.6 h of audio. This consists of 585 chapters read by 251 unique speakers (126 male, 125 female), where each chapter has up to 129 utterances. We ranked chapters according to number of utterances, and discarded chapters with fewer than 4 utterances. Alternating chapters in this ranked list were combined into 2-speaker dialogs, with care not to combine the same speaker into a single dialog. The utterances from the 2 speakers were simply alternated until one of the 2 speakers had no further utterances. This resulted in dialogs with 13–259 utterances (median 84). Speakers were combined without respect to gender, resulting in 73 female-female, 65 male-male, and 154 female-male dialogs (292 dialogs total). Dialogs ranged in duration from 2.7–49.6 min (median 17.5 min), yielding 98.15 h in the total training corpus. The LibriSpeech "dev clean", "dev other", "test clean", and "test other" sets were likewise prepared for diarization development and test sets (Table 1).

Table 1. Synthetic 2-person corpus with no overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       292      28522         989715  98.15
Dev-clean   48       2673          53765   4.98
Dev-other   45       2822          50227   4.69
Test-clean  43       2605          52279   5.07
Test-other  45       2861          51305   4.85
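A minimal sketch of the pairing procedure just described (the released corpus was built with the project's own scripts; this is only an illustration, and the handling of same-speaker pairs is simplified):

```python
def make_2speaker_dialogs(chapters):
    """chapters: list of (speaker_id, [utterance_ids]); returns a list of turn sequences."""
    usable = [c for c in chapters if len(c[1]) >= 4]          # discard chapters with < 4 utterances
    usable.sort(key=lambda c: len(c[1]), reverse=True)        # rank by number of utterances
    dialogs = []
    for (spk_a, utts_a), (spk_b, utts_b) in zip(usable[0::2], usable[1::2]):
        if spk_a == spk_b:                                    # never pair a speaker with itself
            continue
        turns = []
        for u_a, u_b in zip(utts_a, utts_b):                  # alternate until one side runs out
            turns.extend([(spk_a, u_a), (spk_b, u_b)])
        dialogs.append(turns)
    return dialogs
```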
Inspired by published statistics of natural conversations [16,37], a small random gap was inserted between speaker turns, as sampled from a Rayleigh distribution with scale parameter (mode) of 200 ms. The longest random draws (i.e. from the long tail of the Rayleigh distribution) were discarded, given that gaps in natural conversations are bounded to some finite value. The actually-used samples ranged from 2 to 819 ms with a mean gap of 240 ms. In each original audio file, the leading/trailing silences were tapered linearly to 0 at start/end, such that no audible transient occurs between speaker turns (i.e. the silent portions transition smoothly into each other). Successive wav files were linearly added into the dialog waveform, with the appropriate offsets, and checked so that no sample accidentally exceeded a range of ±1. Timing information is provided in 3 formats: (1) the Kaldi .ctm format; (2) the NIST .rttm format [27], as required by the widely-used md-eval-v21.pl script
for computing the diarization error rate (DER) [1]; and (3) a simple frame-byframe list of integer labels. In the later, 0 indicates silence, 1 indicates speaker 1, and 2 indicates speaker 2, etc. Integers greater than 10 indicate overlap. In case the direction of overlap is important, these are coded such that “12” means overlap as speaker 1 transitions to speaker 2, and “21” means overlap as speaker 2 transitions to speaker 1. But if the user is only interested in “overlap”, then all integers greater than 10 can be collapsed into one category. For the NIST .rttm format, we provide two versions. In the first, only speaker turns are indicated (with labels 1, 2, etc.), and where all within-speaker gaps of less than 200 ms are ignored, i.e. labeled as speech. This appears to be the most widely used threshold currently, whereas a previous standard used a threshold of 300 ms [27]. In the second set of .rttm files provided, all silences, including gaps less than 200 ms, are explicitly included (with label 0). From these, users could make other thresholds of within-speaker gaps to ignore. The dialog .ctm files include the timing information for individual phonemes, as obtained by forced alignment (from the tri4b stage of the Kaldi recipe for the LibriSpeech ASR corpus, using the Kaldi “ali2phones” utility [29]). These .ctm files from the original forced alignments were simply mapped to the new timeline of the dialog. We followed the provided standard recipe for the ASR pipeline, except that we used our own lexicon, for reasons that will be presented in a separate contribution. In brief, we have been studying a syllabic approach to ASR, and have developed a lexicon with syllabic phonology for these purposes. This has resulted in ∼20% relative improvement in WER, and so this was preferred for forced alignments as well. Moreover, we sought to investigate the use of syllabic structure in diarization (see companion paper), which requires syllabic information from the alignments. Our expanded phone set can be mapped back to the usual ARPAbet phones [32], if desired. Since forced alignment does not work for out-of-vocabulary words, we manually added all such words to our lexicon. This is one of the reasons that we use only the 100-h “train clean” subset of the full LibriSpeech training data. A second version of the corpus incorporates speaker overlap. Because some users may want to compare diarization with and without overlap (but otherwise identical), we used the exact same utterances and alignments as above, with only one difference – in the overlap version we subtract 200 ms from each betweenspeaker interval. This shifts the mode of the ∼Rayleigh distribution to 0 ms, with a range of −198 to 619 ms (mean 40 ms). This is a fairly realistic range of overlap for natural English conversations [16,37], and therefore barely noticeable to the human ear. Note, however, that real-world conversations also include another type of overlap, where one speaker makes a brief utterance or non-speech sound in the middle of the other speaker’s turn (sometimes called “back-channel” speech). We have no statistics for such events, and it is not possible to imitate these easily with just the LibriSpeech data, so no such “back-channel” speech was included in the synthetic corpus (Table 2). Next, a 3-person synthetic dialog corpus was constructed by the same methods as above. However, we do not want all dialogs to have ∼33% representation
Table 2. Synthetic 2-person corpus with overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       292      28522         989715  96.58
Dev-clean   48       2673          53765   4.83
Dev-other   45       2822          50227   4.54
Test-clean  43       2605          52279   4.93
Test-other  45       2861          51305   4.69
of each of the 3 speakers. Although we do not know of any published statistics, it is certainly not the case that all real-world 3-person dialogs have equal time allocated to the 3 speakers. Also, the 3 speakers should not alternate in a simple sequence of 1, 2, 3, 1, 2, 3, etc. As a simple first solution, the sequence was assigned as follows: the first speaker is speaker 1 by definition, and then each subsequent speaker is chosen randomly from the other 2 speakers, until one speaker runs out of available utterances. In this manner, each dialog ends up with a unique sequence of speaker turns, and unique proportions of representation across the 3 speakers. Single speakers took a range of 17.7–44.4% of the dialog turns (mean 33.3%). This method does, however, lose some utterances in each dialog, so the total hours in the corpus is less than for the 2-speaker corpus (Table 3). Dialogs included between 17 and 366 utterances (median 118), and ranged in duration from 2.8–71.5 min (median 24.4 min). Across all 3-speaker dialogs, 22% were same-gender (m-m-m or f-f-f) and 78% were mixed-gender.

Table 3. Synthetic 3-person corpus without overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       195      26694         928346  92.11
Dev-clean   32       2430          48899   4.53
Dev-other   30       2560          45664   4.26
Test-clean  29       2406          47639   4.61
Test-other  30       2684          48025   4.53
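A sketch of the turn-order sampling described above (illustrative only; the seed handling and data layout are ours):

```python
import random

def three_speaker_turn_order(n_utts, seed=0):
    """n_utts: available utterances per speaker, e.g. {1: 40, 2: 35, 3: 50}."""
    rng = random.Random(seed)
    remaining = dict(n_utts)
    order, current = [], 1                     # the first speaker is speaker 1 by definition
    while remaining[current] > 0:
        order.append(current)
        remaining[current] -= 1
        current = rng.choice([s for s in remaining if s != current])  # one of the other two
    return order
```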
The inter-speaker intervals were again chosen randomly according to a Rayleigh distribution with mode of 200 ms (as above), and the actual samples ranged from 1 to 803 ms (mean 242 ms). To create the corresponding 3-person corpus with overlap (Table 4), the identical sequences and values were used, except with 200 ms subtracted from the inter-speaker intervals. This yielded intervals of −199 to 603 ms (mean 42 ms).
Table 4. Synthetic 3-person corpus with overlap.

            Dialogs  Utts (Turns)  Tokens  Hours
Train       195      26694         928346  90.64
Dev-clean   32       2430          48899   4.40
Dev-other   30       2560          45664   4.12
Test-clean  29       2406          47639   4.47
Test-other  30       2684          48025   4.38

3 Discussion and Conclusion
A synthetic corpus of dialogs was made from the open-source LibriSpeech corpus and released for download: https://github.com/EMRAI/emrai-synthetic-diarization-corpus. The corpus includes timing information in several formats, and includes phoneme as well as speaker segmentations. Both 2-speaker and 3-speaker corpora, with and without overlap, are provided. In the future, we will likely add a 4-speaker corpus. Note that dialogs with different numbers of speakers can be combined by a user to obtain a data set where the number of speakers is not fixed. As a synthetic corpus, there are several deviations from real-world data. First, there is very little background noise (but users could add their own for a better approximation to real conditions [21,46]). Second, conversational statistics were approximately mimicked, but cannot be considered perfectly realistic. Third, we included no intervals of truly multi-speaker speech, i.e., “back-channel” utterances by one speaker that occur fully within the turn of another speaker. Fourth, the LibriSpeech corpus itself consists of high-quality readings of audio books, which has certain advantages (such as high-quality phonetic alignments), but also makes the speech unrealistic to most real-world applications. Fifth, although our corpus is gender-balanced, we include no child or other special categories of speech. Finally, we only include 2-speaker and 3-speaker dialogs (and 4-speaker dialogs will be included in a future release). Thus, we explicitly do NOT suggest that the synthetic corpus replaces the need for real-world data; applied workers must also obtain data for each particular application. Nonetheless, we believe that our general-purpose corpus serves as a useful starting point for diarization research, particularly in the early stages of system development, where a very challenging corpus peculiar to one recording situation is often less desirable. We advise the beginning researcher to attempt first the 2-speaker corpus without overlap, and then move on to consider overlap and more speakers, along with real-world data. It is, however, possible that training on this corpus can produce models that generalize to real-world situations (as in our companion paper).
References 1. Anguera Mir´ o, X.: Robust speaker diarization for meetings. Ph.D. thesis, Univ. Polit`ecnica de Catalunya (2006) 2. Anguera Mir´ o, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process. 20(2), 356–370 (2012) 3. Anguera Mir´ o, X., Hernando Peric´ as, F.: Evolutive speaker segmentation using a repository system. In: Proceedings of ICSLP, pp. 605–608. ISCA (2004) 4. Anguera, X., Wooters, C., Peskin, B., Aguil´ o, M.: Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 402–414. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482 34 5. Bozonnet, S., Vipperla, R., Evans, N.: Phone adaptive training for speaker diarization. In: Proceedings of INTERSPEECH, pp. 494–497. ISCA (2012) 6. Burger, S., MacLaren, V., Yu, H.: The ISL meeting corpus: the impact of meeting type on speech style. In: Proceedings of ICSLP, pp. 301–304. ISCA (2002) 7. Chen, I.F., Cheng, S.S., Wang, H.M.: Phonetic subspace mixture model for speaker diarization. In: Proceedings of INTERSPEECH, pp. 2298–2301. ISCA (2010) 8. Delacourt, P., Kryze, D., Wellekens, C.: Speaker-based segmentation for audio data indexing. In: Proceedings of ESCA Tutorial and Research Workshop, pp. 78–83. ISCA (1999) 9. Finley, G., et al.: An automated medical scribe for documenting clinical encounters. In: Proceedings of NAACL. ACL (2018) 10. Gangadharaiah, R., Narayanaswamy, B.: A novel method for two-speaker segmentation. In: Proceedings of ICSLP, pp. 2337–2340. ISCA (2004) 11. Garofolo, J., Laprun, C., Michel, M., Stanford, V., Tabassi, E.: The NIST meeting room pilot corpus. In: Proceedings of LREC, p. 4. ELRA (2004) 12. Gauvain, J.L., Adda, G., Lamel, L., Adda-Decker, M.: Transcribing broadcast news: the LIMSI Nov96 Hub4 system. In: Proceedings of DARPA Speech Recognition Workshop, pp. 56–63. DARPA (1997) 13. Gish, H., Siu, M.H., Rohlicek, J.: Segregation of speakers for speech recognition and speaker identification. In: Proceedings of ICASSP, vol. 2, pp. 873–876. IEEE (1991) 14. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: telephone speech corpus for research and development. In: Proceedings of ICASSP, vol. 1, pp. 517–520. IEEE (1992) 15. Hain, T., et al.: The development of the AMI system for the transcription of speech in meetings. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 344–356. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482 30 16. Heldner, M., Edlund, J.: Pauses, gaps and overlaps in conversations. J. Phon. 38(4), 555–568 (2010) 17. Hsieh, C.H., Wu, C.H., Shen, H.P.: Adaptive decision tree-based phone cluster models for speaker clustering. In: Proceedings of INTERSPEECH, pp. 861–864. ISCA (2008) 18. Ikbal, S., Visweswariah, K.: Learning essential speaker sub-space using heteroassociative neural networks for speaker clustering. In: Proceedings of INTERSPEECH, pp. 28–31. ISCA (2008) 19. Janin, A., et al.: The ICSI meeting corpus. In: Proceedings of ICASSP, vol. 1, pp. 364–367. IEEE (2003)
20. Jothilakshmi, S., Ramalingam, V., Palanivel, S.: Speaker diarization using autoassociative neural networks. Eng. Appl. Artif. Intell. 22(4–5), 667–675 (2009) 21. Kim, K., Kim, M.: Robust speaker recognition against background noise in an enhanced multi-condition domain. IEEE Trans. Consum. Electron. 56(3), 1684– 1688 (2010) 22. Liu, C., Yan, Y.: Speaker change detection using minimum message length criterion. In: Proceedings of ICSLP, pp. 514–517. ISCA (2000) 23. Meinedo, H., Neto, J.: A stream-based audio segmentation, classification and clustering pre-processing system for broadcast news using ANN models. In: Proceedings of INTERSPEECH, pp. 237–240. ISCA (2005) 24. Metzger, Y.: Blind segmentation of a multi-speaker conversation using two different sets of features. In: Proceedings of Odyssey Workshop, pp. 157–162. ISCA (2001) 25. Moattar, M., Homayounpour, M.: A review on speaker diarization systems and approaches. Speech Commun. 54(10), 1065–1103 (2012) 26. Mohammadi, S., Sameti, H., Langarani, M., Tavanaei, A.: KNNDIST: a nonparametric distance measure for speaker segmentation. In: Proceedings of INTERSPEECH, pp. 2282–2285. ISCA (2012) 27. NIST: Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation plan. Report RT-06S, National Institute of Standards and Technology, Spring 2006 28. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: Proceedings of ICASSP, pp. 5206–5210. IEEE (2015) 29. Povey, D., et al.: The Kaldi speech recognition toolkit. In: Proceedings of Workshop ASRU, Waikoloa Village, HI, p. 4. IEEE (2011) 30. Rohlicek, J., et al.: Gisting conversational speech. In: Proceedings of ICASSP, vol. 2, pp. 113–116. IEEE (1992) 31. Schindler, C., Draxler, C.: Using spectral moments as a speaker specific feature in nasals and fricatives. In: Proceedings of INTERSPEECH, pp. 2793–2796. ISCA (2013) 32. Shoup, J.: Phonological aspects of speech recognition. In: Lea, W. (ed.) Trends in Speech Recognition, pp. 125–138. Prentice-Hall, Englewood Cliffs (1980) 33. Siegler, M., Jain, U., Raj, B., Stern, R.: Automatic segmentation, classification and clustering of broadcast news audio. In: Proceedings of DARPA Speech Recognition Workshop, pp. 97–99. DARPA (1997) 34. Siu, M.H., Yu, G., Gish, H.: An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers. In: Proceedings of ICASSP, vol. 2, pp. 189–192. IEEE (1992) 35. Soldi, G., Bozonnet, S., Alegre, F., Beaugeant, C., Evans, N.: Short-duration speaker modelling with phone adaptive training. In: Proceedings of Odyssey Workshop, pp. 208–215. ISCA (2014) 36. S¨ onmez, M., Heck, L., Weintraub, M.: Speaker tracking and detection with multiple speakers. In: Proceedings of EUROSPEECH, pp. 2219–2222. ISCA (1999) 37. Stivers, T., et al.: Universals and cultural variation in turn-taking in conversation. Proc. Natl. Acad. Sci U.S.A. 106(26), 10587–10592 (2009) 38. Sugiyama, M., Murakami, J., Watanabe, H.: Speech segmentation and clustering based on speaker features. In: Proceedings of ICASSP, vol. 2, pp. 395–398. IEEE (1993) 39. Takagi, K., Itahashi, S.: Segmentation of spoken dialogue by interjections, disfluent utterances and pauses. In: Proceedings of ICSLP, pp. 697–700. ISCA (1996) 40. Valente, F., Wellekens, C.: Scoring unknown speaker clustering: VB vs. BIC. In: Proceedings of ICSLP, pp. 593–596. ISCA (2004)
41. Vi˜ nals, I., Villalba, J., Ortega, A., Miguel, A., Lleida, E.: Bottleneck based frontend for diarization systems. In: Abad, A., et al. (eds.) IberSPEECH 2016. LNCS (LNAI), vol. 10077, pp. 276–286. Springer, Cham (2016). https://doi.org/10.1007/ 978-3-319-49169-1 27 42. Wang, G., Wu, X., Zheng, T.: Using phoneme recognition and text-dependent speaker verification to improve speaker segmentation for Chinese speech. In: Proceedings of INTERSPEECH, pp. 1457–1460. ISCA (2010) 43. Wilcox, L., Chen, F., Kimber, D., Balasubramanian, V.: Segmentation of speech using speaker identification. In: Proceedings of ICASSP, vol. 1, pp. 161–164. IEEE (1994) 44. Yella, S., Motl´ıcek, P., Bourlard, H.: Phoneme background model for information bottleneck based speaker diarization. In: Proceedings of INTERSPEECH, pp. 597– 601. ISCA (2014) 45. Yella, S., Stolcke, A., Slaney, M.: Artificial neural network features for speaker diarization. In: Proceedings of SLT Workshop, pp. 402–406. IEEE (2014) 46. Zˆ ao, L., Coelho, R.: Colored noise based multicondition training technique for robust speaker identification. IEEE Signal Process. Lett. 18(11), 675–678 (2011) 47. Zibert, J., Mihelic, F.: Prosodic and phonetic features for speaker clustering in speaker diarization systems. In: Proceedings of INTERSPEECH, pp. 1033–1036. ISCA (2011)
Speaker Diarization: A Top-Down Approach Using Syllabic Phonology
Erik Edwards1(B), Amanda Robinson1, Najmeh Sadoughi1, Greg P. Finley1, Maxim Korenevsky1, Michael Brenndoerfer2, Nico Axtmann3, Mark Miller1, and David Suendermann-Oeft1
1 EMR.AI Inc., San Francisco, CA, USA
[email protected]
2 University of California Berkeley, Berkeley, CA, USA
3 DHBW, Karlsruhe, Germany
Abstract. A top-down approach to speaker diarization is developed using a modified Baum-Welch algorithm. The HMM states combine phonemes according to structural positions under syllabic phonological theory. By nature of the structural phonology, there are at most 16 states, and the transition matrix is sparse, allowing efficient decoding to structural phones. This addresses the issue of phoneme specificity in speaker diarization – that speaker similarities/differences are confounded by phonetic similarities/differences. We address this here without the expensive use of a complete set of individual phonemes. The voice activity detection (VAD) issue is likewise addressed, giving a new approach to VAD.
Keywords: Speaker diarization · Speech activity detection · Syllable
1 Introduction
When attempting the "who spoke when" question, i.e. speaker diarization, one must use features that distinguish different speakers of the dialog. These distinctions are confounded by phonemic differences, which are ultimately irrelevant to the labeling of speaker turns. This is the opposite of the situation in automatic speech recognition (ASR), where phone identities must be labeled, and speaker differences ignored. The problem in ASR is that of "speaker adaptation", whereas the problem in speaker diarization is sometimes referred to as "phoneme specificity" or "phone adaptive training". We present here a novel speaker diarization system that addresses the problem of phoneme specificity, while remaining highly computationally efficient. The earliest approaches to diarization used a "bottom-up" approach of agglomerative clustering of feature vectors of different frames [14]. These are also called "unsupervised" in the sense that they require no labeled training data [35]. These approaches have remained heavily used in the literature [2,3]. Later systems began to introduce "top-down" approaches in combination with the bottom-up methods [12,37,40], but these require labeled training data. In
fact, the first such paper [37] was also the first to introduce synthetic dialog data for training purposes. Another early top-down approach [40] was the first to use HMM models with Baum-Welch training (although not as here, where we use it at diarization time). We first tried the bottom-up approach, where we found the issue of phoneme specificity to be strongly confounding. That is, for example, two fricatives from different speakers can be highly similar in their acoustic features, while a fricative and a vowel from the same speaker can be highly dissimilar. A number of papers have now addressed the problem of phoneme specificity/adaption in speaker diarization [4,5,18,31,36,39,42,43]. This issue is also well known in the larger literature on speaker recognition and verification [8,17]. We therefore abandoned the bottom-up approach in favor of the top-down approach presented here. This required a reliable set of training data, wherein both speaker labels and phone labels are available (since we desire to study phoneme specificity). Therefore, we also introduced our own synthetic corpus (Sect. 2, and described fully in the companion paper). Our motivating application is the segmentation of doctor-patient dialogs, where the diarization is followed by ASR and information extraction [9]. Therefore, several of our basic decisions were guided by this application. First, the ASR stage requires MFCC features [7], so we attempt speaker diarization with the same MFCC features, but supplemented with a small number of auxiliary features. Second, we focus on the case of 2-speaker dialogs, which covers the great majority of doctor-patient encounters (although our approach is easily generalizable to 3+ speakers). Third, the issue of overlapped speech is less problematic in doctor-patient dialogs, because it is a situation where both members of the dialog have a high motivation to listen and to respect speaker turn taking. Other than yes/no responses, most medically critical information is delivered in longer turns with little or no overlap. Therefore, for our first system presented here, the focus is entirely on correct labeling of speaker identity, but not necessarily on refining the exact edges of speaker turns. In our system, each speaker-turn segment is submitted to the ASR stage with some leading/trailing audio anyway, so we have adopted the most typical “collar” used in diarization publications, which is 250 ms. The “collar” is a region around the segment boundaries that is ignored for computing the diarization error rate (DER) [1]. Finally, our system must operate in real-time, so there is a strong focus here on remaining computationally efficient at the time of diarization.
2 Synthetic Diarization Corpus
Doctor-patient dialogs are not freely available for diarization research. Existing data sets for diarization contain many speakers (e.g. meetings with 4 to 10+ speakers); or seem particular to a given situation or audio channel; or have speaker turns labeled, but not phonemic segmentations; or lack a large quantity of training data in addition to test sets; or cannot be obtained freely for general use. Therefore, we have developed a synthetic corpus as a basic starting point for diarization research, utilizing the open-source LibriSpeech corpus [27]. This synthetic corpus (Table 1) is described fully in the companion paper.
Table 1. Corpus of synthetic LibriSpeech dialogs.

            Dialogs  Utts (Turns)  Tokens  Hours
Train           292         28522  989715  98.15
Dev-clean        48          2673   53765   4.98
Dev-other        45          2822   50227   4.69
Test-clean       43          2605   52279   5.07
Test-other       45          2861   51305   4.85

3
New Lexicon with Syllabic Phonology
The concept of the syllable has a long tradition in linguistics, dating at least to the ancient Greek συλλαβη and Latin syllaba [16,25,38]. Use of the syllable in ASR dates to one of the earliest systems [26], and has recurred many times since [11,20]. However, syllabic approaches have consistently remained outside of the mainstream of ASR, and have been used only very rarely in speaker recognition [23,24,34]. We know of no syllable-based work in the speaker diarization or VAD literatures. One contributing factor may be the absence of a lexicon from which syllabic segmentations can be obtained directly. There is no simple method for obtaining syllabifications from ARPAbet-based lexicons [33], such as the widely-used CMUdict [28]. We have therefore developed an English lexicon utilizing syllabic phonology. For present purposes, this essentially means that each phoneme is assigned a structural position (i.e. Affix, Onset, Peak, Coda, Suffix), according to the most widely-accepted phonological theory [10,15,19,32]. The immediate practical motivation for introducing syllabic positions into our diarization work is that we would like to address phoneme specificity without however introducing a full phoneme-based decoding (as in ASR), which would be computationally expensive. On the other hand, there are only a handful of syllabic structural positions (5-15, depending on how many sub-positions are used), and the transition matrix for the structural positions is sparse. Thus, in the above 5-position scheme, Affix can only precede Onset; Onset can only precede Peak; Coda can only follow Peak; and Suffix can only follow Coda. An English utterance is a rather predictable succession of structural positions, and a dialog simply allows these to transition between speakers. Since the vowel phones occur exclusively in the Peak position, and since vowel segments are the dominant source of speaker characteristics, the Peak segments can be primarily used to distinguish speakers. This is the original idea and motivation; the resulting system in practice is given next.
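To illustrate how constrained the resulting transition structure is, the snippet below encodes the 5-position scheme and the within-syllable successions stated above. The names and dictionary layout are our own shorthand; transitions across syllable and speaker boundaries would be added on top of this skeleton.

# Structural positions of the 5-position scheme.
POSITIONS = ["Affix", "Onset", "Peak", "Coda", "Suffix"]

# Allowed successor positions inside one syllable, following the text:
# Affix can only precede Onset; Onset can only precede Peak;
# Coda can only follow Peak; Suffix can only follow Coda.
WITHIN_SYLLABLE = {
    "Affix":  {"Onset"},
    "Onset":  {"Peak"},
    "Peak":   {"Coda"},      # a syllable may also end at the Peak (handled across syllables)
    "Coda":   {"Suffix"},    # or syllable end
    "Suffix": set(),         # always syllable-final
}

def allowed(prev_pos, next_pos):
    """True if next_pos may directly follow prev_pos within a syllable."""
    return next_pos in WITHIN_SYLLABLE[prev_pos]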
4
Diarization Method
Our speaker diarization system proceeds in two general stages: (1) Feature extraction and decorrelation/dimensionality reduction; (2) an expectation
maximization (EM) algorithm to obtain posterior probabilities of HMM states, from which the speech/silence and speaker labels are obtained. All coding was done in C. 4.1
Feature Extraction
Our full system cascades an ASR stage after diarization, so, for efficiency, we begin with the ASR acoustic features (40-D MFCCs [7]), supplemented with a small number of auxiliary features. Specifically, we append the 4-D Kaldi pitch features [13] and the 5-D VAD features of [29]. These are supplemented with Δ features, making a 98-D feature set in total. This is reduced by PCA (principal component analysis) to a 32-D output, followed by multi-class LDA (linear discriminant analysis) [41]. LDA was trained on labels defined by the 7 syllabic phone categories below, with vowels differentiated by the 251 unique speakers, giving 258 LDA labels in total (1 silence, 6 consonant, and 251 vowel labels). All results presented here use a reduced set of 12-D LDA components. Finally, we convert the 12-D LDA features to percentile units, where 128 bins were learned for each LDA feature from the training data. This allows the features to be held as char variables (the smallest data type in C) and used for direct table lookup, leading to greater computational efficiency at the time of diarization. Also, since the features are decorrelated by PCA/LDA, this allows the use of a direct (binned) probability representation, whereas GMM probability representations were found to perform worse and take > 2× longer computationally.
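The sketch below mirrors this front end with scikit-learn as a rough Python equivalent of our C implementation; it is not the implementation itself, the raw 98-D frame features are assumed to be computed elsewhere, and all function names are ours.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X_train: (n_frames, 98) stacked MFCC + pitch + VAD frame features with deltas.
# y_train: per-frame LDA labels (1 silence + 6 consonant + 251 speaker-specific vowel = 258 classes).
def fit_front_end(X_train, y_train, n_pca=32, n_lda=12, n_bins=128):
    pca = PCA(n_components=n_pca).fit(X_train)
    lda = LinearDiscriminantAnalysis(n_components=n_lda).fit(pca.transform(X_train), y_train)
    Z = lda.transform(pca.transform(X_train))
    # Learn percentile bin edges per LDA dimension so features can be stored as 1-byte indices.
    edges = [np.percentile(Z[:, d], np.linspace(0, 100, n_bins + 1)[1:-1]) for d in range(n_lda)]
    return pca, lda, edges

def to_bins(X, pca, lda, edges):
    Z = lda.transform(pca.transform(X))
    cols = [np.searchsorted(e, Z[:, d]) for d, e in enumerate(edges)]
    return np.stack(cols, axis=1).astype(np.uint8)   # one byte per LDA dimension and frame

At diarization time, to_bins yields one byte per LDA dimension and frame, which is what makes the direct table lookup of binned emission probabilities cheap.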
4.2
Modified Baum-Welch Algorithm and HMM States
The Baum-Welch algorithm is a method for iteratively estimating the parameters of an HMM [21]. As such, it is usually applied during training, and the resulting parameters are fixed at decoding time. However, here we adapt the Baum-Welch algorithm to perform diarization on test data. The training data is only used to initialize the HMM parameters, and then the modified Baum-Welch algorithm adapts to the audio file under consideration by EM iterations. The update equations of the Baum-Welch are well-known and not covered here. More importantly, we have arrived at a method of progressive untying of HMM states with successive stages of iterations, such that stage 1 essentially provides a soft VAD output, and the last stage achieves the full diarization. A recorded 2-person dialog consists of an initial segment of silence, alternating utterances of speakers 1 and 2 (with silent gaps within and between), and then a final segment of silence. The first person to speak is labeled "speaker 1" by definition, and "silence" includes any irrelevant background noise and often breath sounds. Note that initial silence is special in terms of the HMM A matrix, because the dialog must begin in this state, and this state must transition to speaker 1. However, we found no advantage in keeping the final silence as a separate state, nor in keeping within- vs. between-speaker silences separate. Thus, our HMM model has 4 overall states: (1) Speaker 1; (2) Speaker 2; (3) Initial silence;
(4) Other silence. For the B matrix (emission probabilities), all silences remain tied together in one "tie-group". Next, we split the Speaker 1 and 2 states according to syllabic phonology, in order to address phoneme specificity (see Introduction). The following split into 7 phoneme categories was found so far to perform best:
1. Prevocalic stops (B, D, G, K, P, T)
2. Prevocalic fricatives/affricates (CH, DH, F, HH, JH, S, SH, TH, V, Z, ZH)
3. Prevocalic liquids/nasals/semi-vowels (L, N, M, NG, R, W, Y)
4. Vowels (AA, AE, AH, ..., UW) (inclusive of all stress levels)
5. Postvocalic liquids/nasals/semi-vowels (L, N, M, NG, R, W, Y)
6. Postvocalic stops (B, D, G, K, P, T)
7. Postvocalic fricatives/affricates (CH, DH, F, HH, ..., Z, ZH).
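The category list translates directly into a small lookup table. The helper below is our own transcription of it; the category indices follow the numbering above, and the prevocalic/postvocalic flag is assumed to come from the syllabified lexicon of Sect. 3.

STOPS      = {"B", "D", "G", "K", "P", "T"}
FRICATIVES = {"CH", "DH", "F", "HH", "JH", "S", "SH", "TH", "V", "Z", "ZH"}
SONORANTS  = {"L", "M", "N", "NG", "R", "W", "Y"}   # liquids/nasals/semi-vowels

def phone_category(phone, prevocalic):
    """Map an ARPAbet phone to one of the 7 structural-phone categories (1-7)."""
    base = phone.rstrip("012")          # strip stress digits from vowels
    if base in STOPS:
        return 1 if prevocalic else 6
    if base in FRICATIVES:
        return 2 if prevocalic else 7
    if base in SONORANTS:
        return 3 if prevocalic else 5
    return 4                            # vowels of any stress level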
This breakdown uses the most important phonemic distinction according to syllabic positions, which is the pre- vs. postvocalic distinction. This refers to consonants which lie before vs. after the vowel within the syllable. This distinction was emphasized already by Saussure (his "explosive" vs. "implosive" consonants) [30], and by the early Haskins studies of speech (their "initial" vs. "final" consonants) [6,22]. In terms of syllabic phonology, prevocalic merges the Affix and Onset positions, postvocalic merges the Coda and Suffix positions, and vowel is the same as the Peak position. The pre- vs. postvocalic split was found to improve performance already at the VAD stage, whereas fewer distinctions (4 phone categories) and more refined distinctions (up to 15 phone categories) deteriorated performance. Thus, we proceed with the 7 structural-phone categories. These phone categories define 7 HMM states per speaker, now giving 16 HMM states total (2 silence states + 7 states per speaker). Finally, we use the traditional 3 left-to-right substates per basic state, giving a grand total of N = 48 HMM states. Note that the major purpose of the 3 substates is to provide more realistic durational modeling by the transition matrix (A). For concreteness, we list these HMM states explicitly:
– HMM States 0-2: Initial silence
– HMM States 3-5: Other silence
– HMM States 6-8: Speaker 1, prevocalic stops
– HMM States 9-11: Speaker 1, prevocalic fricatives/affricates
– HMM States 12-14: Speaker 1, prevocalic liquids/nasals/semivowels
– HMM States 15-17: Speaker 1, vowels
– HMM States 18-20: Speaker 1, postvocalic liquids/nasals/semivowels
– HMM States 21-23: Speaker 1, postvocalic stops
– HMM States 24-26: Speaker 1, postvocalic fricatives/affricates
– HMM States 27-29: Speaker 2, prevocalic stops
– HMM States 30-32: Speaker 2, prevocalic fricatives/affricates
– HMM States 33-35: Speaker 2, prevocalic liquids/nasals/semivowels
– HMM States 36-38: Speaker 2, vowels
– HMM States 39-41: Speaker 2, postvocalic liquids/nasals/semivowels
– HMM States 42-44: Speaker 2, postvocalic stops
– HMM States 45-47: Speaker 2, postvocalic fricatives/affricates
The HMM A matrix, representing transition probabilities between these states, is learned once from the training data. Importantly, we do not update the A matrix during the modified Baum-Welch iterations. This is the most time-consuming update computation, and it has negligible consequences for diarization. Moreover, it was found to be better to sparsify the A matrix by setting direct (0-ms lag) Speaker 1 to Speaker 2 transitions to 0. The HMM B matrices, representing emission probabilities for each state, are first learned from the training data, and then updated with each iteration of the Baum-Welch during diarization. However, it is common practice to tie HMM states so that their emission probabilities are estimated jointly. This is particularly important if there is too little data. Moreover, most diarization systems begin with a VAD stage (speech vs. silence) before making the more refined distinctions for diarization. An important result of our preliminary investigations was that the B matrices are best updated with strong ties across states initially, followed by progressive untying of the states towards the final diarization. We arrived at a 3-stage procedure, wherein the first stage uses only 8 tie groups (silence plus the 7 structural-phone categories), the last stage leaves most states untied, and the middle stage uses an intermediate degree of tying. Specifically, using the 48 HMM states enumerated above, the following 3 stages of state tie groups were found to work best:
STAGE 1 TYING OF B MATRIX:
– TIE-GROUP 0 == HMM States 0-5 (Silence)
– TIE-GROUP 1 == HMM States 6-8, 27-29 (Prevocalic stops)
– TIE-GROUP 2 == HMM States 9-11, 30-32 (Prevocalic fricatives)
– TIE-GROUP 3 == HMM States 12-14, 33-35 (Prevocalic liquids/nasals)
– TIE-GROUP 4 == HMM States 15-17, 36-38 (Vowels)
– TIE-GROUP 5 == HMM States 18-20, 39-41 (Postvocalic liquids/nasals)
– TIE-GROUP 6 == HMM States 21-23, 42-44 (Postvocalic stops)
– TIE-GROUP 7 == HMM States 24-26, 45-47 (Postvocalic fricatives)
It can be seen that no distinction is made in Stage 1 between speakers. This is therefore a speech vs. silence stage, except that speech has been expanded into the 7 structural-phone categories. This is, in fact, a new method of VAD, with soft (posterior probability) outputs. These are then used to initialize Stage 2 of the Baum-Welch iterations, where only the vowels are used to begin the separation of speakers. Thus, TIE-GROUP 4 of Stage 1 is split into 2 tie-groups in Stage 2.
STAGE 2 TYING OF B MATRIX:
– TIE-GROUP 0 == HMM States 0-5 (Silence)
– TIE-GROUP 1 == HMM States 6-8, 27-29 (Prevocalic stops)
– TIE-GROUP 2 == HMM States 9-11, 30-32 (Prevocalic fricatives)
– TIE-GROUP 3 == HMM States 12-14, 33-35 (Prevocalic liquids/nasals)
– TIE-GROUP 4 == HMM States 15-17 (Speaker 1 Vowels)
– TIE-GROUP 5 == HMM States 36-38 (Speaker 2 Vowels)
– TIE-GROUP 6 == HMM States 18-20, 39-41 (Postvocalic liquids/nasals)
– TIE-GROUP 7 == HMM States 21-23, 42-44 (Postvocalic stops)
– TIE-GROUP 8 == HMM States 24-26, 45-47 (Postvocalic fricatives)
It should be kept in mind that speaker distinctions are most usefully obtained from vowels. A major purpose of the consonant categories is just to separate them out from the vowels, so as not to contaminate the acoustic evidence provided during the vowel states. Consonants also provide some degree of power to distinguish speakers, but we leave these states tied across speakers until the final iterations, in order not to interfere. Experiments showed that all 3 of these stages (and in this order of coarse-to-refined) were necessary to achieve the best performance. 8 EM iterations per stage were used for all results here. Following the 24 EM iterations of the 3-stage Baum-Welch algorithm, the posterior probabilities are summed across all Speaker 1 states, all Speaker 2 states, and all Silence states. By this method, it is not important if the algorithm has perfectly separated the various consonant categories, because they are all summed together with the vowel states for each speaker. The final diarization label is taken as the maximum of these three probabilities for each time frame.
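The final decision step is compact enough to state in a few lines. The sketch below assumes gamma holds the per-frame posteriors of the 48 states in the order enumerated above; the grouping and argmax follow the description in the text, and the variable names are ours.

import numpy as np

def frame_labels(gamma):
    # gamma: (n_frames, 48) state posteriors from the last EM stage.
    # States 0-5 are silence, 6-26 belong to Speaker 1, 27-47 to Speaker 2.
    scores = np.stack([gamma[:, 0:6].sum(axis=1),     # silence
                       gamma[:, 6:27].sum(axis=1),    # Speaker 1 (7 categories x 3 substates)
                       gamma[:, 27:48].sum(axis=1)],  # Speaker 2
                      axis=1)
    return scores.argmax(axis=1)                      # 0 = silence, 1 = speaker 1, 2 = speaker 2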
5
Results and Discussion
We present results for the synthetic LibriSpeech dialog corpus (Sect. 2), and for 2 recordings of doctor-actor dialogs. In the latter, a real doctor interviewed an actor playing a patient (to avoid privacy issues). The doctors were male and the patients female. Audio was recorded by a cell phone. The 2 dialogs were 6.4 min and 5.7 min in duration, and were used for test data only. All training to initialize the HMM A and B matrices was done on the synthetic corpus. For the synthetic LibriSpeech corpus, we obtain the following DERs, using a collar of 250 ms, as assessed with the widely-used md-eval-v21.pl script (from NIST). The same collar and script were used to assess the VAD error rate (VER) (Table 2).

Table 2. Results for synthetic LibriSpeech dialogs.

            Mean DER  Max DER  Mean VER  Max VER
Dev-clean      0.66%    2.44%     0.62%    2.38%
Dev-other      0.94%    3.75%     0.90%    3.75%
Test-clean     0.95%    4.45%     0.78%    4.44%
Test-other     1.18%    5.58%     1.12%    5.42%
It can be seen that, using the liberal collar of 250 ms, the algorithm can successfully detect speech (VAD) and then diarize all of the development and test files. It must be emphasized that this is by no means a guaranteed result: previous versions of our diarization methods obtained mean DERs closer to 5–10%, or worse (i.e., the early bottom-up method). Also, the present algorithm under different settings would often fail on a small subset of files, e.g. obtaining max DERs worse than 20–30%. The influential settings are: inclusion of VAD and pitch features; number of LDA components; types of phonological distinctions; type of probability model for the B matrices (e.g. GMMs performed worse); and, critically, the tying and progressive untying of HMM states during successive stages of the EM iterations. Interestingly, the majority of the observed DER is due to VER (VAD error). Thus, the grand-mean DER was 0.93% and the grand-mean VER was 0.85%, and it was common (under the liberal collar of 250 ms) to observe files with the same DER as VER, meaning that the algorithm rarely struggles to separate speaker characteristics if the stage-1 (soft VAD) outputs are accurate. In fact, some of the VAD errors obtained may be considered spurious, as breath noise is not consistently treated in the forced alignments. The results imply that future improvements should first focus on the Stage 1 VAD phase. For the live doctor-actor dialog recordings, we obtain (Table 3):

Table 3. Results for recorded doctor-actor dialogs.

            DER     VER
Dialog 1    4.06%   3.26%
Dialog 2   10.00%   9.13%
Average     7.20%   6.37%
Thus, a reasonable diarization of the real-world recordings was still obtained, despite the fact that the HMM model was trained only on synthetic data with no overlap. The LibriSpeech corpus is primarily American speech, whereas the doctor-actor dialogs here were British speech; and the recording method (cell phone) was quite different than for the training corpus. Also, the real-world dialogs contain many segments of coughing and other non-speech sounds that are not present in the training data, as well as many hesitation sounds (“umm”, “ahh”). Finally, the manual diarization of these dialogs is likely not perfect. Therefore, the average DER of 7.2% is encouraging for the applicability of the general methods reported here, although we will clearly need to obtain matched training data for the methods to fully work.
6
Summary and Conclusion
We have presented our initial speaker diarization system, with the intended application of doctor-patient dialogs. Training on a synthetic corpus, to initialize
HMM parameters, allowed successful diarization of recorded doctor-patient dialogs. The HMM parameters are updated in 3 stages of EM iterations, at the time of diarization. Emphasis was on computational efficiency, leading to a reduced Baum-Welch algorithm that omits A-matrix updates, and uses discrete (binned) probability distributions. HMM states are based on only 7 structural phones, as motivated by syllabic phonological theory, with sparse transition matrix, allowing an efficient approach to the phoneme specificity problem. The first of the 3 EM stages replaces the usual VAD stage, also improving total efficiency.
References
1. Anguera Miró, X.: Robust speaker diarization for meetings. Ph.D. thesis, Univ. Politècnica de Catalunya (2006)
2. Anguera Miró, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process. 20(2), 356–370 (2012)
3. Anguera, X., Wooters, C., Peskin, B., Aguiló, M.: Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system. In: Renals, S., Bengio, S. (eds.) MLMI 2005. LNCS, vol. 3869, pp. 402–414. Springer, Heidelberg (2006). https://doi.org/10.1007/11677482_34
4. Bozonnet, S., Vipperla, R., Evans, N.: Phone adaptive training for speaker diarization. In: Proceedings of INTERSPEECH, pp. 494–497. ISCA (2012)
5. Chen, I.F., Cheng, S.S., Wang, H.M.: Phonetic subspace mixture model for speaker diarization. In: Proceedings of INTERSPEECH, pp. 2298–2301. ISCA (2010)
6. Cooper, F., Delattre, P., Liberman, A., Borst, J., Gerstman, L.: Some experiments on the perception of synthetic speech sounds. J. Acoust. Soc. Am. 24(6), 597–606 (1952)
7. Edwards, E., et al.: Medical speech recognition: reaching parity with humans. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 512–524. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_51
8. Fakotakis, N., Tsopanoglou, A., Kokkinakis, G.: A text-independent speaker recognition system based on vowel spotting. Speech Commun. 12(1), 57–68 (1993)
9. Finley, G., et al.: An automated medical scribe for documenting clinical encounters. In: Proceedings of NAACL. ACL (2018)
10. Fudge, E.: Branching structure within the syllable. J. Linguist. 23(2), 359–377 (1987)
11. Fujimura, O.: Syllable as a unit of speech recognition. IEEE Trans. Acoust. 23(1), 82–87 (1975)
12. Gauvain, J.L., Adda, G., Lamel, L., Adda-Decker, M.: Transcribing broadcast news: the LIMSI Nov96 Hub4 system. In: Proceedings of DARPA Speech Recognition Workshop, pp. 56–63. DARPA (1997)
13. Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., Khudanpur, S.: A pitch extraction algorithm tuned for automatic speech recognition. In: Proceedings of ICASSP, pp. 2494–2498. IEEE (2014)
14. Gish, H., Siu, M.H., Rohlicek, J.: Segregation of speakers for speech recognition and speaker identification. In: Proceedings of ICASSP, vol. 2, pp. 873–876. IEEE (1991)
15. Goldsmith, J.: The syllable. In: Goldsmith, J., Riggle, J., Yu, A. (eds.) The Handbook of Phonological Theory, 2nd edn., pp. 165–196. Wiley, Malden (2011)
16. Guest, E.: A History of English Rhythms. W. Pickering, London (1838)
17. Hansen, E., Slyh, R., Anderson, T.: Speaker recognition using phoneme-specific GMMs. In: Proceedings of Odyssey Workshop, pp. 179–184. ISCA (2004)
18. Hsieh, C.H., Wu, C.H., Shen, H.P.: Adaptive decision tree-based phone cluster models for speaker clustering. In: Proceedings of INTERSPEECH, pp. 861–864. ISCA (2008)
19. Kessler, B., Treiman, R.: Syllable structure and the distribution of phonemes in English syllables. J. Mem. Lang. 37(3), 295–311 (1997)
20. Kozhevnikov, V., Chistovich, L.: Speech: articulation and perception. Translation JPRS 30543, Joint Public Research Service, U.S. Department of Commerce (1965)
21. Levinson, S., Rabiner, L., Sondhi, M.: An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62(4), 1035–1074 (1983)
22. Liberman, A., Ingemann, F., Lisker, L., Delattre, P., Cooper, F.: Minimal rules for synthesizing speech. J. Acoust. Soc. Am. 31(11), 1490–1499 (1959)
23. Martin, T., Wong, E., Baker, B., Mason, M., Sridharan, S.: Pitch and energy trajectory modelling in a syllable length temporal framework for language identification. In: Proceedings of Odyssey Workshop, pp. 289–296. ISCA (2004)
24. Mary, L., Yegnanarayana, B.: Extraction and representation of prosodic features for language and speaker recognition. Speech Commun. 50(10), 782–796 (2008)
25. Mitford, W.: An Inquiry into the Principles of Harmony in Language, and of the Mechanism of Verse, Modern and Antient, 2nd edn. L. Hansard, London (1804)
26. Olson, H., Belar, H.: Phonetic typewriter. J. Acoust. Soc. Am. 28(6), 1072–1081 (1956)
27. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: Proceedings of ICASSP, pp. 5206–5210. IEEE (2015)
28. Rudnicky, A.: CMUdict 0.7b: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2015). https://github.com/Alexir/CMUdict
29. Sadjadi, S., Hansen, J.: Unsupervised speech activity detection using voicing measures and perceptual spectral flux. IEEE Signal Process. Lett. 20(3), 197–200 (2013)
30. Saussure, F.: Cours de linguistique générale. Payot, Lausanne, Paris (1916)
31. Schindler, C., Draxler, C.: Using spectral moments as a speaker specific feature in nasals and fricatives. In: Proceedings of INTERSPEECH, pp. 2793–2796. ISCA (2013)
32. Selkirk, E.: The syllable. In: van der Hulst, H., Smith, N. (eds.) The Structure of Phonological Representations, vol. 2, pp. 337–384. Foris, Dordrecht (1982)
33. Shoup, J.: Phonological aspects of speech recognition. In: Lea, W. (ed.) Trends in Speech Recognition, pp. 125–138. Prentice-Hall, Englewood Cliffs (1980)
34. Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., Stolcke, A.: Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3–4), 455–472 (2005)
35. Siu, M.H., Yu, G., Gish, H.: An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers. In: Proceedings of ICASSP, vol. 2, pp. 189–192. IEEE (1992)
36. Soldi, G., Bozonnet, S., Alegre, F., Beaugeant, C., Evans, N.: Short-duration speaker modelling with phone adaptive training. In: Proceedings of Odyssey Workshop, pp. 208–215. ISCA (2014)
37. Sugiyama, M., Murakami, J., Watanabe, H.: Speech segmentation and clustering based on speaker features. In: Proceedings of ICASSP, vol. 2, pp. 395–398. IEEE (1993)
38. Wallis, J.: Grammatica linguae Anglicanae. L. Lichfield, Oxford (1674)
39. Wang, G., Wu, X., Zheng, T.: Using phoneme recognition and text-dependent speaker verification to improve speaker segmentation for Chinese speech. In: Proceedings of INTERSPEECH, pp. 1457–1460. ISCA (2010)
40. Wilcox, L., Chen, F., Kimber, D., Balasubramanian, V.: Segmentation of speech using speaker identification. In: Proceedings of ICASSP, vol. 1, pp. 161–164. IEEE (1994)
41. Yamada, M., Pezeshki, A., Azimi-Sadjadi, M.: Relation between kernel CCA and kernel FDA. In: Proceedings of IJCNN, pp. 226–231. IEEE (2005)
42. Yella, S., Motlíček, P., Bourlard, H.: Phoneme background model for information bottleneck based speaker diarization. In: Proceedings of INTERSPEECH, pp. 597–601. ISCA (2014)
43. Zibert, J., Mihelic, F.: Prosodic and phonetic features for speaker clustering in speaker diarization systems. In: Proceedings of INTERSPEECH, pp. 1033–1036. ISCA (2011)
Improving Emotion Recognition Performance by Random-Forest-Based Feature Selection
Olga Egorow, Ingo Siegert, and Andreas Wendemuth
Cognitive Systems Group, Otto von Guericke University, 39016 Magdeburg, Germany
[email protected]
Abstract. As the technical systems around us aim at more natural interaction, the task of automatic emotion recognition from speech receives ever growing attention. One important question still remains unresolved: the definition of the most suitable features across different data types. In the present paper, we employed a random-forest-based feature selection known from other research fields in order to select the most important features for three benchmark datasets. Investigating feature selection on the same corpus as well as across corpora, we achieved an increase in performance using only 40 to 60% of the features of the well-known emobase feature set.
Keywords: Speech emotion recognition · Feature selection · Random forest
1
Introduction
Speech is a carrier of different kinds of information – besides the pure semantic content of an utterance, there are several layers underneath [14]. In humanhuman interaction (HHI), the interlocutors try to extract this additional information, often using multiple channels – simply speaking, by listening not only to what is said but also how it is said. One such layer of information is the emotional layer – the same sentence can have different meanings depending on its emotional toning. This can be transferred to the domain of human-computer interaction (HCI) to enable computer systems to understand the emotional level in order to make HCI more natural and pleasant for the user. Unfortunately, the recent performance boost in speech recognition provided by deep learning did not improve the performance of emotion recognition alike: Although there are first attempts to implement end-to-end approaches [24], they are still in their infancy and rely on multimodal data. As long as the required massive data amounts are not yet available for audio-based emotion recognition, it is necessary to explore the existing possibilities and to look for other ways to improve the performance of current systems. One such way is the extraction and selection of the most suitable features. c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 134–144, 2018. https://doi.org/10.1007/978-3-319-99579-3_15
Since the Interspeech 2009 Emotion challenge [21], the emobase feature set (as described in detail in [10]) is often used as a go-to feature set for various acoustic recognition systems: e.g. dialogue performance [19], user state detection [8], physical pain detection [17], etc. It contains 988 features based on 19 functionals of 26 Low-Level-Descriptor (LLDs) and their deltas: Mel-Frequency Cepstral Coefficient (MFCC), Line Spectral Pairs (LSPs), intensity, fundamental frequency, and other – there are also larger versions of this set such as the 2010 emobase version and the emo large version containing 1582 and 6552 features, respectively. Besides these large feature sets, there are also relatively small ones, such as the GeMaps set [9], containing 18 LLDs (based on frequency and spectrum) and their derivatives, resulting in a total of only 62 features for the minimalistic and 88 features for the extended set. Although widely used, these sets are not perfect. So, the 988 features of emobase are often used to classify relatively small amounts of samples. The GeMaps set on the other hand, while having not as many features, does not achieve the same performance as emobase [9]. In the present study, we want to examine two questions. Our first research question is whether the emotion recognition performance achieved using the emobase feature set is the best possible, or whether the same or even better performance can be achieved with less features using a data-driven feature selection process. Our second question is whether the same features are important for different data types. To investigate these questions, we employ a Random Forest (RF)-based feature ranking procedure on three different corpora and conduct classification experiments using same-corpus as well as cross-corpus features. 1.1
Literature Review
As early as 2003, Kwon et al. have deducted that the extraction of good features is more important to the emotion recognition task than the choice of the optimal classifier [13]. The most frequently used features comprise prosodic and spectral information. One problem concerning such features is that their values depend on the individual speaker’s voice characteristics. Possible solutions are the calculation of speaker-independent features, such as the changes instead of the absolute values [15], or different normalisation methods [3]. Some research questions have already been answered: For example, it was shown that suprasegmental features perform better than segmental ones [22] or that features are not language-independent [26]. The choice of the best suitable features was also addressed in different investigations. So, Bitouk et al. used spectral features to classify emotions on two corpora and investigate the influence of different feature selection techniques, but none of the employed methods lead to clear gains [2]. Gharavian et al. presented a sophisticated feature selection approach based on fast correlation-based filters and genetic-algorithm-based optimisation to achieve 5% absolute improvement in terms of accuracy [11]. Unfortunately, the authors opted for a training and test set evaluation procedure instead of a true LeaveOne-Speaker-Out (LOSO) setting and therefore did not report on differences
between the speakers. Besides the usually employed prosodic and spectral features, there are also approaches investigating novel feature sets – for instance based on the Fourier parameters [25] and wavelets [18]. In the present study, we investigate the performance of RF-based feature selection on three benchmark emotional datasets in a LOSO setting and compare the features selected for different data types.
2
Datasets
In order to answer our research questions in as generalisable a way as possible, we employed three well-known benchmark corpora with different languages, emotion types and recording conditions. The Audiovisual Interest Corpus (AVIC) [20] is a dataset built around a product presenter in an English commercial presentation. The recordings were made in an office environment and contain three levels of interest (loi1 - loi3) as classes. The Berlin Emotional Speech Database (emoDB) [5] is a studio-recorded German dataset containing recordings of ten emotionally neutral sentences with seven emotions: anger, boredom, disgust, fear, joy, neutral, and sadness. The Speech Under Simulated and Actual Stress (SUSAS) dataset [12] contains acted and spontaneous emotional utterances of English speakers in four different conditions: neutral, medium stress, high stress and screaming. Some of the utterances also contain field noise. An overview of the details of the corpora is given in Table 1.

Table 1. Characteristics of the selected corpora.

Property       AVIC      emoDB     SUSAS
Quality        Office    Studio    Noisy
Language       English   German    English
Emotion type   Spont     Acted     Mixed
# Speakers     21 (10f)  10 (5f)   7 (3f)
# Emotions     3         7         4
# Samples      3002      535       3593

3
Feature Selection with Random Forests
In order to find the optimal number of features, we first ranked the features according to their importance for the classification task using RF. We then analysed the obtained feature rankings and compared them for different speakers of the same corpus as well as between the different corpora. In the last step, we compared the classification performance using an increasing number of features to find an optimum.
3.1
Feature Extraction
For feature extraction, we used the emobase feature set of the openSMILE toolkit mentioned above, providing 988 spectral and prosodic features extracted on utterance level (cf. [10] for details). In order to establish comparability of the features among different speakers, we standardised the data to zero mean and unit variance.
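As a concrete illustration, one way to obtain these utterance-level features and standardise them is sketched below. The paper used the openSMILE toolkit itself; the Python wrapper (opensmile), the wav_paths list and the scikit-learn scaler are our assumptions about equivalent tooling.

import opensmile
from sklearn.preprocessing import StandardScaler

# Extract the 988 emobase functionals, one row per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_files(wav_paths)   # wav_paths: list of utterance WAV files (assumed)

# Standardise to zero mean and unit variance, as described above.
X = StandardScaler().fit_transform(features.values)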
3.2
Feature Ranking
In order to select the most important features, it is necessary to rank the features according to their importance. One possibility for this is a feature ranking routine based on RF – an ensemble learning method combining a typically high number of binary decision trees [4]. In each decision tree, each node samples a random subset of features and chooses the feature that is best suited to split the data into classes based on an impurity measure (e.g. the Gini index or information gain). By iterating this process, the features can be ranked according to their ability to decrease the impurity. A detailed explanation can be found in [7, 23]. The method has been tested for several applications, for example in the field of spectroscopy analysis [16]. To realise this feature ranking procedure, we used the random forest implementation provided by KNIME [1]. The procedure consists of three steps as illustrated in Fig. 1. In the first step, a random forest containing a high number of trees with k levels each (k can be a low number since the most relevant features are close to the root) is built on the training portion of the data in order to obtain two statistical values for each feature f: the number of models M_i which use f as a split on tree level i, and the number of times T_i that f was in the feature sample for level i. Their quotient, summed over all levels, is the score S_f for each f:

S_f = \sum_{i=0}^{k} \frac{M_i}{T_i}

In a second step, a random score S_rand,f is generated by calculating the score in the same way, but now with randomly shuffled labels – this is done in order to eliminate a bias that might be contained in the data. In order to balance the influence of randomness, both S_f and S_rand,f are calculated ten times and then averaged. The new score S_new,f is then obtained in a final third step by subtracting S_rand,f from S_f: S_new,f = S_f − S_rand,f. The features are then sorted according to their final scores, the ranking indicating their importance. In order to avoid overfitting to the data, this procedure is executed in a LOSO manner: for each speaker, the feature ranking is performed only on the data of all the other speakers, excluding the data of the current speaker, which is reserved for later testing.
Fig. 1. An overview over the RF-based feature ranking procedure.
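A simplified re-implementation of this ranking is sketched below. It substitutes scikit-learn's Gini-based feature importances for the level-wise KNIME statistic S_f, but keeps the target-shuffling correction and the ten-fold averaging; as in the paper, it would be run once per held-out speaker on the remaining speakers' data. All names are ours and the result is only an approximation of the procedure actually used.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_ranking(X, y, n_trees=500, max_depth=5, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    true_score = np.zeros(X.shape[1])
    rand_score = np.zeros(X.shape[1])
    for r in range(n_repeats):
        rf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                    random_state=r).fit(X, y)
        true_score += rf.feature_importances_
        y_shuffled = rng.permutation(y)      # target shuffling to estimate a chance baseline
        rf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                    random_state=r).fit(X, y_shuffled)
        rand_score += rf.feature_importances_
    final = (true_score - rand_score) / n_repeats
    return np.argsort(final)[::-1]           # feature indices, most important first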
3.3
Comparison of Feature Rankings
One of our research questions was to investigate whether there are generally important features carrying emotional information or whether the most important features differ depending on the data. In order to answer this question we compared the feature rankings obtained on the three employed datasets and conduct several Pearson’s correlation tests – between the feature selection rankings of different speakers of the same corpus for intra-corpus comparison as well as between the feature selection rankings of different corpora for inter-corpus comparison. Intra-corpus Comparison. In order to test whether the feature rankings are consistent for all speakers within a corpus, we compared the LOSO rankings by conducting Pearson’s correlation tests. For AVIC, the Pearson’s correlation coefficient r between the feature rankings of the individual speakers lies between 0.95 and 0.98 (r = 0.97 ± 0.008), leading to the conclusion that the feature rankings of the speakers are very similar. Our idea was now to construct an average feature ranking for the whole corpus by averaging the feature rankings over all speakers, FAV IC . Naturally, the Pearson’s correlation between FAV IC and the feature rankings of the individual speakers is just as high as between the speakers, with values between 0.96 and 0.99 (r = 0.98 ± 0.008). The LLDs occurring most frequently in the top 100 are illustrated in Fig. 2a. For EmoDB, we implemented the same procedure. Here the correlations between the speakers are about as high as for AVIC, with r values between 0.95 and 0.98 (r = 0.98 ± 0.01) indicating that the feature rankings are consistent. Also, in the same way as for AVIC, we constructed a new average feature ranking FEmoDB . Again, r between FEmoDB and the feature rankings of the individual speakers is between 0.97 and 0.99 (r = 0.99 ± 0.006). The LLDs occurring most frequently in the top 100 are illustrated in Fig. 2b.
Fig. 2. Word clouds of the LLDs most frequently occurring in the top 100 of the feature rankings for (a) AVIC, (b) EmoDB and (c) SUSAS. The LLDs occurring for all three corpora are written in red. (Color figure online)
Finally, we repeated this procedure for SUSAS. The correlations between the feature rankings of the individual speakers are slightly lower than for EmoDB, with r values between 0.87 and 0.96 (r = 0.92 ± 0.03) but still sufficiently high to conclude that the feature rankings are consistent. The correlations between the average feature ranking FSU SAS and the individual rankings are between 0.92 and 0.98 (r = 0.96 ± 0.02). The LLDs occurring most frequently in the top 100 are illustrated in Fig. 2c. Inter-corpus Comparison. In the second step of our analysis, we compared the inter-corpus results in order to find whether the feature rankings are similar between the different types of data used. For this, we calculated the Pearson’s correlation coefficients between the previously constructed average feature rankings FemoDB , FSU SAS and FAV IC . In contrast to the intra-corpus comparison presented above, the results lead to the conclusion that there are no correlations between the feature rankings of the different corpora. For the correlation between FEmoDB and FAV IC , the r value is 0.18. For the correlation between FEmoDB and FSU SAS , r is even lower, 0.14. For FSU SAS and FAV IC , r is negative, −0.07. These results are shown in Fig. 2: There are only two LLDs shared by all three datasets (MFCC[5]and its derivative as well as the derivative of MFCC[10]). This means that, unfortunately, the feature rankings are not universally transferable for different types of data. However, there are similarities – different MFCCs seem to be the most important features, since they occur relatively often in the top 100 features for all three datasets. 3.4
Selecting the Optimal Number of Features
In the next part, we searched for an optimal number of features for each of the corpora. For this, we classified the data using an increasing number of features, starting with 50 features with the highest RF-scores and then consecutively adding 50 more features with decreasing scores in each step, until we reached the full 988 emobase feature set. In order to avoid overfitting, we again used a LOSO validation setting. For each feature subset, we calculated the Unweighted Average Recall (UAR) over all classes and speakers. The UARs achieved during this optimisation procedure are shown in Fig. 3. Here, AVIC and EmoDB show
Fig. 3. The UAR of the classification performance on the three datasets (AVIC, EmoDB, SUSAS) depending on the number of selected features (50 to 950). The results achieved using the full number of features (82.67%, 57.08% and 52.01%) are indicated by the dashed lines.
similar results: after starting with a rather low UAR value for low numbers of features, the UAR rises rapidly and stays at a stable value. However, for SUSAS the number of features seems to have less influence, since the UAR does not change as much as for the other two corpora.
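The sweep itself can be written compactly. The sketch below assumes per-speaker data splits, a precomputed LOSO ranking for each held-out speaker, and scikit-learn's SVC with default parameters in place of LibSVM; UAR is computed as macro-averaged recall, and the function and variable names are ours.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score

def sweep_feature_counts(data, rankings, steps=range(50, 1000, 50)):
    """data: dict speaker -> (X, y); rankings: dict speaker -> ranked feature indices
    learned without that speaker (LOSO). Returns mean UAR per feature count."""
    results = {}
    for k in steps:                                       # 50, 100, ..., 950 (append 988 for the full set)
        uars = []
        for spk, (X_test, y_test) in data.items():
            idx = rankings[spk][:k]                       # top-k features for this fold
            X_train = np.vstack([X for s, (X, _) in data.items() if s != spk])[:, idx]
            y_train = np.concatenate([y for s, (_, y) in data.items() if s != spk])
            clf = SVC().fit(X_train, y_train)             # default parameters, as in the paper
            y_pred = clf.predict(X_test[:, idx])
            uars.append(recall_score(y_test, y_pred, average="macro"))  # UAR
        results[k] = float(np.mean(uars))
    return results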
4
Classification Using Previously Selected Features
After selecting the optimal number of features, we conducted classification experiments in order to evaluate and compare the performance of the selected features to the full emobase feature set. 4.1
Classification Setup
For the classification, we again implemented the LOSO procedure as described above. Since we obtained between 7 and 21 models for each corpus, we decided against parameter fine-tuning and employed default Support Vector Machine (SVM) parameters as provided by the LibSVM library [6]. For evaluation, we computed the unweighted average f-measure (UAF) as the harmonic mean of the unweighted average recall and precision over all classes of one speaker, and then the unweighted average over all speakers. In order to include variations over speakers, we report the average values as well as the standard deviation as performance measures.
4.2
Classification Performance
The classification results are shown in Fig. 4 – we report the classification performance for each dataset, the baseline performance using all 988 emobase features and the performance using the previously selected features. Furthermore, we also report the results using cross-corpus feature selection. For this, we performed the
Fig. 4. The UAF of the classification performance for the emobase feature set FE, the best feature selection set FS, the cross-corpus feature set with the lower correlation FCC1 and with the higher correlation FCC2, for AVIC, emoDB and SUSAS.
classification on one dataset using the feature set obtained on another one. Since we used three corpora, this procedure results in two additional values per corpus: FCC1 denotes the results using the feature set with the lower correlation coefficient (as obtained in Sect. 3), FCC2 the results with the higher correlation coefficient. The classification with feature selection outperforms the classification using the full emobase feature set for all three corpora by several percent absolute – but the improvements lie within the standard deviation of the average values of the speakers. However, the results show that for all three corpora, a performance improvement can be achieved using between 40 and 60% less features than the original feature set. This is an interesting finding since feature extraction as well as classification are resource-intensive tasks, where a reduction of the processing overhead can be a real benefit – for example in the domain of mobile applications. Regarding the performance of the different feature sets across corpora, we can observe that the results are almost as expected: except for SUSAS, the “alien” feature sets obtained by feature selection on another corpus do not perform as good as the one obtained on the same corpus. Furthermore, FCC2 outperforms FCC1 in all cases (albeit marginally as for emoDB), which corresponds to the higher correlation between FCC2 and FS compared to FCC1 and FS . The only exception is SUSAS, where the FCC2 works about 0.7% better than FS . Based on these results, we can conclude that RF-based feature selection is a viable method to improve emotion recognition performance for different types of data.
5
Conclusion
The first question we aimed to investigate in this study was whether the number of features used for emotion recognition can be reduced achieving the same or even better performance. We have shown that by applying RF-based feature selection, we can reduce the number of features roughly by half and obtain an even better performance than using the full emobase set – furthermore, by using
three different corpora we have shown that this result is independent of the type of emotions, language and recording conditions. The second research question was whether there are inter-corpus similarities in the selected features. Here our finding is that the most important features are not consistent over different corpora, and therefore the feature selection needs to be done for each emotion recognition task separately. However, different MFCCs are among the most important features of all three corpora indicating that there is a common ground of acoustic information. There are two main directions for further research. The first interesting question here is to investigate further feature sets – besides larger versions of the emobase feature set including up to 6552 features also novel and less frequently used features such as the Fourier parameters and wavelet-based features are of interest. The second open question is to consolidate feature classes according to the type of material used – in this investigation, we have seen that features important for EmoDB differ from those for AVIC. The question is whether these differences are based on the type of emotions, on the emotional classes, on the recording conditions, or on some still unknown factors. This needs to be further investigated in order to understand the relations between the features and the information on the emotional status of the speaker contained in them. Acknowledgements. This work has been sponsored by the German Federal Ministry of Education and Research in the program Zwanzig20 – Partnership for Innovation as part of the research alliance 3Dsensation (grant number 03ZZ0414). It was also supported by the project Intention-based Anticipatory Interactive Systems (IAIS) funded by the European Funds for Regional Development (EFRE) and by the Federal State of Sachsen-Anhalt, Germany (grant number ZS/2017/10/88785).
References
1. Berthold, M.R., et al.: KNIME: The Konstanz information miner. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds.) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78246-9_38
2. Bitouk, D., Verma, R., Nenkova, A.: Class-level spectral features for emotion recognition. Speech Commun. 52(7–8), 613–625 (2010)
3. Böck, R., Egorow, O., Siegert, I., Wendemuth, A.: Comparative study on normalisation in emotion recognition from speech. In: Horain, P., Achard, C., Mallem, M. (eds.) IHCI 2017. LNCS, vol. 10688, pp. 189–201. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-72038-8_15
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of the INTERSPEECH-2005, pp. 1517–1520 (2005)
6. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. Trans. Intell. Syst. Technol. 2, 1–27 (2011)
7. Chen, Y.W., Lin, C.J.: Combining SVMs with various feature selection strategies. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications, pp. 315–324. Springer, Berlin Heidelberg (2006). https://doi.org/10.1007/978-3-540-35488-8_13
8. Egorow, O., Wendemuth, A.: Detection of challenging dialogue stages using acoustic signals and biosignals. In: Proceedings of the 24th International Conference on Computer Graphics, Visualization and Computer Vision, pp. 137–143 (2016)
9. Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. Trans. Affect. Comput. 7(2), 190–202 (2016)
10. Eyben, F., Wöllmer, M., Schuller, B.: OpenEAR - introducing the Munich open-source emotion and affect recognition toolkit. In: Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 1–6. IEEE (2009)
11. Gharavian, D., Sheikhan, M., Nazerieh, A., Garoucy, S.: Speech emotion recognition using FCBF feature selection method and GA-optimized fuzzy ARTMAP neural network. Neural Comput. Appl. 21(8), 2115–2126 (2012)
12. Hansen, J., Bou-Ghazale, S.: Getting started with SUSAS: A speech under simulated and actual stress database. In: Proceedings of the EUROSPEECH-1997, pp. 1743–1746 (1997)
13. Kwon, O.W., Chan, K., Hao, J., Lee, T.W.: Emotion recognition by speech signals. In: Proceedings of the 8th European Conference on Speech Communication and Technology (2003)
14. Levinson, S.C., Holler, J.: The origin of human multi-modal communication. Phil. Trans. R. Soc. B 369(1651), 20130302 (2014)
15. Mao, Q., Zhao, X., Zhan, Y.: Extraction and analysis for non-personalized emotion features of speech. Adv. Inf. Sci. Serv. Sci. 3(10), 255–263 (2011)
16. Menze, B.H., et al.: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10(1), 213 (2009)
17. Oshrat, Y., Bloch, A., Lerner, A., Cohen, A., Avigal, M., Zeilig, G.: Speech prosody as a biosignal for physical pain detection. In: Proceedings of Speech Prosody, pp. 420–424 (2016)
18. Palo, H.K., Mohanty, M.N.: Wavelet based feature combination for recognition of emotions. Ain Shams Eng. J. (2017, in Press)
19. Ramanarayanan, V., et al.: Using vision and speech features for automated prediction of performance metrics in multimodal dialogs. ETS Research Report Series 1 (2017)
20. Schuller, B., Müller, R., Hörnler, B., Höthker, A., Konosu, H., Rigoll, G.: Audiovisual recognition of spontaneous interest within conversations. In: Proceedings of the 9th International Conference on Multimodal Interfaces, pp. 30–37. ACM (2007)
21. Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun. 53(9–10), 1062–1087 (2011)
22. Schuller, B., Wöllmer, M., Eyben, F., Rigoll, G.: The role of prosody in affective speech, linguistic insights, studies in language and communication. Lang. Commun. 97, 285–307 (2009)
23. Silipo, R., Adae, I., Hart, A., Berthold, M.: Seven techniques for dimensionality reduction. Technical report, KNIME (2014)
24. Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B.W., Zafeiriou, S.: End-to-end multimodal emotion recognition using deep neural networks. J. Sel. Top. Signal Process. 11(8), 1301–1309 (2017)
25. Wang, K., An, N., Li, B.N., Zhang, Y., Li, L.: Speech emotion recognition using Fourier parameters. Trans. Affect. Comput. 6(1), 69–75 (2015)
26. Yang, C., Ji, L., Liu, G.: Study to speech emotion recognition based on TWINsSVM. In: Proceedings of the 5th International Conference on Natural Computation, vol. 2, pp. 312–316. IEEE (2009)
Coherence Understanding Through Cohesion Markers: The Case of Child Spoken Language
Polina Eismont (1), Vladislav Metelyagin (2), and Elena Riekhakaynen (2)
(1) Saint Petersburg State University of Aerospace Instrumentation, Bolshaya Morskaya Street 67, 190000 St. Petersburg, Russia, [email protected]
(2) Saint-Petersburg State University, Universitetskaya Emb. 7/9, 199034 St. Petersburg, Russia, [email protected], [email protected]
Abstract. Coherence and cohesion are crucial for organizing text semantics and syntax. They both may be described in terms of topic-focus structure, but the surface syntactic topic-focus structure does not coincide with that of deep semantics, and the automatic analysis of coherence which refers to the meaning of the whole text is complicated. The paper presents a Topic-Focus Annotating Parser (TFAP) that was trained on the corpus of Russian unprepared child oral narratives (213 narratives elicited by native Russian children aged from two years seven months to seven years six months). According to the results, children develop their narrative skills both in coherence and cohesion, but at the earlier stages of language acquisition, parsing errors reflect the speaker’s low level of narrative skills, while at the later stages (from five years seven months to seven years six months), when the basic rules of narrative organization are already acquired, parsing errors may be caused by the deficiencies of the parser. The topic-focus schemes we obtained support Leonid Sakharny’s theoretical approach to cognitive representation of coherence. Keywords: Child language Topic-Focus Annotating Parser Coherence Spoken narrative
Cohesion
1 Introduction The study of spoken language processing has always been a challenge for linguists primarily due to the methodological difficulties with obtaining the data (see, for example, [1] for an overview). Researchers have been describing and discussing phonetic, grammatical, lexical, and pragmatic aspects of spoken Russian since the end of the 1960s. However, we still do not have an accurate aggregate picture of how a speaker and a listener process spoken Russian. One of the ways to accumulate different aspects of spoken discourse is to study the multi-layered structure of oral narratives. The semantic meaning of a narrative (coherence) may be understood only through its surface syntactic structures (cohesion). Cohesion and coherence have been described as the basic principles of textuality [2, 3]. We will discuss current theoretical approaches to © Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 145–154, 2018. https://doi.org/10.1007/978-3-319-99579-3_16
these two sides of text organization and some background studies of coherence and cohesion resolution in Natural Language Processing in Sect. 2 of the paper. Children acquire narrative skills gradually, and both coherence and cohesion are difficult for them at the early stages of first language acquisition [4, 5]. The development of coherence and cohesion starts at the age of 4–5 years and continues up to secondary school, but narratives become coherent by the age of 7–8 years. In our study, we will analyse oral narratives elicited by native Russian children aged from four years seven months to seven years six months (for space considerations, we will hereinafter indicate the age of the participants as follows: Years;Months: e.g., 4;7–7;6); the data will be described in Sect. 3. Computer modelling is one of the methods to verify a theoretical approach [6]. In Sect. 4 of the paper, we will provide the results of an automatic analysis of children’s narratives that corresponds to the wide and narrow subject complexes proposed by Sakharny. Section 5 is for conclusions and prospects of the study.
2 Coherence and Cohesion Coherence and cohesion represent the two sides of a text as a linguistic sign. Coherence is defined as: “the way in which the content of connected speech or text hangs together, or is interpreted as hanging together, as distinct from that of random assemblages of sentences” [7]. It reflects the subject of a story, the logical structure of a narration and organizes the links between characters and events. The other side – cohesion – sets up the surface structure of a text bounding the sentences together by referencing the characters, some situational objects, time and space of the events. The topic-focus analysis of a text structure was proposed by Daneš in 1974 [8] and van Dijk in 1977 [9]. They both suggested that topics and focuses of different utterances link to each other and create a coherent text – a sequence of sentences. Van Dijk argued that topics function as the glue for the whole text and we can calculate the text topic from the topics of the sentences it consists of. According to Daneš, the so called functional perspective of a text is dynamic and develops within the text depending on speakers’ intentions and their interpretation of the denoted situation. He suggested three types of thematic progressions: a simple linear progression (the rheme of the previous utterance becomes the topic of the following one); a continuous theme progression (the theme remains the same for several consequent utterances), and a progression with derived themes that emerged from a ‘hypertheme’. All these progressions may combine and run into one another. Sakharny developed Daneš’s ideas in [10] trying to regard text coherence from the cognitive point of view. He suggested that the thematic progressions not only are a specific feature of a text, but also represent the way we think about the subject of a story. He described the so called wide and narrow subject complexes that he understood as the topic-focus structures of coherence. The surface topic-focus structure reflects the deep topic-focus structure of coherence, that is the structure of the speakers’
vision of the meaning of the narrative. The basic structure of each complex is a simple existential structure ‘there is X’, which is later included into a more complicated structure of X and its attributes. These complicated structures of different “X-s” and their attributes may interact with each other and form such subject complexes as a bush structure (one subject has many attributes), a chain structure (the attribute of a subject becomes the subject of the following structure), and combined structures. Unlike in the theory proposed by Daneš, Sakharny’s wide subject complexes reflect the cognitive structure of coherence and are not explicit in the text itself. The development of computer linguistics has risen the questions of an automatic analysis of coherence. Different researchers have analysed written corpora and proposed possible solutions [11–13]. However, transferring cohesive markers into coherent structures still remains problematic. Several attempts have been made in automatic analysis of child oral narratives, but the developers have chosen inductive methods and did not apply cohesive markers for narrative analysis [14]. There are few different ways to provide cohesion: referencing and coreferencing (by means of anaphors, pronouns or various lexical nominations and paraphrasing), syntactic structure changes (e.g., ellipsis or inversion), grammatical iterations. Kibrik attempted to study the text reference in discourse in [15]. He suggested a complex method of analysis that not only involves the verbal component of communication, but also considers gestures, mimics and cognitive mechanisms. The author argues that referencing is one of the universal cognitive mechanisms. It depends on such parameters as antecedent’s actualization and language typological features. For example, Kibrik considers Russian pronominal system to be difficult for unambiguous interpretation, as pronouns are usually the only source of reference information. Reference conflicts that may occur often enough and require a big set of resolution tools are nonetheless peripheral and normally can be easily solved by any communicant in natural situation. But this question becomes highly important if we try to create an automatic tool for parsing oral speech. The multi-layered structure of an oral narrative includes among others the phonetic level. Numerous studies have shown that the boundaries between semantic-syntactic units and phonetic ones do not necessarily coincide in casual speech [16, 17]. However, recent experimental data from casual Russian provides evidence that a listener tends to ignore “irregular” pausation giving preference to the semantic-syntactic relations within an utterance, intonation serving as a complementary source of information [18]. Thus, we assume that at least preliminary testing of the automatic topic-focus analysis can be performed excluding the phonetic level.
3 Data

The corpus "KONDUIT" (KOrpus Nepodgotovlennyh Detskih Ustnyh Izvlechennyh Tekstov – Corpus of Child Unprepared Elicited Oral Narratives; [19]) comprises 213 unprepared narratives (or quasi-narratives) elicited during a series of experiments with native Russian-speaking children aged 2;7–7;6. The children were divided into 5 age groups, and 3 different experimental designs were suggested depending on the cognitive development of the children [20].
The experiment with the youngest children (aged 2;7–3;6) was conducted in a game format. Two experiment assistants manipulated 4 glove puppets performing different actions that can be described using verbs of 14 different semantic classes, e.g. verbs of motion, verbs of communication, emotional verbs, verbs of perception, etc. The second group (children aged 3;6–4;6) had to retell a picture book, "Three Kittens" (by Vasily Suteev), consisting of 15 pictures and telling the story of three small kittens who try to catch a mouse, hunt a frog, and catch a fish, but each escapes, and the three upset kittens return home wet, tired and hungry. The three oldest age groups retold a cartoon about a kitten who makes a mess at home and goes for a walk. He meets rabbits, beavers and a bear-cub, but no one wants to play with him. All the characters (both in the picture book and in the cartoon) perform different, clearly identifiable actions that can be described using verbs of the same semantic classes that were expected in the experiment with the youngest age group. Thus, despite some differences in the experimental designs, the narratives elicited by the children of all 5 age groups may be compared for the use of the same verbs describing the same semantics and similar situations.

The experiment was carried out in accordance with the Declaration of Helsinki and the existing Russian and international regulations concerning ethics in research. The parents signed informed consents for their children to take part in the experiment.

All narratives were audio- and video-recorded. The orthographic annotation of all the recordings was performed by the experimenter who conducted the experiment with all the children. The corpus includes 25 689 tokens, or 5 763 utterances. Only the stories elicited by the children of the three oldest groups (4;7–5;6, 5;7–6;6, and 6;7–7;6) may be considered standard narratives with some coherence and cohesion features. All clauses and their topic-focus structures have been manually annotated. The principles of topic identification have been discussed in [21, 22]. We understand 'topic' and 'focus' as suggested by the Prague School linguists: the topic is the main subject of the utterance, while the focus is the information said about this subject. The results of the automatic topic-focus parsing have been compared to this manually annotated corpus.
4 Decision and Discussion

4.1 Parser
We developed a Topic-Focus Annotating Parser (TFAP) that operates on morphologically and syntactically annotated texts. Morphological analysis, which differentiates parts of speech and identifies the gender, case and number of nouns and the tense, person and number of verbs, etc., is performed by pymorphy2 [23], while syntactic analysis is performed by the Google SyntaxNet parser trained on the SynTagRus syntactic corpus [24]. TFAP works in three steps. The first step provides the annotation of semantic and syntactic roles and marks grammatical features that are important for the subsequent topic-focus analysis. Most nouns in the Nominative case are attributed as Agents, but if a noun is inanimate, it is annotated differently, with the most probable semantic role depending on
its most probable referent. At this stage, TFAP also collects all animate nouns in any case and with any syntactic role, as they may function as a topic later.

Referencing is done in the second step. If an argument is tagged as a pronoun in the Nominative case and its syntactic role is 'Subject', this argument requires an antecedent. If there is an argument in any case other than the Nominative but it is animate, it is marked as a possible referent for a future topic-focus structure. All other pronouns and inanimate nouns are marked as less probable referents and topics.

After these two preliminary steps, TFAP starts structuring the topic-focus schemes and linking them through the text. Each anaphoric word tagged 'ref' is linked to the nearest antecedent with the same set of grammatical features. Verbs are linked to animate nouns in the Nominative or to pronouns within the clause. If there is no animate subject in the Nominative within the clause or within the three previous clauses, the verb is linked to an inanimate noun in the Nominative within the clause. At this stage, TFAP also looks for clause-initial adverbs or adverbial expressions of time and place and, if there is a subject in the same clause, tags them as narrative topics. Every antecedent found is marked as a topic, while the rest of the clause is marked as the focus. Clauses united by the same topic or narrative topic are referred to as a single topic-focus scheme (cf. Fig. 1):
Fig. 1. Topic-focus scheme of a narrative elicited by a boy, 4;7 (this is kitten // playing / fooling // I have this cartoon // kitten is fooling throwing the balls all around the robs is rolling // rabbits are jumping with the rope // rabbits are playing with the rope / and this is beavers are building a house //// and this is kitten is now crying // a bear is riding a bicy… a scooter //// kitten is climbing down the tree).
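To make the three steps above more concrete, the following is a minimal, hypothetical sketch of the linking step in Python (the language of pymorphy2). It is not the authors' implementation: the token attributes (part of speech, case, animacy, agreement features) are assumed to be supplied by the morphological and syntactic front-end described earlier, and only the antecedent search within a three-clause window and the topic/focus marking are shown.

```python
# Hypothetical, simplified sketch of TFAP's linking step (step three).
# Tokens are assumed to be pre-annotated dictionaries produced by the
# morphological/syntactic front-end (pymorphy2 + SyntaxNet in the paper).

def agrees(pronoun, noun):
    """Crude agreement check on gender and number (an assumption)."""
    return (pronoun.get("gender") == noun.get("gender")
            and pronoun.get("number") == noun.get("number"))

def link_topics(clauses, window=3):
    """Assign a topic to each clause and link anaphors to antecedents.

    `clauses` is a list of clauses; each clause is a list of token dicts with
    keys such as 'pos', 'case', 'animate', 'gender', 'number', 'tag'.
    """
    schemes = []
    for i, clause in enumerate(clauses):
        topic = None
        # 1. Prefer an animate noun or a pronoun in the Nominative inside the clause.
        for tok in clause:
            if tok.get("case") == "nom" and (tok.get("animate") or tok.get("pos") == "PRON"):
                topic = tok
                break
        # 2. Anaphors tagged 'ref' are linked to the nearest agreeing antecedent
        #    in up to `window` previous clauses.
        if topic is not None and topic.get("tag") == "ref":
            for prev in reversed(clauses[max(0, i - window):i]):
                cands = [t for t in prev if t.get("pos") == "NOUN" and agrees(topic, t)]
                if cands:
                    topic["antecedent"] = cands[-1]
                    break
        # 3. Fall back to an inanimate Nominative noun if nothing animate was found.
        if topic is None:
            nom_nouns = [t for t in clause if t.get("pos") == "NOUN" and t.get("case") == "nom"]
            topic = nom_nouns[0] if nom_nouns else None
        focus = [t for t in clause if t is not topic]
        schemes.append({"topic": topic, "focus": focus})
    return schemes
```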
4.2 Results and Discussion
Table 1 presents the results of the topic-focus parsing of the narratives elicited by the children aged 4;7–7;6.
Table 1. Parsing results for the narratives elicited by the children aged 4;7–7;6 (* – this narrative was chosen as a sample narrative for TFAP training). Parsing-correctness columns give the % of all narratives in every age group.

Age | Narratives (total) | Clauses (total) | 100% | 90–99% | 80–89% | 50–79% | 0–49% | Mean length of utterance (Me) | Average correctness of clause parsing (%)
4;7–5;6 | 32 | 983 | 15.6 | 25 | 34.4 | 25 | 0 | 4.6 | 86
5;7–6;6 | 21 | 961 | 4.8 | 14.3 | 57.1 | 23.8 | 0 | 5.1 | 84
6;7–7;6 | 17 | 986 | 5.9* | 17.6 | 64.7 | 11.8 | 0 | 5.4 | 85
Parsing correctness for every narrative was estimated as the percentage of correctly parsed utterances. The results show the development of narrative skills in language acquisition. 40.6% of all narratives elicited by the children aged 4;7–5;6 have been parsed with at least 90% correctness, as children of this age produce quite primitive narratives, use mostly lexical nominations, and do not use pronouns or anaphors (cf. the narrative in Fig. 1 above). Their utterances are short, they do not use narrative topics, and they almost never switch between the situation of the cartoon and the situation in which the experiment takes place. All these features are typical of the narratives of children of this age [8, 10, 25, 26], and they simplify TFAP's work. At the same time, children between 4 and 5 years may confuse the gender of characters:

(1) kotik igraet v kubiki // ona vsyo razbrosala
    kitten-M.NOM.SG play-PRS.3SG in block-M.ACC.PL // she-F.3SG everything-N.ACC.SG throw-PST.F.SG
    'the kitten is playing with the blocks // it (she) has thrown everything away'
Or they may omit the lexical nominations of the characters and label only the actions, listing them like pearls on a string, without mentioning any characters:

(2) kotik i myachik / on dvizhetsya i rybku lovit /
    kitten-M.NOM.SG and ball-M.NOM.SG / he/it-M/N.NOM.SG move-PRS.3SG and fish-F.ACC.SG catch-PRS.3SG /
    eshche morgaet glazami / eshche razbrosaet /
    also wink-PRS.3SG eye-M.INS.PL / also throw-PRS/FUT.3SG away /
    a teper' v myachik igraet /
    and now with ball-M.ACC.SG play-PRS.3SG /
    i razrushil kukushku i klubki razbrosal
    and break-PST.M.SG cuckoo-F.ACC.SG and cob-M.ACC.SG throw-PST.M.SG away
    'the kitten and the ball / it moves and is catching a fish / is also winking with its eyes / is also throwing away / and now is playing with the ball / and has broken a cuckoo and has thrown away the cobs'
Another feature of Russian child syntax is the combination of subject omission and the preposing of an object (cf. (2) – rybku lovit 'is catching a fish', v myachik igraet 'is playing with a ball', klubki razbrosal 'has thrown the cobs away'). This inversion is possible but much less frequent in adult speech, and it is impossible to differentiate between the homonymous forms of the Nominative and Accusative Singular for inanimate masculine nouns. TFAP reveals all these deficiencies in child narratives, but they are specific only to the narratives elicited by the children of the youngest age group. Older children produce more sophisticated narratives, and the percentage of correctly annotated utterances decreases. This is caused by the following parsing deficiencies:

– the clauses are too complex, so the distance between the verb and its argument or between the anaphor and its antecedent may be too long;
– in Russian, the main character (the kitten) may be labelled with at least three different lexemes: kotyonok 'kitten' (masculine), kiska 'pussy' and koshka 'cat' (the latter two feminine), and the children often switched the gender of a lexeme and its anaphoric pronoun;
– the number of atypical topic-focus structures (narrative topics, thetic rhemes, cf. [27, 28]) increases as children acquire narrative skills: narratives require a specific word order and some rules of ellipsis that influence the syntactic structure of separate clauses;
– children may switch between the situation of the cartoon and the situation of the experiment; as a result, deictic pronouns appear and the distance between antecedents and their references within the situation of the cartoon increases.

Thus, the parsing errors in the narratives elicited by the children aged 4;7–5;6 are caused by the speakers' imperfection, while the texts are simpler and TFAP can easily parse them. By contrast, the parsing errors in the narratives elicited by the children aged 5;7–7;6
are caused by TFAP's imperfection, while the texts are more sophisticated and reflect all the specific features of spoken language. At the same time, the text topic-focus schemes suggested by TFAP represent the wide subject complexes proposed by Sakharny [10]. The structures of the narratives elicited by the younger children are much simpler than those of the narratives elicited by the older children (cf. Fig. 1, representing the structure of the narrative elicited by a boy aged 4;7, and Fig. 2, representing the structure of the narrative elicited by a girl aged 6;10).
Fig. 2. Topic-focus scheme of a narrative elicited by a girl aged 6;10, fragment (there / there is some house spinning // now there is a kitten who is picking the cubes // now the kitten / he / now the kitten has decided to play with a ball / and accidentally knocked the clocks down // yes the mommy-cat returned home / and saw that there is a mess at home / and and the kitten confessed that it / was him who made the mess / then the mommy-cat got a telephone call / and the mommy-cat took a jar and went to do her stuff / and the kitten stayed at home // and the kitten jumped out to the street and saw the rabbits who / were jumping with a rope / the kitten then the kitten went out to the street to play with the rabbits / the kitten ran up and wanted to jump but / a rabbit said that the kittens are not allowed / then the kitten saw the beavers who were building a house // the kitten saw and wanted / and also wanted / to help the beavers / but a beaver / but a beaver told to the kitten that kittens are not allowed to be builders / the kitten / felt // the kitten felt sad / and that’s why he left / and that’s why he left the rabbits and the beavers).
5 Conclusion and Future Plans

In this paper, we reported the results of an automatic topic-focus analysis of Russian children's narratives performed by the parser TFAP. The data allowed us to discuss both the cognitive aspects of cohesion and coherence acquisition and the deficiencies of the parser. The automatic annotation proved to be most successful for the texts of the youngest group of participants (4;7–5;6), as younger children normally describe separate events or even episodes using a single utterance of a simple structure and do not try to connect these events either in their minds or in their narratives. On the contrary, older children construct a coherent image of the whole episode in their minds, divide it into several connected events and represent this complex structure in their narratives. Thus, our results are consistent with the schemes proposed by Sakharny.

As we mentioned in Sect. 2, we decided to omit the analysis of phonetic information in the current study. However, further development of the automatic topic-focus analysis, as well as a psycholinguistic description of narrative processing in children, will definitely benefit from the inclusion of the phonetic level. Thus, we are now performing a manual acoustic-phonetic transcription of the Corpus based on the principles used in the Corpus of Transcribed Russian Oral Texts [17].

Acknowledgements. The work is supported by research grant number 16-04-50114 (dir. P. Eismont) from the Russian Foundation for Humanities and research grant number MК-6776.2018.6 (dir. E. Riekhakaynen) from the President of the Russian Federation.
References 1. Warner, N.: Methods for studying spontaneous speech. In: Cohn, A., Fougeron, C., Huffman, M. (eds.) Handbook of Laboratory Phonology, pp. 612–633. Oxford University Press, Oxford (2012) 2. De Beaugrande, R., Dressler, W.U.: Introduction to Text Linguistics. Longman, London, New York (1981) 3. Murzin, L.N., Stern, A.S.: Text and Its Perception. UGU Press, Sverdlovsk (1991). (in Russian) 4. Berman, R.A., Slobin, D.I.: Relating Events in Narrative: A Crosslinguistic Developmental Study. Lawrence Erlbaum, Hillsdale (1994) 5. Manhardt, J., Rescorla, L.: Oral narrative skills of late talkers at ages 8 and 9. Appl. Psycholinguist. 23, 1–21 (2002) 6. Frauenfelder, U.H., Peeters, G.: Lexical segmentation in TRACE: an exercise in simulation. In: Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives, pp. 51–86. MIT Press, Cambridge Mass (1990) 7. Matthews, P.H.: Oxford Concise Dictionary of Linguistics. Oxford University Press, Oxford (2003) 8. Daneš, F.: Functional sentence perspective and the organization of the text. In: Papers on Functional Sentence Perspective, pp. 106–128. Academia, Prague (1974) 9. Van Dijk, T.A.: Sentence Topic and Discourse Topic. http://www.discourses.org/ OldArticles/Sentence%20topic%20and%20discourse%20topic.pdf. Accessed 02 May 2018
10. Sakharny, L.V.: Topic-focus structure in text: basic notions (in Russian). Lang. Lang. Behav. 1, 7–16 (1998) 11. Hahn, U.: On Text Coherence Parsing. In: COLING, pp. 25–31 (1992) 12. Nikolenko, S.I., Koltcov, S., Koltsova, O.: Topic modelling for qualitative studies. J. Inf. Sci. 43(1), 88–102 (2017) 13. Ionov, M.I.: Automatic detection of the discourse status of a referent of a noun phrase. Rhema 4, 24–42 (2016). (in Russian) 14. Hassanali, Kh., Liu, Y., Solorio, Th. Coherence in child language narratives: a case study of annotation and automatic prediction of coherence. In: Third Workshop on Child, Computer and Interaction (WOCCI 2012) ISCA. http://www.isca-speech.org/archive/wocci_2012. Accessed 24 June 2018 15. Kibrik, A.: Reference in Discourse. Oxford University Press, Oxford (2011) 16. Kibrik, A.A., Podlesskaya, V.I. (eds.): Stories about Dreams: Corpus-based Study of Russian Spoken Discourse. Yazyki slavyanskikh kultur, Moscow (2009). (in Russian) 17. Nigmatulina, J., Raeva, O., Riechakajnen, E., Slepokurova, N., Vencov, A.: How to study spoken word recognition: evidence from Russian. In: Anstatt, T., Gattnar, A., Clasmeier, Ch. (eds.) Slavic Languages in Psycholinguistics: Chances and Challenges for Empirical and Experimental Research, pp. 175–190. Narr Verlag, Tuebingen (2016) 18. Raeva, O.V., Riekhakaynen, E.I.: Spontaneous Russian texts from a listener’s perspective. Soc. Psycholinguist. Res. 3, 67–70 (2015). (in Russian) 19. Eismont, P.M.: “KONDUIT”: Corpus of child oral narratives. In: Proceedings of the International Conference “Corpus Linguistics – 2017”, pp. 373–377. Saint-Petersburg State University, St. Petersburg (2017). (in Russian) 20. Ambridge, B., Rowland, C.F.: Experimental methods in studying child language acquisition. Wiley Interdisc. Rev. Cogn. Sci. 4(2), 149–168 (2013) 21. Kehler, A.: Discourse topics, sentence topics, and coherence. Theor. Linguist. 30, 227–240 (2004) 22. Götze, M., Weskott, Th., Endriss, C., Fiedler, I., Hinterwimmer, St., Petrova, Sv., Schwarz, A., Skopeteas, St., Stoel, R.: Information structure. In: Dipper, St., Götze, M., Skopeteas, St. (eds.) Interdisciplinary Studies on Information Structure, vol. 7 of Working papers of the SFB 632, pp. 147–187. Universitätsverlag, Potsdam (2007) 23. Korobov, M.: Morphological Analyzer and Generator for Russian and Ukrainian Languages. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, Valeri G. (eds.) AIST 2015. CCIS, vol. 542, pp. 320–332. Springer, Cham (2015). https://doi.org/10. 1007/978-3-319-26123-2_31 24. Dyatchenko, P.V. et al.: Current state of the deeply annotated corpus of Russian texts (SynTagRus) (in Russian). In: Russian National Corpus: 10 Years of the Project. Proceedings of the V.V. Vinogradov Russian Language Institute, pp. 272–299. Russian Language Institute, Moscow (2015) 25. Bamberg, M.: The Acquisition of Narratives: Learning to use Language. Mouton de Gruyter, Berlin (1987) 26. Van Dam, F.J.: Development of Cohesion in Normal Children’s Narratives. Project report. https://dspace.library.uu.nl/bitstream/handle/1874/180044/Development%20of%20cohesion %20in%20normal%20children%27s%20narratives%20research%20report.pdf?sequence= 1&isAllowed=y. Accessed 02 May 2018 27. Paducheva, E.V.: Communicative perspective interpretation: basic structures and linearaccentual transformations. Comput. Linguist. Intellect. Technol. 11(18), 522–535 (2012) 28. Zimmerling, A.V.: Thetic sentences: semantics and derivation (in Russian). In: Lyutikova, E. 
A., Zimmerling, A.V., Konoshenko, M.B. (eds.) Typology of Morphosyntactic Parameters, vol. 1, pp. 223–252, M.A. Sholokhov MGGU, Moscow (2014)
Context Modeling for Cross-Corpus Dimensional Acoustic Emotion Recognition: Challenges and Mixup

Dmitrii Fedotov1(B), Heysem Kaya2, and Alexey Karpov3

1 Institute of Communications Engineering, Ulm University, Ulm, Germany
[email protected]
2 Department of Computer Engineering, Tekirdağ Namık Kemal University, Çorlu, Turkey
[email protected]
3 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
[email protected]
Abstract. Recently, the focus of research in the field of affective computing has shifted to spontaneous interactions and time-continuous annotations. Such data broaden the possibilities for real-world emotion recognition in the wild, but also introduce new challenges. Affective computing is a research area where data collection is neither trivial nor cheap; therefore, it is rational to use all the data available. However, due to the subjective nature of emotions and to differences in cultural and linguistic features as well as environmental conditions, combining affective speech data is not a straightforward process. In this paper, we analyze the difficulties of automatic emotion recognition in a time-continuous, dimensional scenario using data from the RECOLA, SEMAINE and CreativeIT databases. We propose to employ a simple but effective strategy called "mixup" to overcome the gap in feature-target and target-target covariance structures across corpora. We showcase the performance of our system in three different cross-corpus experimental setups: single-corpus training, two-corpora training and training on augmented (mixed-up) data. Findings show that the prediction behavior of trained models heavily depends on the covariance structure of the training corpus, and that mixup is very effective in improving the cross-corpus acoustic emotion recognition performance of context-dependent LSTM models.
Keywords: Cross-corpus emotion recognition · Time-continuous emotion recognition · Data augmentation
1 Introduction
Automatic affect recognition is a popular research topic, which brings researchers from psychological and technical areas together [19,24]. It can be beneficial in
a variety of applications in the areas of human-computer interaction (HCI) and human-human interaction (HHI). An emotional component in an HCI system allows it to perceive the emotional state of the speaker and adjust its response to increase the quality of interaction.

Although emotion recognition has been a hot topic for a long period and a large amount of research has been conducted, the problem is far from being solved. Less than two decades ago, emotion recognition left laboratory conditions and faced real-world data and problems, such as cultural, linguistic and environmental differences [10,22]. The combination of different corpora, which could solve the problem of data shortage, cannot be applied in a straightforward manner in the context of acoustic emotion recognition. The main difficulty lies in the subjective nature of emotions, resulting in diverse and controversial annotations. Despite these issues, data combination and augmentation may lead to a dramatic increase in the performance of affect recognition systems.

In this paper, we deal with the problems of cross-corpus time-continuous dimensional emotion recognition and propose ways to overcome them. We observe that pure cross-corpus emotion recognition may not work properly if the data have different label distributions. We also show that this problem can be partially solved by combining and augmenting data.

This paper is structured as follows: we introduce the related work in Sect. 2; provide information on the corpora used, data preprocessing techniques and methodology in Sect. 3; present the results of different cross-corpus emotion recognition settings in Sect. 4; and conclude the paper in Sect. 5.
2 Related Work
Most of the previous research on emotion recognition dealt with acted, categorically labeled corpora, providing information at the utterance level [1,7,11]. Continuously annotated databases of spontaneous interactions provide more naturalistic data, but also introduce several challenges, such as diversity in annotations [16,17], reaction lags between the actual appearance of an emotion and its annotation [12], and the amount of contextual information the system needs [5,6].

The problem of cross-corpus emotion recognition has been investigated by several research groups. Schuller et al. studied this problem with acted, categorically annotated databases [22]. The performance of the proposed methodology was poor when differences in environmental conditions were present. For some of the emotions, the classification accuracy of the Support Vector Machine (SVM) based model used was below chance level. The authors also showed that the normalization strategy plays a crucial role in the cross-corpus scenario and concluded that speaker-level normalization leads to the best performance compared to other approaches. The study of the normalization effect on cross-corpus emotion recognition performance was extended, and cascaded normalization techniques, which comprise speaker, value and instance level normalization, were recently introduced and tested in [9]. The proposed approach achieved increased performance, reducing cross-corpus differences with respect to suprasegmental acoustic features.
A recent study focused on cross-corpus recognition of self-assessed affect. Cross-corpus predictions of affective primitives were used as data for extracting functionals and then combined with the predictions of other sub-systems to improve performance [8].

These studies provided a starting point for the paper in hand, and a speaker-level normalization technique was used. Cross-corpus emotion recognition with time-continuous data is poorly studied, which served as the motivation to conduct our research.
3 Data and Methodology
Three corpora of spontaneous, emotionally-rich interactions are used in this study: RECOLA [20], SEMAINE [13] and CreativeIT [14]. All corpora are annotated at frame level using two affective scales: arousal (activation) and valence (positivity). A brief overview of the used corpora is presented in Table 1.

Table 1. Overview of used corpora.

Corpus | Duration (min) | Recordings | Participants | Gender (m/f) | Age μ (σ) | Annotation rate (Hz)
RECOLA | 115 | 23 | 23 | 10/13 | 21.4 (2.0) | 25
SEMAINE | 435 | 24 | 20 | 8/12 | 30.4 (10.4) | 50
CreativeIT | 132 | 31 | 15 | 7/8 | N/A | 60
3.1 RECOLA
The RECOLA (Remote COLlaborative and Affective interactions) database was collected during spontaneous dyadic interactions between people solving a cooperative problem. Of the 46 people participating in the database collection, 34 gave their consent to make the data publicly available, and recordings from 23 users are included in the current version of the database shared with the research community. Each recording has a duration of five minutes, yielding 115 min of speech in total. Participants are aged between 18 and 25 years and have different mother tongues, although all spoke French during the database collection process: 17 of them have French as a mother tongue, 3 Italian and 3 German. The corpus was recorded in four modalities: audio, video, electrocardiogram and electro-dermal activity. Recordings were continuously annotated by 6 equally gender-distributed annotators via the ANNEMO (ANNotating EMOtions) annotation tool [20] on two affective scales (arousal and valence) and five social behavior scales (agreement, dominance, engagement, performance, rapport).
3.2 SEMAINE
The SEMAINE (Sustained Emotionally coloured Machine-human Interaction using Nonverbal Expression) database was collected within a project whose aim was to build a system that could engage a person in a sustained conversation with a Sensitive Artificial Listener (SAL) agent. Three scenarios were used in this project: Solid SAL, where the agent's role was played by a real human operator; Semi-Automatic SAL, where the system spoke phrases chosen by a human operator from a pre-defined list; and Automatic SAL, where the system chose phrases and non-verbal signals by itself. Only data collected from users (not operators) in the Solid SAL scenario were used in this study.

The corpus consists of 24 recordings in English from 20 speakers, whose ages range from 20 to 58 years. Recordings have durations from 11 to 30 min, resulting in a total corpus length of 435 min. The corpus was recorded in two modalities, audio and video, and annotated via the FeelTrace annotation tool [3] in different dimensions and emotional labels: valence, arousal, power, anticipation, intensity, fear, anger, happiness, sadness, disgust, contempt and amusement.
3.3 CreativeIT
The CreativeIT database was collected to serve as a multidisciplinary resource for theatrical performance improvement and emotion recognition. It was recorded by actors coordinated by a director with an expert qualification in Active Analysis, introduced by Stanislavsky. Two scenarios were used during the database collection: a two-sentence exercise, where actors were permitted to use only one predefined phrase each; and paraphrase of a script, where actors followed a general script without any constraints on words and expressions. Only the paraphrase part of the corpus was used in this study, as it meets the conditions of spontaneous interaction most closely.

The selected part of the corpus consists of 31 recordings in English from 15 participants. The duration of the recordings ranges from 2 to 7 min, with a total of 132 min. In addition to audio data from close-up microphones, motion capture data are available for each recording, representing the body language of the actors during the interactions. Recordings were annotated via the FeelTrace annotation tool [3] by three groups of evaluators (theater experts, actors and a naive audience) in different dimensional groups, such as emotional descriptors (arousal, valence) and theatrical performance ratings (naturalness, creativity).
3.4 Features and Labels
For cross-corpus emotion recognition, the audio modality was used in this study, as it is present in each corpus described above. Audio features were extracted with the openSMILE tool [4]. They consist of 65 low-level descriptors (LLDs) and their first order derivatives [21]. The feature step size was set to 0.01 s, resulting in a feature extraction rate of 100 Hz. As the corpora have different annotation rates (see Table 1), they were brought to the same data frequency to be able to share
the same prediction models. The lowest annotation frequency of 25 Hz, present in RECOLA, was used to subsample the other two corpora. The extracted features were speaker-level z-normalized, as this was previously shown to yield better performance in cross-corpus experiments [9]. Annotations of the two main affective dimensions – arousal and valence – were used as labels in this study. The distributions of the labels for the corpora described above are presented in Fig. 1.
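A minimal sketch of this speaker-level z-normalization is given below; it is a hypothetical illustration, not the authors' code, and the closing comment notes how the subsampling to 25 Hz could be approximated, since the exact resampling scheme is not detailed in the text.

```python
import numpy as np

def speaker_znorm(features, speaker_ids):
    """Speaker-level z-normalization: each speaker's features are standardized
    with that speaker's own mean and standard deviation.

    features: (n_frames, n_feats) array; speaker_ids: (n_frames,) array of IDs.
    """
    out = np.empty_like(features, dtype=float)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = features[idx].mean(axis=0)
        sd = features[idx].std(axis=0)
        sd[sd == 0] = 1.0                      # guard against constant features
        out[idx] = (features[idx] - mu) / sd
    return out

# Bringing the 50/60 Hz annotations down to RECOLA's 25 Hz rate could be done,
# e.g., by nearest-neighbour resampling of the annotation timestamps
# (an assumption; the exact resampling scheme is not specified in the paper).
```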
Fig. 1. Label distributions in three emotional corpora: (a) arousal, (b) valence.
The label distribution of RECOLA is narrower in both affective dimensions than those of the remaining corpora. This may be a result of its purely spontaneous nature. Although all corpora used in this study are designed to be naturalistic, SEMAINE can simulate four personality prototypes, which affect the operators' behavior and hence the user. Even though the actors participating in the collection of the CreativeIT database were not restricted lexically in choosing the words for the interaction, they had to follow the general scenario and their role. These conditions could have led to a more idiosyncratic nature of emotions in both SEMAINE and CreativeIT.
3.5 Modeling
In this study, a recurrent neural network with long short-term memory (LSTM-RNN) was used for context modeling. The model comprises two layers with 80 and 60 neurons, respectively, with the ReLU activation function [15], each followed by a dropout layer with p = 0.3 [23]. The models were optimized by root mean square propagation (RMSprop) using the concordance correlation coefficient as a metric function. We use the LSTM implementation provided by the Keras toolkit [2].

Our recent study has revealed that the performance of time-continuous emotion recognition strongly depends on the amount of acoustic context used in recurrent neural network (RNN) models, regardless of the number of time steps [5]. The required amount of context can be set by a combination of two parameters: the number of time steps fed into the RNN model and a sparsing coefficient, which is responsible for decreasing the amount of data in each sample by skipping
frames. Regardless of the sparsing coefficient, the step size between samples is one frame; hence there is no loss in the total amount of information. The amount of context in seconds is then represented as:

C = (SC × TW) / FR,    (1)
where SC is the sparsing coefficient that determines the number of frames to skip, TW is the time window size and FR is the frame rate in Hz. Based on our previous research [5], a context size of 7.68 s, obtained from the combination of SC = 12 and TW = 16, was selected for this study. The same sparsing procedure applies to the respective labels. Sequence-to-sequence modeling is used in this study; thus, the features of the TW previous frames were used to predict the corresponding labels for these frames. After the prediction phase, the label values obtained for the same frame at different time steps were averaged to smooth the final prediction.
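The following sketch illustrates this windowing and smoothing scheme for one recording. It is an assumption-based reconstruction (with SC = 12, TW = 16 and FR = 25 Hz, i.e. C = 7.68 s as in Eq. (1)), not the authors' code.

```python
import numpy as np

SC, TW, FR = 12, 16, 25          # sparsing coefficient, time window, frame rate (Hz)
context_sec = SC * TW / FR       # Eq. (1): 12 * 16 / 25 = 7.68 s

def make_sequences(features, labels, sc=SC, tw=TW):
    """Build sparsed (tw, n_feats) samples with a step of one frame.

    features: (n_frames, n_feats) array, labels: (n_frames,) array.
    Each sample takes every sc-th frame from the preceding sc*tw frames.
    """
    span = sc * tw
    X, Y, idx = [], [], []
    for end in range(span, len(features) + 1):
        frames = np.arange(end - span, end, sc)   # tw sparsed frame indices
        X.append(features[frames])
        Y.append(labels[frames])
        idx.append(frames)                        # remember which frames are predicted
    return np.stack(X), np.stack(Y), idx

def smooth_predictions(pred_seqs, idx, n_frames):
    """Average the predictions made for the same frame at different time steps."""
    acc = np.zeros(n_frames)
    cnt = np.zeros(n_frames)
    for pred, frames in zip(pred_seqs, idx):
        acc[frames] += pred
        cnt[frames] += 1
    cnt[cnt == 0] = 1
    return acc / cnt
```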
3.6 Mixup for Data Augmentation and Corpus Adaptation
To combine data from different corpora, a recently introduced methodology called mixup was used in this study [25]. mixup is a data augmentation technique that constructs virtual training examples from existing ones, using weights drawn from a Beta distribution to regulate their contribution to the synthetic instance:

x_new = λ x_i + (1 − λ) x_j,    (2)
y_new = λ y_i + (1 − λ) y_j,    (3)
where λ ∼ Beta(α, α), α is a hyper-parameter of the Beta distribution, x_i, x_j are feature vectors, and y_i, y_j are label values/vectors. This kind of data augmentation encourages the model to behave more linearly in between training examples, which can be useful for cross-corpus learning. In this study, the feature vectors x_i, x_j and the corresponding labels y_i, y_j were taken from two different corpora. To create different sets of augmented data, the hyper-parameter α of the mixup routine was varied (see Fig. 2). Three values were tested: α = 0.1, which provides slight changes to the original data and a minor contribution of the second corpus; α = 1, which provides a uniformly distributed level of contribution of both corpora to the augmented data; and α = 10, which creates most examples in the middle of the feature-label space between the two samples. To preserve the sequential nature of the data, streams were mixed up at the recording level with consecutive frames.
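Below is a minimal sketch of this cross-corpus mixup following Eqs. (2)–(3). The single λ per recording pair and the truncation to the shorter stream are assumptions about how "recording level" mixing with consecutive frames could be realized; they are not the authors' exact procedure.

```python
import numpy as np

def mixup_recordings(x_a, y_a, x_b, y_b, alpha=1.0, seed=None):
    """Mix two frame-synchronous feature/label streams from different corpora
    into one synthetic stream, following Eqs. (2)-(3).

    x_a, x_b: (n_frames, n_feats) arrays; y_a, y_b: (n_frames, n_targets) arrays.
    A single lambda per recording pair keeps consecutive frames consistent
    (an assumption matching the recording-level mixing described above).
    """
    rng = np.random.RandomState(seed)
    n = min(len(x_a), len(x_b))                 # truncate to the shorter stream
    lam = rng.beta(alpha, alpha)
    x_new = lam * x_a[:n] + (1.0 - lam) * x_b[:n]
    y_new = lam * y_a[:n] + (1.0 - lam) * y_b[:n]
    return x_new, y_new

# alpha controls the Beta(alpha, alpha) distribution of lambda:
#   alpha = 0.1 -> lambdas near 0 or 1 (one corpus dominates),
#   alpha = 1   -> uniformly distributed mixing levels,
#   alpha = 10  -> lambdas concentrated around 0.5.
```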
Fig. 2. Beta-distribution with three different values of parameter α: (a) α = 0.1, (b) α = 1, (c) α = 10.

4 Experimental Results

In this paper, the problem of cross-corpus multi-dimensional emotion recognition is considered. To study the issues and particularities of time-continuous
and multidimensional emotion recognition, three experimental setups were used: single-corpus training, two-corpora training and training on augmented data. The performance of cross-corpus prediction was estimated using Pearson's correlation coefficient (ρ).
4.1 Single-Corpus Training
The first problem definition was to predict values on an unseen corpus using a model trained on a single corpus. Two models (for arousal and valence) were trained on all data available for one corpus for up to 5 epochs; they were then used to generate predictions for the different corpora, including the training corpus itself (to show the ground-truth label distributions). Scatter plots of the predictions in the single-corpus training setting are presented in Fig. 3.
Fig. 3. Single-corpus training (x-axis – valence, y-axis – arousal). Scatter plots of predictions for each training corpus (rows: RECOLA, SEMAINE, CreativeIT) on each test corpus (columns: RECOLA, SEMAINE, CreativeIT); both axes range from −1.0 to 1.0.
Table 2. Pearson correlation scores (arousal/valence) for single-corpus training.

Train on \ Test on | RECOLA | SEMAINE | CreativeIT
RECOLA | 0.923/0.890 | 0.375/0.223 | 0.337/−0.024
SEMAINE | 0.533/0.170 | 0.821/0.750 | 0.322/−0.065
CreativeIT | −0.027/0.009 | 0.306/−0.013 | 0.953/0.952
The label distributions of each corpus can be seen in the self-prediction panels (main diagonal). Figure 3 shows that the models predict only within the limits of their own annotation distributions and exhibit the same tendencies regardless of the test data. This results in low cross-corpus prediction performance, in some cases even leading to a negative correlation (see Table 2). Negative correlations may also be attributed to the use of different annotation tools. The ANNEMO tool has two separate bars for arousal and valence that are manipulated by the user independently. The FeelTrace toolkit, however, provides a two-dimensional emotion representation with basic emotions displayed on the graph, which in some cases (e.g. for "afraid") are placed conversely to other research [18].
4.2 Multi-corpus Training
The second research problem was to predict affect primitives on an unseen corpus using a model trained on the two remaining corpora. The other experimental parameters were kept the same as in the single-corpus training setting. We refer to this multi-corpus training scheme as "combining". The third research problem was to predict arousal and valence on one of the corpora using a model trained on fully synthetic data generated from the remaining corpora with the mixup routine. Comparative multi-corpus training results with the combining and mixup strategies are presented in Table 3, where improved performance of multi-corpus training over the best single-corpus training performance on a target corpus is shown in bold.

Table 3. Pearson correlation scores (arousal/valence) for leave-one-corpus-out training results.
Train on | Test on | Combined | Mixed up (best α)
RECOLA + CreativeIT | SEMAINE | 0.359/0.050 | 0.368 (1)/−0.012 (10)
RECOLA + SEMAINE | CreativeIT | 0.435/−0.016 | 0.431 (1)/−0.041 (0.1)
CreativeIT + SEMAINE | RECOLA | 0.222/0.149 | 0.695 (1)/0.294 (10)
Compared to single-corpus training, the combination of data results in an approximate averaging of the performances of the two corpora used for training. Only a
combination of SEMAINE and RECOLA provides better results for CreativeIT as the test corpus, in the arousal dimension. Mixup-based data augmentation allows the model to benefit more from the differences between the databases, creating synthetic samples that yield a model with higher generalization ability. Thus, mixup dramatically improves over single-corpus training on two corpora, and causes only a relatively slight performance decrease (from 0.375 to 0.368) in the SEMAINE arousal dimension. The advantage of using mixup over simple combination is seen clearly on the RECOLA corpus: while the combining approach markedly underperforms the single-corpus performance, mixup improves it in both the arousal and valence dimensions.
5 Conclusions and Future Work
In this paper, we studied the problems of time-continuous, multidimensional, cross-corpus emotion recognition. In addition to the feature distribution problem, which is present in other cross-corpus settings and can be partially solved by speaker-level normalization, the dimensional approach introduces the challenge of different label distributions. This can be caused by the initial database collection scenario, different annotation software or people's perception of emotions. Nevertheless, it may serve as a limiting factor for the system: it may not let the system predict outside the originally trained distribution and may even result in converse behavior.

In future work, a cross-task approach will be introduced to the current research to increase the coverage of the arousal-valence space by using corpora with categorical annotation. The question of mapping emotion labels between corpora is still poorly studied, but an effective approach may increase the amount of data available for different experimental settings, which will have a positive effect on the performance of the emotion recognition system.

Acknowledgments. This research is supported by the Russian Science Foundation (project No. 18-11-00145).
References 1. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Ninth European Conference on Speech Communication and Technology (2005) 2. Chollet, F., et al.: Keras (2015). https://keras.io 3. Cowie, R., Douglas-Cowie, E., Savvidou*, S., McMahon, E., Sawey, M., Schr¨ oder, M.: ‘FEELTRACE’: An instrument for recording perceived emotion in real time. In: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion (2000) 4. Eyben, F., W¨ ollmer, M., Schuller, B.: Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. ACM (2010) 5. Fedotov, D., Ivanko, D., Sidorov, M., Minker, W.: Contextual dependencies in timecontinuous multidimensional affect recognition. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC) (2018)
6. Gunes, H., Pantic, M.: Automatic, dimensional and continuous emotion recognition. Int. J. Synth. Emotions 1(1), 68–99 (2010) 7. Haq, S., Jackson, P.J.: Multimodal emotion recognition. Machine audition: principles, algorithms and systems, pp. 398–423 (2010) 8. Kaya, H., Fedotov, D., Ye¸silkanat, A., Verkholyak, O., Zhang, Y., Karpov, A.: LSTM based cross-corpus and cross-task acoustic emotion recognition. In: INTERSPEECH 2018. ISCA (2018, in press) 9. Kaya, H., Karpov, A.A.: Efficient and effective strategies for cross-corpus acoustic emotion recognition. Neurocomputing 275, 1028–1034 (2018) 10. Lim, N.: Cultural differences in emotion: differences in emotional arousal level between the east and the west. Integr. Med. Res. 5(2), 105–109 (2016) 11. Makarova, V., Petrushin, V.A.: RUSLANA: A database of Russian emotional utterances. In: Seventh International Conference on Spoken Language Processing (2002) 12. Mariooryad, S., Busso, C.: Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In: Affective Computing and Intelligent Interaction (ACII), pp. 85–90. IEEE (2013) 13. McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schroder, M.: The SEMAINE database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012) 14. Metallinou, A., Lee, C.C., Busso, C., Carnicke, S., Narayanan, S.: The USC CreativeIT database: a multimodal database of theatrical improvisation. In: Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, p. 55 (2010) 15. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010) 16. Nicolaou, M.A., Gunes, H., Pantic, M.: Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In: Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality. German Research Center for AI (DFKI) (2010) 17. Nicolle, J., Rapp, V., Bailly, K., Prevost, L., Chetouani, M.: Robust continuous prediction of human emotions using multiscale dynamic cues. In: Proceedings of the ACM International Conference on Multimodal Interaction, pp. 501–508 (2012) 18. Paltoglou, G., Thelwall, M.: Seeing stars of valence and arousal in blog posts. IEEE Trans. Affect. Comput. 4(1), 116–123 (2013) 19. Petta, P., Pelachaud, C., Cowie, R.: Emotion-Oriented Systems: The HUMAINE Handbook. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-64215184-2 20. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013) 21. Schuller, B., Steidl, S., Batliner, A., Epps, J., Eyben, F., Ringeval, F., Marchi, E., Zhang, Y.: The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load. In: Fifteenth Annual Conference of the International Speech Communication Association (2014) 22. Schuller, B., Vlasenko, B., Eyben, F., Wollmer, M., Stuhlsatz, A., Wendemuth, A., Rigoll, G.: Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput. 1(2), 119–131 (2010)
23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) 24. Valstar, M., Gratch, J., Schuller, B., Ringeval, F., Lalanne, D., Torres Torres, M., Scherer, S., Stratou, G., Cowie, R., Pantic, M.: Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10. ACM (2016) 25. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Functional Mapping of Inner Speech Areas: A Preliminary Study with Portuguese Speakers

Carlos Ferreira1,2,10, Bruno Direito1, Alexandre Sayal1,2, Marco Simões2,3,4, Inês Cadório5,6, Paula Martins5,7,8, Marisa Lousada5,6, Daniela Figueiredo5,6, Miguel Castelo-Branco2,3, and António Teixeira8,9(B)

1 Institute of Nuclear Sciences Applied to Health, University of Coimbra, Coimbra, Portugal
[email protected], [email protected]
2 CIBIT Coimbra Institute for Biomedical Imaging and Translational Research, ICNAS, University of Coimbra, Coimbra, Portugal
3 Faculty of Medicine, University of Coimbra, Coimbra, Portugal
4 Center for Informatics and Systems, University of Coimbra, Coimbra, Portugal
5 School of Health Sciences, University of Aveiro, Aveiro, Portugal
6 Center for Health Technology and Services Research, University of Aveiro, Aveiro, Portugal
7 Institute of Biomedicine, University of Aveiro, Aveiro, Portugal
8 Institute of Electronics and Telematics Engineering of Aveiro (IEETA), Aveiro, Portugal
9 Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
[email protected]
10 Perspectum Diagnostics, Oxford, UK
Abstract. Inner speech can be defined as the act of talking silently with ourselves. Several studies have aimed to understand how this process is related to speech organization and language. Despite the advances, some results are still contradictory. Importantly, language dependency is scarcely studied. For this first fMRI study of inner speech with Portuguese native speakers, we selected a confrontation naming task consisting of 40 black and white line drawings. Five healthy participants were instructed to name the visually presented image in inner and in overt speech. fMRI data analysis considering the proposed inner speech paradigm identified several brain areas, such as the left inferior frontal gyrus, including Broca's area, the supplementary motor area, the precentral gyrus and the left middle temporal gyrus, including Wernicke's area. Our results also show more pronounced bilateral activations during the overt speech task when compared to inner speech, suggesting that inner and overt speech activate similar areas but that stronger activation is found in the latter. However, this difference stems in particular from significant activation differences in the right precentral gyrus and middle temporal gyrus.
Keywords: Inner speech · Overt speech · fMRI · Portuguese

1 Introduction
Inner speech is defined as the act of talking to ourselves silently [6,13,15]. Several studies implicate inner speech in memory tasks, reading, comprehension, consciousness, inner thought (self-reflection tasks) [6,14] and prospective thought [16]. According to the literature, two levels of inner speech can be defined: one (more abstract) designated the "language of mind", where the syntax is not fully structured and the semantics is more personal and subjective; the other level is more concrete, and phonological and phonetic components can be present [6]. Aside from these two features intrinsic to inner speech, there is still a lack of understanding of how inner speech is related to speech organization and language. To that end, recent work has been developed to better understand the relation between inner and overt speech and their correspondence to language pathways. Despite all previous efforts, there is still a lack of consensus regarding the relation between inner and overt speech [6,18]. Some of the factors that contribute to this are the variability of the paradigms used to explore inner and overt speech, the fact that some studies did not compare inner and overt speech, and the fact that others did not monitor participants' performance [6,18].

To assess their neural underpinnings, different methods such as Positron Emission Tomography (PET), electroencephalography (EEG), Transcranial Magnetic Stimulation (TMS) and Functional Magnetic Resonance Imaging (fMRI) can be used. Recent advances in the field of Magnetic Resonance Imaging (MRI), combining optimized spatial and improved temporal resolution with multivariate supervised learning methods (allowing assessments in real time), have established this technique as one of the most important for understanding brain mechanisms. The fact that it does not use ionizing radiation (as PET imaging does) also represents a significant advantage of fMRI for assessing brain function. fMRI uses the contrast between oxygenated and deoxygenated blood, the blood-oxygenation-level-dependent (BOLD) effect, which is based on the coupling between the hemodynamic response and neuronal activity. Currently, fMRI using the BOLD effect is one of the preferred methods to map neuronal activity [26]. High-field MRI scanners are being used to increase the signal-to-noise ratio, ultimately improving the ability to map brain function based on the BOLD signal [10,26].

Recent studies have used fMRI to understand and identify the brain areas involved in inner speech. Areas such as the left inferior frontal gyrus (IFG) (including Broca's area), Wernicke's area, the right temporal cortex, the supplementary motor area (SMA), the insula, the right superior parietal lobule (SPL) and the right superior cerebellar cortex were found to be involved in inner speech [6,9,12,15]. Geva [6] mentions that structural connectivity patterns near the supramarginal gyrus (SMG) (implicated in the dorsal pathway of language) are predictive of internal speech.
In a critical review, it is mentioned that planning without speech production and articulation is supported by connections between the prefrontal cortex and the left IFG (Broca's area) [9]. The existence of projections between areas related to speech production and the auditory cortex is also stated to be relevant for the verbal self-monitoring of internal speech [9]. It is also mentioned that the nature of inner speech is supported by connections between frontal and temporal regions, which inform the areas related to language perception of the self-generated nature of the verbal output.

To map the areas related to inner speech, several paradigms are being used. One example [19] analyzes the relation between frontal and temporal activity, instructing the participants to say the same word (word repetition task) at different time points – every second or every 4 s (conditions fast vs. slow), and every second, every 2 s or every 4 s (conditions fast vs. medium vs. slow). The moment when the participants had to perform the task was indicated by a visual cue [19]. Another example mapped inner speech during a working memory task, where the authors exploited a storage condition and a manipulation condition with sub-vocal reproduction of letters [12]. This paradigm allowed the identification of active brain areas related to working memory during an inner speech task. Paradigms that include letter or object naming, animal name generation, verb generation, reading, rhyme judgement, counting or semantic fluency tasks are also used to assess inner-speech-related brain areas [6,18].

In the present study, we focus on the optimization of a paradigm that can easily be used to study inner and overt speech and the possible relation between the areas recruited by both processes. We use a confrontation naming task to evaluate the variability/differences between both speech mechanisms and try to map areas that could be related only to pure inner speech. We also want to assess the feasibility of mapping inner-speech-related areas when performing a language task in the context of the European Portuguese language.

Paper Structure: The paper is structured as follows: this brief introductory section presents related work; Sect. 2 details the methods of fMRI data acquisition (including the stimulation protocol and MR parameters) and the tools used for image processing and analysis; Sect. 3 provides the most relevant results obtained so far; in Sect. 4 we discuss the results, comparing our findings with the published literature; finally, the conclusions that can be drawn are presented.
2 Methods
The study consisted of the recording and analysis of fMRI data while native speakers of Portuguese performed inner and overt speech tasks in response to visual stimuli.

Participants: Five healthy volunteers, all native Portuguese speakers (mean age: 22.2 years; 3 males), were enrolled in this study. All participants had normal or corrected-to-normal vision and no history of neurological disorders.
The Edinburgh handedness test was applied to the participants to ensure they were all right-handed (mean 92% right), and they all declared Portuguese as their native language. The study was approved by the Ethics Commission of the Faculty of Medicine of the University of Coimbra and was conducted in accordance with the Declaration of Helsinki. All subjects provided written informed consent to participate in the study.

Data Collection: The data were collected using a Siemens Magnetom Trio 3 T scanner (Erlangen, Germany) with a 12-channel head coil. Anatomical images were acquired using a sagittal T1 3D MPRAGE sequence with the following parameters: TR = 2530 ms; TE = 3.42 ms; TI = 1100 ms; flip angle = 7°; 176 slices; matrix size 256 × 256; voxel size 1 × 1 × 1 mm. After the anatomical scan, functional maps were obtained using axial gradient echo-planar imaging BOLD sequences parallel to the bi-commissural plane with the following parameters: TR = 3000 ms; TE = 30 ms; 40 slices; matrix size 70 × 70; voxel size 3 × 3 × 3 mm. Visual stimuli were presented on a NordicNeuroLab (Bergen, Norway) LCD monitor with a resolution of 1920 × 1080 pixels and a refresh rate of 60 Hz.

Stimulation Protocol: The experimental protocol consisted of a picture naming task – inner and overt speech – with 40 black and white line drawings selected from the Snodgrass & Vanderwart corpus [20] (Fig. 1). Black and white line drawings were preferred over colored pictures because of their simplicity. Additionally, ambiguous pictures that could elicit more than one target word (e.g. a bottle with water) were excluded from the task. The inner and overt speech runs consisted of a block-design experiment with nine rest blocks of 15 s and 8 task blocks of 30 s, where each image was presented for 3 s, with 10 images per block and two repetitions per image in the run. Each run had a total duration of 125 volumes (Fig. 1). In the baseline condition, the participants were instructed to focus on the fixation cross presented. During the task condition, each participant was instructed to name the object silently in the inner speech run and overtly in the overt speech run.

Data Analysis: Preprocessing and analysis were conducted using BrainVoyager QX 2.8 (Brain Innovation, Maastricht, Netherlands). First, individual functional data were analyzed in order to assess data quality (e.g. head motion) and the participants' engagement and ability to perform the proposed task. All participants successfully performed the task and were included in the analysis. Preprocessing of single-subject fMRI data included slice-time correction, realignment to the first image to compensate for head motion, and temporal high-pass filtering to remove low-frequency drifts. The anatomical images were co-registered to the functional volumes, and all images were normalized to Talairach coordinate space [24]. After preprocessing, in the first-level analysis of the functional data, a general linear model (GLM) analysis was used for each run. Predictors were modeled as a boxcar function with the length of each condition, convolved with the canonical hemodynamic response function (HRF). Six motion parameters (three
Fig. 1. Stimulation paradigm. Baseline: a fixation cross on which the participants were instructed to focus. Task (picture naming): a sequence of images presented visually for the participants to name silently (in the inner speech run) or overtly (in the overt speech run). Each image was presented on the screen for 3 s, with a total of 10 images per block.
translational and three rotational) and predictors based on spikes (outliers in the BOLD time course) were also included in the GLM as covariates.

At the group level, to map the most important brain regions involved in inner and overt speech, we used the contrast "task" > "baseline". First, we applied 3D spatial smoothing with a Gaussian filter of 6 mm. Taking into account the feasibility nature of our study, we performed a fixed-effects (FFX) analysis. To address the multiple comparisons problem, we applied False Discovery Rate (FDR) correction (considering a false discovery rate of 0.01).

We also aimed at comparing inner and overt speech mechanisms. To this end, we selected a set of regions of interest (ROIs) involved in the speech/word formation network (based on a literature review [5,6,12,15,19,23]). Each individual ROI was selected based on the corresponding anatomical landmarks and on the highest t-statistic voxel of the inner speech run statistical map (contrast "confrontation naming task" > "baseline"). Each ROI was defined as a volume with a maximum of 1000 voxels around the peak value (using the BrainVoyager QX interface tool to define ROIs). We then computed and compared the ROI-GLM t-statistics per ROI between inner and overt speech. We performed a two-sided Wilcoxon rank sum test (Matlab 2017a) to test the statistical significance of the difference between the results obtained for inner and overt speech in the naming task.
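For illustration, the sketch below reconstructs the core of such a first-level analysis in plain NumPy/SciPy: a boxcar task predictor built from the block design described above (nine 15 s rest blocks alternating with eight 30 s task blocks, TR = 3 s, 125 volumes), convolution with a canonical double-gamma HRF, and an ordinary least-squares fit per voxel. It is an illustrative approximation, not the BrainVoyager pipeline used in the study; the HRF parameters and the omission of motion and spike covariates are simplifying assumptions.

```python
import numpy as np
from scipy.stats import gamma

TR, N_VOL = 3.0, 125                       # repetition time (s), volumes per run

# Block design: rest(15 s) + 8 x [task(30 s) + rest(15 s)] = 375 s = 125 volumes
boxcar = np.zeros(N_VOL)
t = 15.0
for _ in range(8):
    on = int(t / TR)
    boxcar[on:on + int(30 / TR)] = 1.0
    t += 30.0 + 15.0

def double_gamma_hrf(tr, duration=32.0):
    """Canonical double-gamma HRF sampled at the TR (standard parameters assumed)."""
    x = np.arange(0, duration, tr)
    h = gamma.pdf(x, 6) - gamma.pdf(x, 16) / 6.0
    return h / h.sum()

task_reg = np.convolve(boxcar, double_gamma_hrf(TR))[:N_VOL]

def glm_tstat(voxel_ts, regressor):
    """OLS fit of one voxel time course on [task, intercept]; returns beta and t."""
    X = np.column_stack([regressor, np.ones_like(regressor)])
    beta, _, _, _ = np.linalg.lstsq(X, voxel_ts, rcond=None)
    resid = voxel_ts - X @ beta
    dof = len(voxel_ts) - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
    return beta[0], beta[0] / se
```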
3
Results
3.1
Whole Brain Analysis - Brain Map of the Naming Task
The FFX-GLM statistical map regarding the inner speech naming task (FFX, q(FDR) < 0.01), considering the contrast of interest "picture naming task" > "baseline" (Fig. 2a), revealed significant activations in the IFG and Middle Frontal Gyrus (MFG) (including Broca's area), Precentral Gyrus (pCG), SMA,
Middle Temporal Gyrus (MTG) (including Wernicke’s area), Intraparietal Sulcus (IPS), Occipital areas and Fusiform Gyrus (FG). Figure 2b presents the FFX-GLM statistical map from the overt speech naming task (FFX, q(FDR) < 0.01), considering the contrast of interest “picture naming task” > “baseline” in which it is possible to identify several brain regions such as the IFG (including Broca’s area), pCG, SMA, MTG (including Wernicke’s area), Occipital areas and FG.
Fig. 2. (a) FFX-GLM group activation map for the inner speech runs (q(FDR) < 0.01), showing areas with higher activation during the task relative to the baseline. (b) FFX-GLM group activation map for the overt speech runs (q(FDR) < 0.01), showing areas with higher activation during the task relative to the baseline. The regions in blue, in both maps, show the expected deactivation, particularly in the default mode network. (Color figure online)
3.2
Comparing Inner and Overt Speech - ROI-Based Analysis and the Speech Brain Network
One of the aims of the study was to compare inner and overt speech activation patterns. To this end, considering a literature review on speech-related brain networks, we identified a total of 16 ROIs (summarized in Table 1). In order to functionally define each ROI, we identified the relevant anatomical landmarks and selected a ROI around the highest t-statistic voxel considering the whole brain inner speech statistical map. Table 1 presents the coordinates of the center of gravity of each ROI (in Talairach coordinates) and the total number of voxels. The beta weights of the contrast “picture naming task” > “baseline” for each region and condition (ROI-GLM) were extracted per participant and run (these weights reflect the BOLD signal variation during the task condition relative to the baseline). To evaluate the statistical significance of the difference between inner and overt speech naming tasks, we performed a two-sided Wilcoxon rank sum test on the beta values for each ROI. The results are presented in Table 1.
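The per-ROI comparison can be reproduced with a few lines of SciPy, noting that Matlab's ranksum corresponds to the two-sided Mann-Whitney U test, which also yields the U statistics reported in Table 1. The beta values below are made-up placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-participant beta weights (contrast "task" > "baseline")
# for one ROI, one value per participant and run.
betas_inner = np.array([-0.02, 0.05, 0.11, -0.04, 0.08, 0.01])
betas_overt = np.array([0.30, 0.22, 0.41, 0.18, 0.35, 0.27])

# Two-sided rank-based comparison; returns the U statistic and the p-value.
u_stat, p_value = mannwhitneyu(betas_inner, betas_overt, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```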
Table 1. Inner vs Overt speech: Talairach coordinates and number of voxels per ROI. Wilcoxon statistical test results to evaluate main differences between inner and overt speech.

Region                            Num. voxels   x    y    z    Inner (median)  Overt (median)   U   p-value
Left middle temporal gyrus        738          -47  -46    9   -0.0244          0.2949           3   0.0556
Left frontal lobe (Broca's area)  1000         -48  146   11    0.0293          0.0908          12   >0.9999
Left putamen                      826          -21   -5   14    0.1152          0.2666           4   0.0952
Right putamen                     412           22    0   10    0.1875          0.3125          10   0.6905
Supplementary motor area          962           -3   -5   57    0.2725          0.3740           6   0.2222
Right middle temporal gyrus       414           47  -31    1    0.1699          0.6025           1   0.0159
Left M1 - Precentral gyrus        970          -43   -5   41    0.1699          0.5049           5   0.1508
Left intraparietal sulcus         979          -46  -41   41   -0.0137          0.0615           7   0.3095
Left inferior parietal lobule     989          -32  -59   45    0.2920          0.1875          15   0.6905
Right inferior parietal lobule    951           32  -58   44    0.2822         -0.1025          17   0.4206
Right M1 - Precentral gyrus       973           48   -4   45    0.3379          0.9150           2   0.0317
Left FG                           1000         -42  -50  -14    0.3125          0.4873           7   0.3095
Right FG                          1000          41  -60  -20    0.2783          1.5811           3   0.0556
Left inferior occipital gyrus     940          -26  -88   -7    1.4863          1.4590          10   0.6905
Right inferior occipital gyrus    941           27  -88   -8    1.6309          1.6074           9   0.5476
Right inferior frontal gyrus      525           48   36    5    0.0156          0.3828           6   0.2222
Our results show that overt speech elicits a stronger activation pattern. Statistically significant differences were found in the right MTG and the right pCG. Additionally, we computed the subtraction between the overt and inner speech activation maps (Fig. 3). The results suggest that the activation in most brain structures is higher for the overt speech task than for the inner speech task, consistent with the ROI-GLM results.
Fig. 3. Subtraction FFX-GLM group activation map between the overt and inner speech runs (q(FDR) < 0.01), showing areas with higher activation during the overt speech task relative to the inner speech task.
4
Discussion
In this study we sought to assess brain activity patterns during two speech tasks, one related to inner speech and the other to overt speech. One new finding that has not been reported in other studies was IPS activity. This finding can be explained by the involvement of the IPS in tasks related to working memory, attention, and attentional control by the left fronto-parietal network, which can be flexibly allocated to language processing as a function of task demands [2,7,11]. Another interesting finding is the activation of the FG, especially the visual word form area (VWFA), during both tasks. Although usually related to the processing of visually presented letter strings, words, pseudowords, and even nonword stimuli [3–5,23,25], the VWFA was active during the performance of speech tasks with image presentation (non-verbal material) in both conditions. This is supported by Cohen [3], who mentions the relation between the visual system and left-lateralized regions engaged in language processing, and by Stevens [22], who reports functional connectivity between the visual word form area and core regions of language processing. Bouhali [1] recently showed functional and anatomical connections between the visual word form area and most perisylvian language-related areas, including Broca's area.
The major task-related difference in the statistical analysis indicates more activation in the right precentral gyrus during the overt speech task. This is in concordance with results published in the literature, which assume that producing overt speech requires a strong motor response controlling all the elements involved in speech production, whereas inner speech, being less dependent on activating articulatory elements, should show lower pCG activation [8,17,18,21]. Another source of difference is the right middle temporal gyrus (rMTG), which shows stronger activation in overt speech when compared to inner speech. This finding remains controversial in the literature, where some authors reported that, in inner speech conditions, they found high activations in other MTG subregions [18]. These intriguing results can be explained by some limitations of our study. First, the small sample size of this exploratory study makes it more prone to be influenced by single individual results, in particular in an FFX analysis. There is also the possibility that distinct subregions in the MTG are modulated differently. Nevertheless, this first approach in Portuguese-speaking participants allows us to map the mechanisms involved in inner speech even without the use of verbal material (e.g. words and sentences). This proof of concept/pilot study may pave the way to further explore the mechanisms involved in inner speech when using verbal stimuli.
5
Conclusions
In this work we were able to map the inner speech related areas, which are in accordance with the literature, and to show a wider bilateral brain activation during overt speech when compared with inner speech, although these differences are dominated mainly by two regions (a part of the MTG and M1). Future research should focus on expanding the understanding of the neural correlates of inner and overt speech. In this sense, we believe that using a parametric difficulty level paradigm design (e.g. from vowel to sentence) may represent an important tool to evaluate major differences between the several areas engaged in inner speech as the difficulty of the task increases.
References
1. Bouhali, F., de Schotten, M.T., Pinel, P., Poupon, C., Mangin, J.F., Dehaene, S., Cohen, L.: Anatomical connections of the visual word form area. J. Neurosci. 34(46), 15402–15414 (2014)
2. Bray, S., Almas, R., Arnold, A.E., Iaria, G., MacQueen, G.: Intraparietal sulcus activity and functional connectivity supporting spatial working memory manipulation. Cereb. Cortex 25(5), 1252–1264 (2013)
3. Cohen, L., Dehaene, S., Naccache, L., Lehéricy, S., Dehaene-Lambertz, G., Hénaff, M.A., Michel, F.: The visual word form area: spatial and temporal characterization of an initial stage of reading in normal subjects and posterior split-brain patients. Brain 123(2), 291–307 (2000)
4. Cohen, L., Lehéricy, S., Chochon, F., Lemer, C., Rivaud, S., Dehaene, S.: Language-specific tuning of visual cortex? Functional properties of the visual word form area. Brain 125(5), 1054–1069 (2002)
5. Dehaene, S., Le Clec'H, G., Poline, J.B., Le Bihan, D., Cohen, L.: The visual word form area: a prelexical representation of visual words in the fusiform gyrus. Neuroreport 13(3), 321–325 (2002)
6. Geva, S., Jones, P.S., Crinion, J.T., Price, C.J., Baron, J.C., Warburton, E.A.: The neural correlates of inner speech defined by voxel-based lesion-symptom mapping. Brain 134(10), 3071–3082 (2011)
7. Grefkes, C., Fink, G.R.: The functional organization of the intraparietal sulcus in humans and monkeys. J. Anat. 207(1), 3–17 (2005)
8. Huang, J., Carr, T.H., Cao, Y.: Comparing cortical activations for silent and overt speech using event-related fMRI. Hum. Brain Mapp. 15(1), 39–53 (2002)
9. Jones, S.R., Fernyhough, C.: Neural correlates of inner speech and auditory verbal hallucinations: a critical review and theoretical integration. Clin. Psychol. Rev. 27(2), 140–154 (2007)
10. Logothetis, N.K.: What we can do and what we cannot do with fMRI. Nature 453(7197), 869 (2008)
11. Majerus, S.: Language repetition and short-term memory: an integrative framework. Front. Hum. Neurosci. 7, 357 (2013)
12. Marvel, C.L., Desmond, J.E.: From storage to manipulation: how the neural correlates of verbal working memory reflect varying demands on inner speech. Brain Lang. 120(1), 42–51 (2012)
13. Morin, A.: Inner speech. In: Hirstein, W. (ed.) Encyclopedia of Human Behavior, pp. 436–443. Elsevier, London (2012)
14. Morin, A., Hamper, B.: Self-reflection and the inner voice: activation of the left inferior frontal gyrus during perceptual and conceptual self-referential thinking. Open Neuroimaging J. 6, 78–89 (2012)
15. Morin, A., Michaud, J.: Self-awareness and the left inferior frontal gyrus: inner speech use during self-related processing. Brain Res. Bull. 74(6), 387–396 (2007)
16. Morin, A., Uttl, B., Hamper, B.: Self-reported frequency, content, and functions of inner speech. Procedia - Soc. Behav. Sci. 30, 1714–1718 (2011)
17. Palmer, E.D., Rosen, H.J., Ojemann, J.G., Buckner, R.L., Kelley, W.M., Petersen, S.E.: An event-related fMRI study of overt and covert word stem completion. Neuroimage 14(1), 182–193 (2001)
18. Perrone-Bertolotti, M., Rapin, L., Lachaux, J.P., Baciu, M., Loevenbruck, H.: What is that little voice inside my head? Inner speech phenomenology, its role in cognitive performance, and its relation to self-monitoring. Behav. Brain Res. 261, 220–239 (2014)
19. Shergill, S.S., Brammer, M.J., Fukuda, R., Bullmore, E., Amaro, E., Murray, R.M., McGuire, P.K.: Modulation of activity in temporal cortex during generation of inner speech. Hum. Brain Mapp. 16(4), 219–227 (2002)
20. Snodgrass, J.G., Vanderwart, M.: A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J. Exper. Psychol. Hum. Learn. Mem. 6(2), 174 (1980)
21. Stephan, F., Saalbach, H., Rossi, S.: How the brain plans inner and overt speech production: a combined EEG and fNIRS study. In: 23rd Annual Meeting of the Organization for Human Brain Mapping (OHBM), Vancouver, Canada (2017)
22. Stevens, W.D., Kravitz, D.J., Peng, C.S., Tessler, M.H., Martin, A.: Privileged functional connectivity between the visual word form area and the language system. J. Neurosci. 37(21), 5288–5297 (2017)
23. Tagamets, M.A., Novick, J.M., Chalmers, M.L., Friedman, R.B.: A parametric approach to orthographic processing in the brain: an fMRI study. J. Cogn. Neurosci. 12(2), 281–297 (2000)
24. Talairach, J., Tournoux, P.: Co-planar Stereotaxic Atlas of the Human Brain. Thieme, New York (1988)
25. Vigneau, M., Jobard, G., Mazoyer, B., Tzourio-Mazoyer, N.: Word and non-word reading: what role for the visual word form area? Neuroimage 27(3), 694–705 (2005)
26. Willinek, W.A., Schild, H.H.: Clinical advantages of 3.0 T MRI over 1.5 T. Eur. J. Radiol. 65(1), 2–14 (2008)
Semi-Supervised Acoustic Model Retraining for Medical ASR

Greg P. Finley1(B), Erik Edwards1, Wael Salloum1,2,3, Amanda Robinson1, Najmeh Sadoughi1, Nico Axtmann3, Maxim Korenevsky1, Michael Brenndoerfer2, Mark Miller1, and David Suendermann-Oeft1

1 EMR.AI Inc., San Francisco, CA, USA
[email protected]
2 University of California, Berkeley, Berkeley, CA, USA
3 DHBW, Karlsruhe, Germany
Abstract. Training models for speech recognition usually requires accurate word-level transcription of available speech data. For the domain of medical dictations, it is common to have “semi-literal” transcripts available: large numbers of speech files along with their associated formatted episode report, whose content only partially overlaps with the spoken content of the dictation. We present a semi-supervised method for generating acoustic training data by decoding dictations with an existing recognizer, confirming which sections are correct by using the associated report, and repurposing these audio sections for training a new acoustic model. The effectiveness of this method is demonstrated in two applications: first, to adapt a model to new speakers, resulting in a 19.7% reduction in relative word errors for these speakers; and second, to supplement an already diverse and robust acoustic model with a large quantity of additional data (from already known voices), leading to a 5.0% relative error reduction on a large test set of over one thousand speakers.
Keywords: Medical speech recognition · ASR · Medical dictation · Acoustic modeling

1
Introduction
Training automatic speech recognition (ASR) systems requires transcribed speech corpora to build acoustic models (AMs) and language models (LMs). Traditionally, such transcriptions are created by human labor, which imposes limitations on how large such corpora can be, how many speakers they can cover, how quickly they can be created, and how consistently transcriptions follow required guidelines. To overcome these limitations, techniques have been proposed to create transcriptions automatically, substantially increasing the size of the training corpus with relatively little effort. For example, Suendermann et al. perform speech recognition on millions of utterances collected in industrial spoken dialog systems and determine, based upon the recognizer's
confidence score, which of the hypotheses can be accepted without further review and which ones should undergo human quality assurance [8]. Such fully automatic techniques suffer from the disadvantage that they rely on pre-existing speech recognition models and settings and have no way to acquire new vocabulary or adapt to new domains. Thus, they suffer if there is a significant mismatch between the training and adaptation language. The medical transcription domain is a special case, however, in that speech recordings of clinical dictations are almost always subject to transcription into a formatted outpatient report, which contains a well-formatted and corrected version of the dictated matter. Note that the process of correcting and modifying a literal transcript into a report is an extensive one and often involves changes that make it impossible to use reports directly as ASR training data: intuiting punctuation, list numbering, etc. when formatting is not explicitly spoken; executing requests by the speaker (e.g., "scratch that"); or even inserting material from elsewhere in the patient's medical history. Strategies for using this very rich set of data for the purpose of model enhancement, and for overcoming its lack of word-level correspondence between spoken and written content, have been discussed in the literature for about two decades. Early research showed that this type of data can indeed be used to adapt a speaker-independent model to new speakers [5,9]; the basic approach is to use an ASR engine to decode new audio with matching reports, then use the results that can be verified as correct as new training data. However, these studies used very small test sets and speech recognition technology which is now widely considered dated. Consequently, the baseline performance is very poor by modern standards, and reported improvements often do not meet statistical significance thresholds. To increase the amount of usable data beyond only the correct outputs of the recognizer, researchers have also explored decoding with LMs built specifically for the report [5], or have explored the use of phonetic [7] and semantic [2,6] features to correct ASR errors using the report as reference. However, the latter studies either did not test how the accuracy of a speech recognizer is affected when adding the new data, or limited the study to LM adaptation. Outside of the medical domain, this type of semi-supervised approach has more recently been applied to parliamentary transcription, which is a similar case in that large amounts of semi-verbatim transcription data are available [3,4]. To our knowledge, however, no validation of these methods exists for building AMs for a modern, production-scale medical ASR system. In this paper we present such a validation for two applications: adapting a model to previously unseen speakers, and enhancing an already large model with additional data from known speakers.
2
Method
We applied semi-supervised methods to enhance the training corpus for AMs in two different experiments. Experiment 1 represents a case of speaker adaptation, using semi-supervised data for speakers unknown to the original acoustic model.
In Experiment 2, on the other hand, we test whether a model can be augmented by adding a large quantity of additional data from many known speakers. The general procedure for both experiments is the same; they differ only in the data sources used. Except where otherwise specified, the methodological details given below are identical for both experiments. For each experiment, we built two AMs and compared their performance in word error rate (WER) on a test set. AM1 was a "traditional" model, trained from fully manually transcribed dictations; AM2 contained all the data of AM1 plus a large set of "virtual transcriptions," generated by (1) ASR decoding of a large set of untranscribed data, then (2) identifying correct hypotheses by comparing with matching reports. The entire training and testing process, including all data and models, is described in detail in Sect. 2.2 and visualized in Fig. 1.
2.1
Data
The primary source of training data consists of manually transcribed dictations, as do all test sets for results reported in this paper. For Experiment 1, no speakers from Test are represented in Train; for Experiment 2, all speakers in Test have exactly one or two dictations in Train. (Recall that Experiment 1 tests the adaptation of a model to new speakers, and Experiment 2 tests the bolstering of an already comprehensive model with more data.) In addition, we have access to a large number of audio dictations with corresponding reports but no transcripts. This corpus constitutes the "Untranscribed" set for each experiment. See Table 1 for size statistics of all corpora.

Table 1. Summary of all dictations. Manual transcriptions are available for Train and Test, and reports for Untranscribed. AM1 was trained on Train and AM2 on Train+Aug.

Data set         # Speakers  # Utterances  # Hours
Experiment 1
  Train           245         6,857         305.0
  Test            26          32            3.5
  Untranscribed   458         12,207        652.8
  Augmentation    457         211,909       259.5
  Train+Aug.      702         218,766       564.5
Experiment 2
  Train           2,384       9,214         396.1
  Test            1,033       1,033         28.9
  Untranscribed   1,241       93,581        6646.5
  Augmentation    1,228       2,269,801     2617.1
  Train+Aug.      2,384       2,279,015     3013.2
In general, corpora used for Experiment 2 are much larger than for Experiment 1. The data also come from different providers, with different speakers, recording conditions, and report styles. Despite the methodological similarity between the two experiments, they should be considered entirely separate cases. Although manual transcriptions are generally considered to be the most accurate source of data for ASR training, medical speech is notoriously difficult due to a number of factors including specialized vocabulary, high rate of speech, etc. [1]. The medical transcriptionists who created Train and Test did so with the aid of matching reports, which themselves were generated through multiple rounds of transcription and quality assurance by other trained transcriptionists. Additionally, to estimate human WER when unaided by reports, we separately obtained three rounds of transcription on a set of 334 dictations: two rounds using reports as a reference, as is our normal procedure, and one "blind" round. These dictations did not overlap with any other data set. Note also that these reports were taken from the same provider as the data from Experiment 2, so any human WER results should only be considered relevant to Experiment 2.
2.2
Generating Additional AM Training Data
The entire Untranscribed set was decoded using our best prior acoustic model and a specially designed language model (AM1 and LM1, described below). Sequences of correctly recognized words in the hypotheses were identified by aligning hypotheses with reports using a dynamic programming algorithm. Any sequence consisting of five or more consecutive words matching perfectly between hypothesis and report was excised, alongside its matching audio range, and considered a training utterance in a large set of supplementary, semi-supervised training data, which we call the Augmentation set (see Table 1). We decided upon a five-word window based on an informal assessment of the excised clips; shorter windows exhibited more slight errors in word boundary detection, which we suspected would propagate in re-training. Our approach for generating training data is conservative in that we only allow perfect matches of substantial length between hypothesis and report. This ensures that the virtual transcriptions are as accurate as possible. Although we piloted some strategies for correcting hypotheses using reports, we have found that, for the quantities of data that we are considering, the perfect matches already provide very large training corpora by themselves.
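The sketch below illustrates this matching step. It uses difflib.SequenceMatcher as a simple stand-in for the dynamic-programming alignment mentioned above and assumes the decoder provides word-level time stamps; the function and variable names, the five-word wrapper, and the example strings are illustrative rather than the system's actual implementation.

```python
from difflib import SequenceMatcher

def verified_segments(hyp_words, hyp_times, report_words, min_len=5):
    """Return (words, start, end) for every run of at least `min_len`
    consecutive words shared by the hypothesis and the report.
    `hyp_times` holds one (start, end) pair per hypothesis word, as produced
    by the decoder's word-level alignment."""
    matcher = SequenceMatcher(a=hyp_words, b=report_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:
            words = hyp_words[block.a:block.a + block.size]
            start = hyp_times[block.a][0]
            end = hyp_times[block.a + block.size - 1][1]
            segments.append((words, start, end))
    return segments

# Example with made-up decoder output and report text.
hyp = "patient denies chest pain or shortness of breath at this time".split()
times = [(0.1 * i, 0.1 * i + 0.08) for i in range(len(hyp))]
rep = "the patient denies chest pain or shortness of breath today".split()
print(verified_segments(hyp, times, rep))
```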
2.3
Acoustic and Language Modeling
Our speech recognizer is based on a state-of-the-art stack with 40-dimensional MFCCs, deltas and delta-deltas, fMLLR, i-vectors, SAT, GMM-HMM pre-training, and a DNN acoustic model. Two n-gram LMs were used: a trigram model (LM1) for decoding the large Untranscribed set, and a 4-gram model (LM2) for the experimental results comparing AM1 and AM2. (LM1 is faster to decode with, whereas LM2 is more accurate, so LM1 was chosen for the massive Untranscribed set and LM2 to achieve the best possible results on Test.)
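For illustration only, a librosa-based sketch of the MFCC front end mentioned above is shown below (40 MFCCs with deltas and delta-deltas at a 10 ms frame shift). The real system presumably computes these inside its ASR toolkit together with the fMLLR and i-vector transforms, so the file name, sampling rate, and window settings here are assumptions.

```python
import librosa
import numpy as np

# Illustrative front end only: 40 MFCCs plus deltas and delta-deltas,
# giving a 120-dimensional frame-level feature before any transforms.
audio, sr = librosa.load("dictation.wav", sr=16000)      # hypothetical file
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40,
                            n_fft=400, hop_length=160)   # 25 ms window, 10 ms hop
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2])                     # shape: (120, n_frames)
```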
To generate LM1, language models are first built for (1) the Train set and (2) the Train + Untranscribed sets; these two are then interpolated, with coefficients tuned to minimize perplexity on a held-out set, to yield the final model. The procedure for LM2 was the same, except that all n-gram counts of Untranscribed were decremented by one, effectively removing singletons and significantly accelerating decoding for an otherwise slow 4-gram model with minimal effect on WER. At no point did we use Augmentation data to train LMs. We suspected that doing so would bias the recognizer towards easy speech and very short utterances. (Note also that some version of the linguistic information from the Augmentation set is already present in the LM, which contains Untranscribed.) This bias is not a concern for AM training, where the currency of recognition is at the phonetic level, and transitional probabilities between words are less important.
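A minimal sketch of this interpolation step is given below, assuming the two component models are available as probability functions: a grid search picks the mixture weight that minimizes perplexity on held-out text. In practice this would be done with a standard LM toolkit; the function names and the grid are illustrative.

```python
import math

def interpolated_logprob(word, history, p_train, p_untr, lam):
    """Log-probability under a linear interpolation of two component models.
    p_train and p_untr are callables returning P(word | history) for the
    Train-only and Train+Untranscribed models (stand-ins for real n-gram LMs)."""
    p = lam * p_train(word, history) + (1.0 - lam) * p_untr(word, history)
    return math.log(max(p, 1e-12))

def perplexity(heldout, p_train, p_untr, lam):
    """Perplexity of the interpolated model on a list of token lists."""
    logprob, n = 0.0, 0
    for sentence in heldout:
        for i, word in enumerate(sentence):
            logprob += interpolated_logprob(word, tuple(sentence[:i]),
                                            p_train, p_untr, lam)
            n += 1
    return math.exp(-logprob / n)

def tune_lambda(heldout, p_train, p_untr, grid=None):
    """Pick the interpolation coefficient minimizing held-out perplexity."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return min(grid, key=lambda lam: perplexity(heldout, p_train, p_untr, lam))
```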
Fig. 1. Experimental training and decoding procedure. Rectangles represent audio, possibly transcribed; cylinders, reports; diamonds, models; ellipses, decoding results.
3
Results: Experiment 1
For Experiment 1, we compared WER on our test set between the baseline acoustic model (AM1) and the large expanded acoustic model (AM2). AM2 decreases the WER from AM1 by 19.7% relative, from 23.1% WER (5,377 edits out of 23,257 words) to 18.6% (4,317 edits), a statistically significant difference as determined by a test of equal proportions (χ2 = 146.2, p < .001). Out of 26 speakers, 22 exhibit a decline in WER—up to a 52.6% relative reduction in the most extreme case (from 105 errors down to 46 errors, out of 512 words). Of the 4 that see an increase, the highest is an 11.1% relative increase (72 errors up to 80, out of 436 words). The distribution across speakers of relative WER change is visualized in Fig. 2.
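The reported significance test can be reproduced, under the assumption that a chi-squared test on the 2x2 error/correct table is an acceptable form of the test of equal proportions, with a few lines of SciPy; the counts below are the Experiment 1 figures quoted above.

```python
from scipy.stats import chi2_contingency

total_words = 23257
errors_am1, errors_am2 = 5377, 4317

# 2x2 table: rows are models, columns are error vs. correctly recognized words.
table = [[errors_am1, total_words - errors_am1],
         [errors_am2, total_words - errors_am2]]
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")   # close to the reported chi2 = 146.2
```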
Fig. 2. Relative WER change by speaker, AM1 to AM2 (Experiment 1).
4
Results: Experiment 2
For Experiment 2, we also measured differences in WER between two acoustic models. Additionally, as the test set contains a much larger number of speakers compared to Experiment 1, we dive deeper into the by-speaker results. Note again that all models and corpora in this Experiment are different than those used in Experiment 1; mentions of ‘AM1’/‘AM2’ in this section refer now to the Experiment 2 versions of these.
Fig. 3. Relative WER change by speaker, AM1 to AM2 (Experiment 2).
4.1
Decoding Accuracy
Decoding with AM2 decreases the WER from AM1 by 5.0% relative, from 22.0% WER (52,961 edits out of 240,382 words) to 20.9% (50,332 edits). Though this effect is smaller than that demonstrated in Experiment 1, the difference is still statistically significant (χ2 = 85.2, p < .001). The decrease in error rate is far from uniform across all speakers, however: relative WER over each speaker decreases by as much as 56% and increases by as much as 75%. WER increases for 303 out of 1033 speakers. See Fig. 3 for the distribution of relative change for individual speakers.
4.2
Effect of Amount of Data Added
The extreme range of variation between speakers, and the fact that many speakers actually see a deterioration in performance, is a surprising finding that invites an explanation. Towards this end, a natural question is whether there is any relationship between the observed changes in WER and the amount of audio data added from the Augmentation set. Across all speakers, there is a correlation between relative change in WER and minutes of audio added, albeit a weak one (Kendall's τ = −.046, p = .026; the correlation is measured over ranks because the distribution of time added is strongly skewed, with a long right tail). This correlation measurement is only possible given the huge number of speakers in the Experiment 2 test set; no similar significant effect could be observed for Experiment 1. The relationship between time added and WER is visualized in Fig. 4.
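A hedged sketch of this correlation analysis is shown below: Kendall's tau between per-speaker minutes of added audio and relative WER change, plus the 10-minute binning used for Fig. 4. The arrays are small made-up examples, not the study's data.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-speaker values: minutes of Augmentation audio added and
# relative change in WER from AM1 to AM2 (negative = improvement).
minutes_added = np.array([0.0, 4.2, 12.5, 33.0, 71.8, 140.3, 260.0, 410.5])
rel_wer_change = np.array([0.09, 0.02, -0.03, -0.06, -0.01, -0.08, -0.05, -0.11])

tau, p = kendalltau(minutes_added, rel_wer_change)
print(f"tau = {tau:.3f}, p = {p:.3f}")

# Group speakers into 10-minute bins (inclusive on the low end), as in Fig. 4.
bin_ids = (minutes_added // 10).astype(int)
```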
Fig. 4. Binned speaker WERs by amount of audio for each speaker in Augmentation data (Experiment 2). AM1 WER is marked by the narrow end of the bar, AM2 WER by the wide end with a circle. Asterisks underneath bars denote statistical significance of WER change from AM1 to AM2 (α = .05, Bonferroni correction).
For this plot, speakers are grouped into bins according to the amount of audio data added, with each bin accounting for a 10-min range (inclusive on the low end only). The plot shows WER for AM1 (narrow end of the trapezoid) and AM2 (wide end) for each bin (thus, the trapezoid "points" in the direction of the change), calculated over all utterances in that bin. We performed a test of equal proportions for each bin, applying Bonferroni correction for multiple comparisons; those five bins with p < .05 are starred in the plot. (Note that the degree of change in a bin is not necessarily tied to statistical significance, as bins do not all contain the same number of speakers or spoken words.) These individual bins are rather small, so most do not show statistically significant changes; all those that do are for speakers with fewer than 220 min of speech added. Most interesting, however, is that the only bin to show a significant increase in WER using AM2 is the 0- to 10-minute bin. This increase is driven mostly by the 30 speakers (out of the bin's 44 total) who had no additional data added and saw an increase in WER of over 2% absolute, 8.5% relative (χ2 = 16.2, p < .001). These 30 speakers stand in stark contrast to the dataset as a whole, which shows a 1.1% decrease in absolute WER.
4.3
Human Word Error Rate
Given the nature of the data used in this work (recordings with at times extreme noise, non-native accent, audio compression artifacts, hesitations, etc.), and inspired by an earlier publication along these lines [1], we decided to study inter-rater consistency of the dataset by measuring the human error rate. Since our standard transcription procedure (Assisted condition) provides transcriptionists with the existing outpatient report of the dictation (which itself had undergone at least two tiers of transcription), we decided to conduct two types of human error rate experiments: (a) compare two transcriptions of the same audio
files created in the Assisted condition and (b) compare transcriptions created in the Assisted condition with those in the Unassisted condition. We expected (a) to exhibit a lower WER than (b) due to the existence of shared material. The inter-transcriber WERs are given in Table 2. In the Unassisted condition, transcribers differ from the Assisted conditions by 18.0% to 20.1%. From these results, it appears that WER on our data by a single transcriber without pre-generated reference material would approach 20%. Even when such material is available, however, there are notable disagreements or errors in transcription (9.3%), further emphasizing the difficulty of the speech in these dictations. Recall again that we commissioned these transcriptions only for the data used in Experiment 2; human WER for Experiment 1 may not be this high.

Table 2. Human WER between different sets of transcriptions. The "Assisted" conditions were done by professional transcriptionists using matched final reports as a reference, and "Unassisted" by transcriptionists without access to the reports.

Comparison             WER
Assisted1–Assisted2    9.3%
Assisted1–Unassisted   18.0%
Assisted2–Unassisted   20.1%
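For reference, WER between any two transcriptions (system vs. reference, or transcriber vs. transcriber as in Table 2) can be computed with a standard word-level Levenshtein alignment; the sketch below is a straightforward implementation, not the exact scoring tool used in this work.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("patient denies chest pain", "the patient denies pain"))
```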
5
Discussion
Our proposed method of providing guaranteed accurate data for AM retraining leads to models with lower average decoding error rates. For the purposes of adapting a model to previously unseen speakers, there was a major reduction in WER, eliminating nearly a fifth of all errors. When bolstering an already large model, the gains are somewhat more modest—especially so when considering that AM2 in Experiment 2 was trained on 7.6 times the amount of audio data as AM1. Our human WER measurements do suggest, however, that these dictations are especially difficult, and that we are already approaching human accuracy, so it may simply be the case that performance of the acoustic models has been "saturated" by this point. The more mixed results in Experiment 2, as well as the large and diverse test set used, invite some speculation as to how speakers may be affected differently vis-à-vis their WER by the data augmentation step. Despite the average drop in WER with AM2, performance did deteriorate in some instances. This was most evident for speakers for whom no data was added to the model. We suspect the cause is that the representation of these speakers in AM2 was diluted compared to their representation in the much smaller AM1. As a concrete recommendation, we would not suggest using an augmented acoustic model for speakers who had no data added, assuming they were already represented in the base model. Other than in this specific case, however, it was difficult to demonstrate any strong relationship between the amount of data added for a speaker and the
degree of recognition improvement. One explanation may be the presence of a confounding effect: speakers with higher AM1 WERs will naturally have less data in the Augmentation set. Because accurate recognition on Untranscribed is a prerequisite for finding utterances to add to Augmentation, speakers for whom the model already does well tend to have the most added data. Indeed, there is a moderately strong correlation (Kendall’s τ = −.20, p < .001) between AM1 WER and amount of data added per speaker; note that this correlation is visually unmistakable in the general downward trend on the left side of Fig. 4. Thus, speakers with the most added data tend to be those who already showed low WER before augmentation. These same speakers would have had less “room for improvement” from changes to the AM: indeed, those speakers with higher AM1 WER tended to have larger relative improvements than those with lower AM1 WER (τ = −.065, p = .002). Taken together, the effects of prior AM1 WER on WER change and on amount of data added may be obscuring some of the positive effects of having more added data. Further gains in performance may be possible via strategies described in the literature for using reports to correct ASR errors on the Untranscribed set, allowing speech previously missed by the recognizer to be used for training. While our methods are sufficient to produce a very large training set, it is likely that adding more difficult speech to training would improve recognition further. This would effectively be an automated active learning approach, using alignment with reports as a semi-supervised step. We also did not attempt to bolster LMs in the same way we did for AMs; however, fully corrected machine transcripts would make this possible to test also.
6
Conclusion
We presented and evaluated a semi-supervised method for augmenting a speaker-independent AM using large numbers of dictations with matching final reports. Our bolstered AMs achieve a significant reduction in error rates, inching closer to human error rates. The methods detailed here are especially effective as a means of adapting an AM to new speakers. By measuring performance on a large test set of over 1,000 speakers, we were able to note patterns in the procedure's effects. The amount of data added seems not to matter much, except that those speakers without any added acoustic data saw on average an increase in WER. This leads naturally to the conclusion that, whenever practical, different AMs should be used for different speakers depending on whether or not data from the target speaker was added in the augmentation stage. Future work will include expanding the approach to language modeling and applying more sophisticated techniques to select optimal models, e.g. using speaker clustering. We will also look deeper into the influence of the human error rate on ASR performance in both training and testing cycles and possible techniques to enhance inter-rater reliability for this difficult domain.
References
1. Edwards, E., et al.: Medical speech recognition: reaching parity with humans. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 512–524. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_51
2. Jancsary, J., Klein, A., Matiasek, J., Trost, H.: Semantics-based automatic literal reconstruction of dictations. In: Semantic Representation of Spoken Language, pp. 67–74 (2007)
3. Kawahara, T.: Transcription system using automatic speech recognition for the Japanese Parliament (Diet). In: IAAI (2012)
4. Kleynhans, N., De Wet, F.: Aligning audio samples from the South African parliament with Hansard transcriptions (2014)
5. Pakhomov, S., Schonwetter, M., Bachenko, J.: Generating training data for medical dictations. In: Proceedings of NAACL-HLT, pp. 1–8 (2001)
6. Petrik, S., et al.: Semantic and phonetic automatic reconstruction of medical dictations. Comput. Speech Lang. 25(2), 363–385 (2011)
7. Petrik, S., Kubin, G.: Reconstructing medical dictations from automatically recognized and non-literal transcripts with phonetic similarity matching. In: ICASSP, vol. 4, pp. IV-1125. IEEE (2007)
8. Suendermann, D., Liscombe, J., Pieraccini, R.: How to drink from a fire hose: one person can annoscribe 693 thousand utterances in one month. In: Proceedings of SIGdial, Tokyo, Japan (2010)
9. Wightman, C.W., Harder, T.A.: Semi-supervised adaptation of acoustic models for large-volume dictation. In: Proceedings of Eurospeech, pp. 1371–1374 (1999)
You Sound Like Your Counterpart: Interpersonal Speech Analysis

Jing Han1(B), Maximilian Schmitt1, and Björn Schuller1,2

1 ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
[email protected]
2 GLAM – Group on Language, Audio & Music, Imperial College London, London, UK
Abstract. In social interaction, people tend to mimic their conversational partners both when they agree and disagree. Research on this phenomenon is complex but not recent in theory, and related studies show that mimicry can enhance social relationships, increase affiliation and rapport. However, automatically recognising such a phenomenon is still in its early development. In this paper, we analyse mimicry in the speech domain and propose a novel method by using hand-crafted low-level acoustic descriptors and autoencoders (AEs). Specifically, for each conversation, two AEs are built, one for each speaker. After training, the acoustic features of one speaker are tested with the AE that is trained on the features of her counterpart. The proposed approach is evaluated on a database consisting of almost 400 subjects from 6 different cultures, recorded in-the-wild. By calculating the AE's reconstruction errors of all speakers and analysing the errors at different times in their interactions, we show that, albeit to different degrees from culture to culture, mimicry arises in most interactions.

Keywords: Affective computing · Conversation analysis · Computational paralinguistics

1
Introduction
Research in psychology has shown that people unconsciously mimic their counterpart in social interaction, which can be operationalised in varying ways, including mimicking posture, facial expressions, mannerisms, and other verbal and nonverbal expressions [5]. Moreover, the automatic detection of temporal mimicry behaviour can serve as a powerful indicator of social interaction, e.g., cooperativeness, courtship, empathy, rapport, and social judgement [12]. Previous works focus on automatically detecting mimicry behaviours particularly from head nods and smiles, i.e., visual cues [3,23]. In this work, we focus on the acoustic side, given that in social interaction, people mimic others not only through body language, but also in their speech. To the best of our knowledge, this is the first time that such behaviour is analysed from speech over
different cultures in empirical research, though previous works exist where similar topics have been studied in theory [4]. As related work on this specific topic is limited, we first utilised low-level descriptors (LLDs) such as log-energy and pitch and measured the similarities over each conversation turn, but hardly found any obvious trend in these descriptors. Thus, we propose an autoencoder-based framework to leverage the power of machine learning. Specifically, for each interaction, two autoencoders (AEs) are trained on speech from the two subjects A and B, respectively. Then, once the training procedure is done, the instances are exchanged and fed into the two autoencoders again, i.e., A is evaluated on the AE trained with data from B, while B is evaluated on the AE trained with data from A. This follows the hypothesis that, when a subject tends to behave similarly to her counterpart, the features reconstructed by the AE trained with her counterpart's data should have a decreasing error over time. In the following section, the related work is summarised both from a sociological and a technical perspective. In Sect. 3, we describe the data and acoustic features used in our research. In Sect. 4, we explain the experiments and present the results, before concluding in Sect. 5.
2
Related Work
Mimicry behaviour can be categorised into two different groups: emotional mimicry and motor mimicry [13]. While the first describes mimicry of the underlying affective state, such as happiness or sadness, the latter considers only the imitation of physical expressions, such as raising an eyebrow or nodding the head. As can be expected, motor mimicry is much easier to detect than emotional mimicry, given that physical expressions can be classified quite objectively by a human observer and also by automated tools. In the late 1970s, Friesen and Ekman proposed the 'Facial Action Coding System' [11] based on so-called facial action units (FAUs). FAUs describe 44 different activations of facial muscles, resulting in a certain facial expression, e.g., 'raising eye brow', 'wrinkling nose', or 'opening mouth'. However, several FAUs can be combined and be active at the same time. Ekman and Friesen have also shown that there is a strong relationship between FAUs and affective states [8] and that those relationships are largely universal, even though there are some differences between cultures [7]. FAUs and head movements can be robustly recognised with state-of-the-art tools, such as OpenFace [2]. Motor mimicry is a means of persuasion in human-to-human interaction, by conforming to the other's opinions and behaviour [13]. Humans are susceptible to mimicking behaviours through both the audio and the visual domain [16]. Although mimicry is found in interactions both when subjects disagree with each other and when they do not, there are more mimicry interactions where people agree [23]. Moreover, it has been shown that there is usually a tendency to adopt the gestures, postures, and behaviour of the chat partner over time during the conversation [5,6].
From the methodological point of view, for the automatic detection of behavioural mimicry, a temporal regression model has been proposed by Bilakhia et al. predicting audio-visual features of the chat partner using a deep recurrent neural network [3]. An ensemble of models has been trained for each class (mimicry/non-mimicry) and the ensemble providing the lowest reconstruction error determined the class. Mel-frequency cepstral coefficients have been employed as acoustic features and facial landmarks as visual features. Compared to motor mimicry, emotional mimicry has been studied much less. However, it has been found that the tendency to mimic others’ behaviour is much less valid from the emotional perspective [14]. The extent of emotional mimicry highly depends on the social context and emotional mimicry is not present if people do not like each other or each other’s opinion. Scissors et al. found the same analysing the linguistic behaviour [21]. They observed that in a text-based chat system, within-chat mimicry (i.e., repetition of words or phrases) is much higher in chats where subjects trusted each other than in chats with a low level of trust. Furthermore, it was found that linguistic mimicry has a positive effect on the outcome of negotiations [24].
3
Dataset and Features
Our experiments are based on the SEWA corpus of audio-visual interaction in-the-wild1. Hand-crafted acoustic features have been extracted on the frame-level from the audio of all chats.
3.1
SEWA Video Chat Dataset
In the SEWA database, 197 conversations have been recorded from subjects of six different cultures (Chinese, Hungarian, German, British, Serbian, and Greek). Table 1 summarises the number and total duration of conversations for each culture. The number of subjects is always twice the number of conversations. In these conversations, each lasting up to 3 min, a pair of subjects from the same culture discussed about an advertisement they just watched beforehand on a web platform. Figure 1 illustrates a screenshot of one dyadic conversation. The commercial seen beforehand was a 90 s long video clip advertising a (water) tap. All subjects were recorded in an ‘in-the-wild setting’, i.e., using the subjects’ personal desktop computers or notebooks and recording them either at their homes or in their offices. The chat partners always knew each other beforehand (either family, friends, or colleagues) and were balanced w.r.t. gender constellations (female-male, female-female, male-male). Subjects with an age ranging from 18 to older than 60 are included in the database. The dialogues had to be held in the native language of the chat partners, but there were no restrictions concerning the exact aspects to be discussed during their chat about the commercial. Conversations showed a large variety of emotions and levels of agreement/disagreement or rapport. The SEWA corpus has been used as the official benchmark database in the 2017 and 2018 Audio-Visual Emotion Challenges (AVEC) [17,18]. 1
https://sewaproject.eu/.
3.2
Acoustic Features
We use the established ComParE feature set of acoustic features [9]. For each audio recording, we capture the acoustic low-level descriptors (LLDs) with the openSMILE toolkit [10] at a step size of 10 ms. The ComParE LLDs extracted on frame-level have been introduced at the Interspeech 2013 Computational Paralinguistics Challenge (ComParE) [20]. However, the functionals defined in the feature set, i.e., the statistics summarising the LLDs on utterance level, are not applied in this work, as we are interested in the time-dependent information on frame-level. ComParE comprises 65 LLDs summarised in Table 2, covering spectral, cepstral, prosodic, and voice quality information, extracted from a frame with a size of 20 ms to 60 ms length. In addition, the first order derivatives (deltas) are computed, resulting in a frame-level feature vector of size 130 for each step of 10 ms.

Table 1. SEWA corpus: Number of conversations and subjects and total duration for each culture.

Index  Culture    # Conversations  # Subjects  Total duration [min]
C1     Chinese    35               70          101
C2     Hungarian  33               66          67
C3     German     32               64          89
C4     British    33               66          94
C5     Serbian    36               72          98
C6     Greek      28               56          81
Sum               197              394         530
Fig. 1. SEWA corpus: Screenshot taken from a sample video chat with one female and one male German subject.
Table 2. Interspeech 2013 Computational Paralinguistics Challenge (ComParE) feature set. Overview of 65 acoustic low-level descriptors (LLDs). RMS: Root-Mean-Square, RASTA: RelAtive SpecTral Amplitude, MFCC: Mel-Frequency-Cepstral Coefficients, SHS: Sub-Harmonic Summation.

4 energy related LLD                      Group
Loudness                                  Prosodic
Modulation loudness                       Prosodic
RMS energy, zero-crossing rate            Prosodic

55 spectral related LLD                   Group
RASTA auditory bands 1–26                 Spectral
MFCC 1–14                                 Cepstral
Spectral energy 250–650 Hz, 1–4 kHz       Spectral
Spectral roll-off pt. .25, .50, .75, .90  Spectral
Spectral flux, entropy, variance          Spectral
Spectral skewness and kurtosis            Spectral
Spectral slope                            Spectral
Spectral harmonicity                      Spectral
Spectral sharpness (auditory)             Spectral
Spectral centroid (linear)                Spectral

6 voicing related LLD                     Group
F0 via SHS                                Prosodic
Probability of voicing                    Voice quality
Jitter (local and delta)                  Voice quality
Shimmer                                   Voice quality
Log harmonics-to-noise ratio              Voice quality

4
Behaviour Similarity Tendency Analysis with Autoencoder
To analyse the interpersonal sentiment and investigate the temporal behaviour patterns from speech, we first standardised (zero mean and unit standard deviation) the 130 frame-level features within the same recordings to minimise the differences between different recording conditions. This procedure turned these LLDs into suitable ranges for use as the inputs and target outputs of an autoencoder (AE). Before training the AE, we first segmented the LLD sequences based on the transcriptions provided in the SEWA database, where information on the start and end of each speech segment and the subject ID of the corresponding segment is given. After that, the whole LLD sequences of each recording were divided into two sub-sequences, each including features from only one subject. Following the above-mentioned separation process, features from one subject were utilised to train an AE, and features from the other subject in the same
recording were fed into the trained AE for testing. Furthermore, once all features for testing have been reconstructed with the AE, we calculate the root-mean-squared errors (RMSEs) of the reconstructed features over time, and examine how and to what extent the RMSE varies over time. Consequently, for each recording, two AEs are learnt based on the two subjects involved in the recording, resulting in two one-dimensional RMSE sequences calculated between the input and the output feature sequences during the testing step.
4.1
Experimental Settings
The AE we applied is a 3-layer encoder with a 3-layer decoder. In preliminary experiments, the number of nodes in each layer was chosen as follows: 130-64-32-12-32-64-130, where the output dimension is exactly the same as the input dimension. During network training, the network weights were updated using a mean squared error loss and the Adagrad optimizer, and the training process was stopped after 512 epochs. Furthermore, to accelerate the training process, the network weights were updated after every batch of 256 LLD frames, allowing the computation to run in parallel. The training procedure was performed with Keras, a deep learning library for Python. After generating the reconstruction errors of the tested subject over time, the resulting sequence is used to fit a linear regression, under the assumption that the slope of the learnt line indicates changes in the behaviour patterns over time. More specifically, when the slope is negative, it may indicate that during the chat session the tested subject becomes more similar to the subject (s)he is talking to; if the slope is positive, it may imply the opposite. Additionally, the magnitude of the slope can denote the level of the similarity or dissimilarity.
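A minimal Keras sketch consistent with the configuration described above is given below. The layer sizes, loss, optimizer, epoch count, and batch size follow the text; the choice of activations, the random placeholder feature matrices, and the final slope computation are assumptions added for illustration.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_autoencoder(dim=130):
    # 130-64-32-12-32-64-130 architecture; activations are an assumption.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(dim,)),
        Dense(32, activation="relu"),
        Dense(12, activation="relu"),
        Dense(32, activation="relu"),
        Dense(64, activation="relu"),
        Dense(dim, activation="linear"),
    ])
    model.compile(optimizer="adagrad", loss="mse")
    return model

# Placeholder LLD matrices (frames x 130), one per speaker of a conversation.
lld_a = np.random.randn(6000, 130)
lld_b = np.random.randn(5500, 130)

ae_a = build_autoencoder()
ae_a.fit(lld_a, lld_a, epochs=512, batch_size=256, verbose=0)

# Test speaker B on the AE trained on speaker A; track frame-wise RMSE over time.
recon_b = ae_a.predict(lld_b)
rmse_b = np.sqrt(np.mean((recon_b - lld_b) ** 2, axis=1))

# Slope of a linear fit to the RMSE sequence; a negative slope is read as
# speaker B sounding increasingly like speaker A.
slope = np.polyfit(np.arange(len(rmse_b)), rmse_b, 1)[0]
```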
4.2
Results and Discussion
We first discuss the results achieved with the data from the first culture, Chinese (C1). Over all 35 recordings, the average slope of the RMSE sequences of all 70 subjects is −0.07. From Fig. 2, we notice that most of the slopes (54 of 70) are negative, whereas only a few (16 of 70) are positive. This indicates that, during the recordings, the acoustic LLD features of the tested subjects have a smaller reconstruction error as time passes. Considering that the AE is trained with the other subject within the same recording, a smaller reconstruction error may reveal a higher similarity between these two subjects. To sum up, a negative slope implies a decreasing reconstruction error over time and could indicate an increasing similarity between the speakers during the video chat. Interestingly, similar patterns have also been found in all the other five cultures. Nevertheless, the ratio of negative slopes and the average slope differ from culture to culture. Given these results, we calculated the average slopes of all cultures separately, as well as the Pearson correlation coefficients (PCCs) of the two slopes obtained from all recordings within the same culture, with the aim of assessing cultural variation in spontaneous remote conversations.
Fig. 2. Slope of RMSE sequences of 70 Chinese subjects from 35 recordings. In each recording, there are two subjects as denoted with blue and red bars, respectively. (Color figure online)
Results are given in Table 3.

Table 3. Average slope of the RMSE sequences of all subjects within each of the six cultures (upper row); the correlation coefficient, denoted as pcc of pairs, indicates the correlation between the behaviours of the two subjects of each pair and is listed in the lower row (C1: Chinese, C2: Hungarian, C3: German, C4: British, C5: Serbian, C6: Greek).

               C1     C2     C3     C4     C5     C6
Average slope  −0.07  −0.11  −0.10  −0.07  −0.08  −0.12
pcc of pairs   −0.03   0.34   0.15   0.39   0.39  −0.26

Note that a negative slope denotes that a subject's speech behaviour becomes more similar to that of the partner over the course of the conversation; the more a subject comes to sound like his or her partner, the more negative the slope. From Table 3, one may notice that, on average, individuals of all six cultures tend to behave more similarly during the conversation, given that the average slopes are all negative. However, cultural variation remains, as the most negative slope (−0.12) is obtained for the Greek (C6) culture and the smallest slope (−0.07) is seen for Chinese (C1) and British (C4). Moreover, taking the PCC into account, we may see the cultural variation from another view. A positive PCC value demonstrates that subjects of a culture tend to converge to a similar state, either both behaving like or unlike each other, while a negative PCC may indicate that conversations are more likely to be dominated by one subject. For example, no correlation has been seen in
5
Conclusion and Outlook
In this work, we have demonstrated that, an autoencoder has a great potential to recognise the spontaneous and unconscious mimicry in the social interaction, by the observation of the reconstruction error using the acoustic features extracted from the speech of a conversational partner. We have given some insights into the synchronisation of vocal behaviour in dyadic conversations of people from six different cultures. Future work will focus on optimised feature representations, such as bag-of-audio-words [19] or learnt features such as auDeep [1]. Moreover, we are going to exploit also the linguistic domain through state-of-the-art word embeddings, such as word2vec [15]. Lastly, other than the slope of the reconstruction errors, additional evaluation strategies to measure the degree of similarity of similarity between subjects will be explored in the future [6]. Acknowledgments. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under GA No. 645094 (Innovation Action SEWA) and through the EFPIA Innovative Medicines Initiative under GA No. 115902 (RADAR-CNS).
References
1. Amiriparian, S., Freitag, M., Cummins, N., Schuller, B.: Sequence to sequence autoencoders for unsupervised representation learning from audio. In: Proceedings of the DCASE 2017 Workshop, Munich, Germany (2017)
2. Baltrušaitis, T., Robinson, P., Morency, L.P.: OpenFace: an open source facial behavior analysis toolkit. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, pp. 1–10 (2016)
3. Bilakhia, S., Petridis, S., Pantic, M.: Audiovisual detection of behavioural mimicry. In: Proceedings of the Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), Geneva, Switzerland, pp. 123–128 (2013)
4. Burgoon, J.K., Hubbard, A.E.: Cross-cultural and intercultural applications of expectancy violations theory and interaction adaptation theory. In: Gudykunst, W.B. (ed.) Theorizing about Intercultural Communication, pp. 149–171. Sage Publications, Beverly Hills (2005) 5. Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception-behavior link and social interaction. J. Pers. Soc. Psychol. 76(6), 893–910 (1999) 6. Delaherche, E., Chetouani, M., Mahdhaoui, A., Saint-Georges, C., Viaux, S., Cohen, D.: Interpersonal synchrony: a survey of evaluation methods across disciplines. IEEE Trans. Affect. Comput. 3(3), 349–365 (2012) 7. Ekman, P.: Universals and cultural differences in facial expressions of emotion. In: Nebraska Symposium on Motivation. University of Nebraska Press (1971) 8. Ekman, P., Friesen, W.V.: Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Ishk, Ujjain (2003) 9. Eyben, F.: Real-Time Speech and Music Classification by Large Audio Feature Space Extraction. Springer, Switzerland (2016). https://doi.org/10.1007/978-3319-27299-3 10. Eyben, F., Weninger, F., Groß, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia (MM), Barcelona, Spain, pp. 835–838 (2013) 11. Friesen, E., Ekman, P.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto (1978) 12. Gueguen, N., Jacob, C., Martin, A.: Mimicry in social interaction: its effect on human judgment and behavior. Eur. J. Soc. Sci. 8(2), 253–259 (2009) 13. Hess, U., Fischer, A.: Emotional mimicry as social regulation. Pers. Soc. Psychol. Rev. 17(2), 142–157 (2013) 14. Hess, U., Fischer, A.: Emotional mimicry: why and when we mimic emotions. Soc. Pers. Psychol. Compass 8(2), 45–57 (2014) 15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, pp. 3111–3119 (2013) 16. Parrill, F., Kimbara, I.: Seeing and hearing double: the influence of mimicry in speech and gesture on observers. J. Nonverbal Behav. 30(4), 157 (2006) 17. Ringeval, F., et al.: AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition. In: Proceedings of the 8th Annual Workshop on Audio/Visual Emotion Challenge, Seoul, Korea (2018, to appear) 18. Ringeval, F., et al.: AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In: Proc. of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, pp. 3–9 (2017) 19. Schmitt, M., Ringeval, F., Schuller, B.: At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech. In: Proceedings of INTERSPEECH, San Francisco, CA, pp. 495–499 (2016) 20. Schuller, B., et al.: The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In: Proceedings of INTERSPEECH, Lyon, France, pp. 148–152 (2013) 21. Scissors, L.E., Gill, A.J., Gergle, D.: Linguistic mimicry and trust in text-based CMC. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work, San Diego, CA, pp. 277–280 (2008) 22. Stivers, T., et al.: Universals and cultural variation in turn-taking in conversation. Proc. Nat. Acad. Sci. U.S.A 106(26), 10587–10592 (2009)
23. Sun, X., Nijholt, A., Truong, K.P., Pantic, M.: Automatic visual mimicry expression analysis in interpersonal interaction. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Colorado Springs, CO, pp. 40–46 (2011) 24. Swaab, R.I., Maddux, W.W., Sinaceur, M.: Early words that work: When and how virtual linguistic mimicry facilitates negotiation outcomes. J. Exper. Soc. Psychol. 47(3), 616–621 (2011)
TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation
François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève
Ubiqus, Paris, France ([email protected]); LIUM, University of Le Mans, Le Mans, France ({sahar.ghannay,natalia.tomashenko,yannick.esteve}@univ-lemans.fr)
https://www.ubiqus.com - https://lium.univ-lemans.fr/
Abstract. In this paper, we present the TED-LIUM release 3 corpus (TED-LIUM 3 is available at https://lium.univ-lemans.fr/ted-lium3/) dedicated to speech recognition in English, which multiplies the data available to train acoustic models by more than a factor of two in comparison with TED-LIUM 2. We present recent developments of Automatic Speech Recognition (ASR) systems in comparison with the two previous releases of the TED-LIUM corpus from 2012 and 2014. We demonstrate that passing from 207 to 452 h of transcribed speech training data is considerably more useful for end-to-end ASR systems than for HMM-based state-of-the-art ones. This is the case even if the HMM-based ASR system still outperforms the end-to-end ASR system when the size of the audio training data is 452 h, with a Word Error Rate (WER) of 6.7% and 13.7%, respectively. Finally, we propose two repartitions of the TED-LIUM release 3 corpus: the legacy repartition that is the same as the one existing in release 2, and a new repartition, calibrated and designed for experiments on speaker adaptation. Like the first two releases, the TED-LIUM 3 corpus will be freely available for the research community.

Keywords: Speech recognition · Open-source corpus · Deep learning · Speaker adaptation · TED-LIUM

1 Introduction
In May 2012 and May 2014, the LIUM team released two versions (with 118 h and 207 h of audio, respectively) built from TED conference videos, which have since been widely used by the ASR community for research purposes. These corpora were called TED-LIUM release 1 and release 2, presented respectively in [10,11]. Ubiqus joined these efforts to pursue the improvements, both from an increased-data standpoint and from a technical-achievement one. We believe that
this corpus has become a reference and will continue to be used by the community to improve further the results. In this paper, we present some enhancements regarding the dataset, by using a new engine to realign the original data, leading to an increased amount of audio/text, and by adding new TED talks, which combined with the new alignment process, gives us 452 h of aligned audio. A new data distribution is also proposed that is more suitable for experimenting with speaker adaptation techniques, in addition to the legacy distribution already used on TED-LIUM release 1 and 2. Section 2 gives details about the new TED-LIUM 3 corpus. We present experimental results with different ASR architectures, by using Time Delay Neural Network (TDNN) [5] and Factored TDNN (TDNN-F) acoustic models [7] on the legacy distribution of TED-LIUM 3 in Sect. 3, and also exploring the use of a pure neural end-to-end system in Sect. 4. In Sect. 5, we report experimental results obtained on the speaker adaptation distribution by exploiting GMM-HMM and TDNN-Long Short-Term Memory (TDNN-LSTM) [6] acoustic models and two standard adaptation techniques (ivectors and feature space maximum linear regression (fMLLR)). The final section is dedicated to discussion and conclusion.
2 TED-LIUM 3 Corpus Description

2.1 Data, Alignment and Filtering
All raw data (acoustic signals and closed captions) were extracted from the TED website. For each talk, we built a sphere audio file, and its corresponding transcript in stm format. The text from each .stm file was automatically aligned to the corresponding .sph file using the Kaldi toolkit [8]. This consists of the adaptation of existing scripts1 , intended to first decode the audio files with a biased language model, and then align the obtained .ctm file with the reference transcript. To maximize the quality of alignments, we used our best model (at the time of corpus preparation) trained on the previous release of the TEDLIUM corpus. This model achieved a WER of 9.2% on both development and test sets without any rescoring. This means the ratio of aligned speech versus audio from the original 1,495 talks of releases 1 and 2 has changed, as well as the quantity of words retained. It increased the amount of usable data from the same basis files by around 40% (Table 1). In the previous release, aligned speech represented only around 58.9% of the total audio duration (351 h). With these new alignments, we now cover around 83.0% of audio. A first set of experiments was conducted to compare equivalent systems trained on the two sets of data (Table 2). With strictly equivalent models, there is no clear improvement of results for the proposed new alignments. Yet, there is no degradation of performance either. We will show in further experiments that the increased amount of data will not just be harmless, but also useful. 1
https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/ segment long utterances.sh.
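A small, hedged sketch of the kind of bookkeeping behind the reported alignment coverage (e.g. the 58.9% vs. 83.0% figures above): summing the aligned speech segments of an .stm transcript and relating them to the total audio duration. The field layout follows the standard STM convention; the exact filtering used for the corpus is not reproduced here.

```python
def stm_speech_coverage(stm_path, audio_duration_s):
    """Fraction of the audio duration covered by aligned speech segments in an
    STM file. STM lines have the form:
    <file> <channel> <speaker> <start> <end> [<label>] <transcript...>
    Overlapping segments are not merged here, which slightly overestimates
    coverage for overlapping annotations."""
    covered = 0.0
    with open(stm_path, encoding="utf-8") as stm:
        for line in stm:
            line = line.strip()
            if not line or line.startswith(";;"):   # skip comments and empty lines
                continue
            fields = line.split()
            start, end = float(fields[3]), float(fields[4])
            covered += max(0.0, end - start)
    return covered / audio_duration_s
```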
Table 1. Maximizing alignments - TED-LIUM release 2 talks.

Characteristic   Original alignments   New alignments   Evolution
Speech           207 h                 290 h            40.1%
Words            2.2M                  3.2M             43.1%
Table 2. Comparison of training on original and new alignments for TED-LIUM release 2 data (experiments conducted with the Kaldi toolkit - details in Sect. 3).

Model (rescoring)      Original - 207 h        New - 290 h
                       Dev       Test          Dev       Test
HMM-GMM (none)         19.0%     17.6%         18.7%     17.2%
HMM-GMM (Ngram)        17.8%     16.5%         17.7%     16.1%
HMM-TDNN-F (none)      8.5%      8.3%          8.2%      8.3%
HMM-TDNN-F (Ngram)     7.8%      7.8%          7.7%      7.9%
HMM-TDNN-F (RNN)       6.8%      6.8%          6.6%      6.7%

2.2 Corpus Distribution: Legacy Version
The whole corpus is released as what we call a legacy version, for which we keep the same development and test sets as the first releases. Table 3 summarizes the characteristics of the text and audio data of the new release of the TED-LIUM corpus. Statistics from the previous and new releases are presented, as well as the evolution between the two. Additionally, we mention that aligned speech (including some noises and silences) represents around 82.6% of the audio duration (540 h).

Table 3. TED-LIUM 3 corpus characteristics.

Characteristic              Corpus v2     Corpus v3     Evolution
Total duration              207 h         452 h         118.4%
- Male                      141 h         316 h         124.1%
- Female                    66 h          134 h         103.0%
Mean duration               10 min 12 s   11 min 30 s   12.7%
Number of unique speakers   1242          2028          63.3%
Number of talks             1495          2351          57.3%
Number of segments          92976         268231        188.5%
Number of words             2.2M          4.9M          122.7%
2.3 Corpus Distribution: Speaker Adaptation Version
Speaker adaptation of acoustic models (AMs) is an important mechanism to reduce the mismatch between the AMs and test data from a particular speaker, and today it is still a very active research area. In order to design a suitable corpus for exploring speaker adaptation algorithms, additional factors and dataset characteristics, such as the number of speakers, the amount of pure speech data per speaker, and others, should be taken into account. In this paper, we also propose and describe training, development and test datasets specially designed for the speaker adaptation task. These datasets are obtained from the proposed TED-LIUM 3 training corpus, but the development and test sets are more balanced and representative in their characteristics (number of speakers, gender, duration) than the original sets and more suitable for speaker adaptation experiments. In addition, for the development and test datasets we chose only speakers who are not present in the training data set in other talks. The statistics for the proposed data sets are given in Table 4.

Table 4. Data set statistics for the speaker adaptation task. Unlike the other tables, the duration is calculated only for pure speech (excluding silence, noise, etc.).

Characteristic                                         Train    Dev.    Test
Duration of speech, hours                Total         346.17   3.73    3.76
                                         Male          242.22   2.34    2.34
                                         Female        104.0    1.39    1.41
Duration of speech per speaker, minutes  Mean          10.7     14.0    14.1
                                         Min.          1.0      13.6    13.6
                                         Max.          25.6     14.4    14.5
Number of speakers                       Total         1938     16      16
                                         Male          1303     10      10
                                         Female        635      6       6
Number of words                          Total         4437K    47753   43931
Number of talks                          Total         2281     16      16
3 Experiments with State-of-the-Art HMM-Based ASR System
We conducted a first set of experiments on the TED-LIUM release 2 and 3 corpora using the Kaldi toolkit. These experiments were based on the existing recipe2 , mainly changing model configurations and rescoring strategies. We also kept the lexicon from the original release, containing 159,848 entries. For this, and all other experiments in this paper, no glm files were applied to deal with equivalences between word spelling (e.g. doctor vs. dr). 2
https://github.com/kaldi-asr/kaldi/tree/master/egs/tedlium/s5 r2.
3.1 Acoustic Models
All experiments were conducted using chain models [9] with the now well-known TDNN architecture [5] as well as the recent TDNN-F architecture [7]. Training audio samples were randomly perturbed in speed and volume during the training process. This approach is commonly called audio augmentation and is known to be beneficial for speech recognition [4].
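As an illustration of the speed and volume perturbation mentioned above, here is a minimal numpy sketch. It is not the Kaldi implementation (which applies fixed speed factors such as 0.9/1.0/1.1 and works on whole data directories); the perturbation ranges below are illustrative assumptions.

```python
import numpy as np

def augment(waveform, speed_range=(0.9, 1.1), gain_db_range=(-6.0, 6.0)):
    """Random tempo (speed) and volume perturbation of a mono waveform given
    as a 1-D numpy array. The speed change is approximated by resampling the
    signal with linear interpolation; the volume change is a random gain."""
    speed = np.random.uniform(*speed_range)
    gain_db = np.random.uniform(*gain_db_range)

    old_idx = np.arange(len(waveform))
    new_len = int(round(len(waveform) / speed))      # speed > 1 -> shorter signal
    new_idx = np.linspace(0, len(waveform) - 1, new_len)
    perturbed = np.interp(new_idx, old_idx, waveform)

    return perturbed * (10.0 ** (gain_db / 20.0))
```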
3.2 Language Model
Two approaches were used, both aiming at rescoring lattices. The first one is an N-gram model of order 4 trained with the pocolm toolkit3, which was pruned to 10 million N-grams. We also considered an RNNLM with letter-based features and importance sampling [15], coupled with a pruned approach to lattice rescoring [14]. The RNNLM we retained was a mixture of three TDNN layers with two interspersed LSTMP layers [12], containing around 10 million parameters. The latter helps to reduce the word error rate drastically. We used the same corpus and vocabulary in both methods, which are those released along with TED-LIUM release 2. These experiments were conducted prior to the full preparation of the new release, so we only appended text from the original alignments of release 2 to this corpus. In total, the textual corpus used to train language models contains approximately 255 million words. These source data are described in [11].
3.3 Experimental Results
In this section, we present recent developments of Automatic Speech Recognition (ASR) systems that can be compared with the two previous releases of the TED-LIUM corpus from 2012 and 2014. While the first version of the corpus achieved a WER of 17.4% at that time, the second version decreased it to 11.1% using additional data and Deep Neural Network (DNN) techniques. TDNN. Our baseline chain-TDNN setup is based on 6 layers with batch normalization and a total context of (−15, 12). Prior tuning experiments on TED-LIUM release 2 showed us that the model did not improve beyond a dimension of 450. More than doubling the training data allows the training of bigger, and better, models of the same architecture, as shown in Table 5. As part of experiments in tuning Kaldi models, it appeared that a form of L2 regularization could allow longer training with less risk of overfitting. This was implemented in Kaldi as the proportional-shrink option. Some tuning on TED-LIUM 2 data gave the best result for a value of 20. All experiments presented in Table 5 were realized with this value to keep a consistent baseline. Aiming to reduce the WER even more, and under time constraints, we chose to retrain the model with dimension 1024 and a proportional-shrink value of 10 (as we approximately doubled the size of the corpus). After RNNLM lattice rescoring, the WER decreased to 6.2% on the dev set and 6.7% on the test set. 3
https://github.com/danpovey/pocolm.
Table 5. Tuning the hidden dimension of the chain-TDNN setup on the TED-LIUM release 3 corpus.

Dimension   WER               WER - Ngram       WER - RNN
            Dev      Test     Dev      Test     Dev      Test
450         9.0%     9.1%     8.0%     8.4%     6.9%     7.3%
600         8.7%     8.9%     8.0%     8.4%     6.6%     7.3%
768         8.3%     8.6%     7.6%     8.1%     6.5%     7.0%
1024        8.3%     8.5%     7.5%     8.0%     6.4%     6.9%
TDNN-F. As a final set of experiments, we tried the recently-introduced factorized TDNN approach, which again resulted in significant improvements in WER for both the TED-LIUM release 2 and 3 corpora (Table 6).

Table 6. Factorized TDNN experiments on TED-LIUM release 2 and 3 corpora.

Corpus   Model                                  WER               WER - Ngram       WER - RNN
                                                Dev      Test     Dev      Test     Dev      Test
r2       TDNN-F - 11 layers 1280/256 - ps20     8.5%     8.3%     7.8%     7.8%     6.8%     6.8%
r3       TDNN-F - 11 layers 1280/256 - ps10     7.9%     8.1%     7.4%     7.7%     6.2%     6.7%
4 Experiments with Fully Neural End-to-End ASR System
We also conducted experiments to evaluate the impact of adding data to the training corpus in order to build a neural end-to-end ASR. The system with which we experimented does not use a vocabulary to produce words, since it emits sequences of characters.
4.1 Model Architecture
The fully end-to-end architecture used in this study is similar to the Deep Speech 2 neural ASR system proposed by Baidu in [1]. This architecture is composed of $n_c$ convolution layers (CNN), followed by $n_r$ uni- or bidirectional recurrent layers, a lookahead convolution layer [13], and one fully connected layer just before the softmax layer, as shown in Fig. 1. The system is trained end-to-end by using the CTC loss function [2], in order to predict a sequence of characters from the input audio. In our experiments, we used two CNN layers and six bidirectional recurrent layers with batch normalization as mentioned in [1]. Given an utterance $x^{(i)}$ and label $y^{(i)}$ sampled from a training set $X = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$,
the RNN architecture has to be trained to convert an input sequence $x^{(i)}$ into a final transcription $y^{(i)}$. For notational convenience, we drop the superscripts and use $x$ to denote a chosen utterance and $y$ the corresponding label. The RNN takes as input an utterance $x$ represented by a sequence of log-spectrograms of power-normalized audio clips, calculated on 20 ms windows. As output, all the characters $l$ of a language alphabet may be emitted, in addition to the space character used to segment character sequences into word sequences (space denotes word boundaries) and a blank character useful to absorb the difference in time-series length between input and output in the CTC framework. The RNN makes a prediction $p(l_t \mid x)$ at each output time step $t$. At test time, the CTC model can be coupled with a language model trained on a large textual corpus. A specialized beam search CTC decoder [3] is used to find the transcription $y$ that maximizes:

$$Q(y) = \log(p(y \mid x)) + \alpha \log(p_{LM}(y)) + \beta\, wc(y) \qquad (1)$$
where $wc(y)$ is the number of words in the transcription $y$. The weight $\alpha$ controls the relative contributions of the language model and the CTC network. The weight $\beta$ controls the number of words in the transcription.
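To make the architecture above concrete, the following is a minimal PyTorch sketch of a Deep Speech 2-like network trained with the CTC loss. It is not the deepspeech.pytorch implementation used in the paper; the kernel sizes, channel counts, alphabet size, and the use of GRU layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """Two 2-D convolutions over (frequency, time), a stack of bidirectional
    recurrent layers, and a linear layer producing per-frame character
    log-probabilities for CTC training."""

    def __init__(self, n_features=161, n_classes=29, rnn_hidden=512, n_rnn_layers=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        # frequency axis size after the two strided convolutions
        f1 = (n_features + 2 * 20 - 41) // 2 + 1
        f2 = (f1 + 2 * 10 - 21) // 2 + 1
        self.rnn = nn.GRU(32 * f2, rnn_hidden, num_layers=n_rnn_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, spect):                      # spect: (batch, 1, freq, time)
        x = self.conv(spect)                       # (batch, 32, freq', time')
        b, c, f, t = x.size()
        x = x.view(b, c * f, t).transpose(1, 2)    # (batch, time', features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)      # (batch, time', n_classes)

# Usage sketch with dummy data: spectrogram batch -> CTC loss against characters.
model = DeepSpeech2Like()
spect = torch.randn(4, 1, 161, 300)                # (batch, channel, freq bins, frames)
log_probs = model(spect)                           # (batch, T', n_classes)
targets = torch.randint(1, 29, (4, 50))            # dummy character indices (0 = blank)
input_lengths = torch.full((4,), log_probs.size(1), dtype=torch.long)
target_lengths = torch.full((4,), 50, dtype=torch.long)
loss = nn.CTCLoss(blank=0, zero_infinity=True)(
    log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```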
4.2 Experimental Results
Experiments were made on the legacy distribution of the TED-LIUM 3 corpus in order to evaluate the impact on WER of training data size for an end-to-end speech recognition system inspired by Deep Speech 2. In these experiments, we used an open source Pytorch implementation4 .
Fig. 1. Deep Speech 2 -like end-to-end architecture for speech recognition. 4
https://github.com/SeanNaren/deepspeech.pytorch.
Three training datasets were used: TED-LIUM 2 with original alignment (207 h of speech), TED-LIUM 2 with new alignment (290 h), and TED-LIUM 3 (452 h), as presented in Sects. 2.1 and 2.2. They correspond to the three possible abscissa values (207, 290, 452) in Fig. 2. For each training dataset, the ASR tuning and the evaluation were respectively made on the TED-LIUM release 2 development and test dataset, similar to the experiments presented in Sect. 3.3. Figure 2 presents results in both WER (left side), and Character Error Rate (CER, right side) on the test dataset. Evaluation in CER is interesting because the end-to-end ASR system is trained to produce sequences of characters, instead of sequences of words.
Fig. 2. Word error rate (left) and character error rate (right) on the TED-LIUM 3 legacy test data for three end-to-end configurations according to the training data size. (Color figure online)
For each training dataset, three configurations have been tested:
– the Greedy configuration, in blue in Fig. 2, which consists of evaluating sequences of characters directly emitted by the neural network by gluing all the characters together (including spaces to delimit words); a decoding sketch follows this list;
– the Greedy+augmentation configuration, in red, which is similar to the Greedy one, but in which each training audio sample is randomly perturbed in gain and tempo at each iteration [4];
– the Beam+augmentation configuration, in brown, achieved by applying a language model through beam search decoding on top of the neural network hypotheses of the Greedy+augmentation configuration. This language model is the cantab-TEDLIUM-pruned.lm3 provided with the Kaldi TEDLIUM recipe.
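The Greedy (best-path) decoding referred to in the first item can be sketched as follows; the toy alphabet is an illustrative assumption.

```python
import numpy as np

def ctc_greedy_decode(log_probs, alphabet, blank=0):
    """Best-path ("greedy") CTC decoding: take the arg-max symbol at every
    frame, merge consecutive repeats, then drop the blank symbol."""
    best = np.argmax(log_probs, axis=-1)      # (T,) frame-wise arg-max indices
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)

# Example with a toy alphabet; index 0 is the CTC blank.
alphabet = ["<blank>", " ", "a", "b", "c"]
frame_log_probs = np.log(np.random.dirichlet(np.ones(len(alphabet)), size=20))
print(ctc_greedy_decode(frame_log_probs, alphabet))
```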
As expected, the best results in WER and CER are achieved by the Beam+augmentation configuration, with a WER of 13.7% and a CER of 6.1%. Regardless of the configuration, increasing the training data size significantly improves the transcription quality: for instance, while the Greedy mode reached a WER of 28.1% with the original TED-LIUM 2 data, it reaches 20.3% with TED-LIUM 3. We can also observe that the Greedy+augmentation configuration trained on TED-LIUM 3 obtains a lower WER than the Beam+augmentation configuration trained on the original TED-LIUM 2 data. This shows that increasing the training data size for the pure end-to-end architecture offers a higher potential for WER reduction than using an external language model in a beam search decoding.
5 Experiments with the Speaker Adaptation Distribution
In this section, we present results of speaker adaptation experiments on the adaptation version of the corpus described in Sect. 2.3. In this series of experiments, we trained three pairs of AMs. In each pair, we trained a speaker-independent (SI) AM and a corresponding speaker adaptive trained (SAT) AM. We explore two standard adaptation techniques: (1) i-vectors for a TDNN-LSTM and (2) feature-space maximum likelihood linear regression (fMLLR) for a GMM-HMM and a TDNN-LSTM. The Kaldi toolkit [8] was used for these experiments. First, we trained two GMM-HMM AMs on 39-dimensional MFCC-39 features (13-dimensional Mel-frequency cepstral coefficients (MFCCs) with Δ and ΔΔ): (1) a SI AM and (2) a SAT model with fMLLR. Then, we trained four TDNN-LSTM AMs. All TDNN-LSTM AMs have the same topology, described in [6], and differ only in the input features. They were trained using the LF-MMI criterion [9] and a 3-fold reduced frame rate. For the first SI TDNN-LSTM AM, 40-dimensional MFCCs without cepstral truncation (hires MFCC-40) were used as the input to the neural network. For the corresponding SAT model, i-vectors were used (as in the standard Kaldi recipe). For the second SI TDNN-LSTM AM, MFCC-39 features (the same as for the GMM-HMM) were used, and the corresponding SAT model was trained using fMLLR adaptation. The 4-gram pruned LM was used for the evaluation5. Results in terms of WER are presented in Table 7.

Table 7. Speaker adaptation results on the speaker adaptation distribution described in Sect. 2.3 (MFCC-39 denotes 13-dimensional MFCCs appended with Δ and ΔΔ; hires MFCC-40 denotes 40-dimensional MFCCs without cepstral truncation).

Model            Features                    WER, % - Dev.   WER, % - Test
GMM SI           MFCC-39                     20.69           18.02
GMM SAT          MFCC-39 - fMLLR             16.47           15.08
TDNN-LSTM SI     hires MFCC-40               7.69            7.25
TDNN-LSTM SAT    hires MFCC-40 ⊕ i-vectors   7.12            7.10
TDNN-LSTM SI     MFCC-39                     8.19            7.54
TDNN-LSTM SAT    MFCC-39 - fMLLR             7.68            7.34
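As a reminder of what fMLLR amounts to at decoding time, the sketch below applies a per-speaker affine transform to the features. The estimation of the transform from adaptation data (EM over the model statistics) is not shown, and the [A | b] matrix layout is stated here as an assumption.

```python
import numpy as np

def apply_fmllr(features, transform):
    """Apply a per-speaker fMLLR transform to a (frames x dim) feature matrix.
    `transform` is a (dim x (dim + 1)) affine matrix [A | b], so every frame
    x is mapped to A @ x + b before being passed to the acoustic model."""
    A = transform[:, :-1]
    b = transform[:, -1]
    return features @ A.T + b

# Example: 100 frames of 39-dimensional MFCCs, identity transform (no change).
feats = np.random.randn(100, 39)
W = np.hstack([np.eye(39), np.zeros((39, 1))])
assert np.allclose(apply_fmllr(feats, W), feats)
```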
6 Discussion and Conclusion
In this paper, we proposed a new release of the TED-LIUM corpus, which doubles the quantity of audio with aligned text for acoustic model training. We showed that increasing this training data reduces the word error rate obtained by a state-of-the-art HMM-based ASR system only very slightly, passing from 6.8% (release 2) to 6.7% (release 3) on the legacy test data (and from 6.8% to 6.2% on the legacy dev data). To measure the recent advances realized in ASR technology, this word error rate can be compared to the 11.1% reached by such a state-of-the-art system in 2014 [10]. We were also interested in the emergent neural end-to-end ASR technology, known to be very data-hungry. We noticed that without external knowledge, i.e. by using only aligned audio from TED-LIUM 3, such technology reaches a WER of 17.4%, which is exactly the WER reached by state-of-the-art ASR technology in 2012 with the TED-LIUM 1 training data. Assisted by a classical 3-gram language model used in a beam search on top of the end-to-end architecture, this WER decreases to 13.7% with the TED-LIUM 3 training data, while with the TED-LIUM 2 training data the same system reached a WER of 20.3%. Increasing the amount of training audio with aligned text still seems very important for this kind of ASR architecture, in comparison to the HMM-based ASR architecture, which reaches a plateau on such TED data with a low WER of 6.7%. Finally, we propose a new data distribution dedicated to experiments on speaker adaptation, and report results that can be considered as a baseline for future work.

Acknowledgments. This work was partially funded by the French ANR Agency through the CHIST-ERA M2CR project, under the contract number ANR-15-CHR20006-01, and by the Google Digital News Innovation Fund through the news.bridge project.
References
1. Amodei, D., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin. In: International Conference on Machine Learning, pp. 173–182 (2016)
2. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006)
5
This LM is similar to the “small” LM trained with the pocolm toolkit, which is used in the Kaldi tedlium s5 r2 recipe. The only difference is that we modified a training set by adding text data from TED-LIUM 3 and removing from it those data, that present in our test and development sets (from the adaptation corpus).
3. Hannun, A.Y., Maas, A.L., Jurafsky, D., Ng, A.Y.: First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873 (2014) 4. Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for speech recognition. In: INTERSPEECH (2015) 5. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: INTERSPEECH (2015) 6. Peddinti, V., Wang, Y., Povey, D., Khudanpur, S.: Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 25(3), 373–377 (2018) 7. Povey, D., et al.: Semi-orthogonal low-rank matrix factorization for deep neural networks. In: INTERSPEECH (2018, submitted) 8. Povey, D., et al.: The Kaldi speech recognition toolkit. In: ASRU. IEEE Signal Processing Society, December 2011 9. Povey, D., et al.: Purely sequence-trained neural networks for ASR based on latticefree MMI. In: INTERSPEECH (2016) 10. Rousseau, A., Del´eglise, P., Est`eve, Y.: TED-LIUM: an automatic speech recognition dedicated corpus. In: LREC, pp. 125–129 (2012) 11. Rousseau, A., Del´eglise, P., Est`eve, Y.: Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In: LREC, pp. 3935– 3939 (2014) 12. Sak, H., Senior, A.W., Beaufays, F.: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: INTERSPEECH (2014) 13. Wang, C., Yogatama, D., Coates, A., Han, T., Hannun, A., Xiao, B.: Lookahead convolution layer for unidirectional recurrent neural networks. In: ICLR 2016 (2016) 14. Xu, H., et al.: A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition. In: ICASSP (2017) 15. Xu, H., et al.: Neural network language modeling with letter-based features and importance sampling. In: ICASSP (2017)
LipsID Using 3D Convolutional Neural Networks
Miroslav Hlaváč, Ivan Gruber, Miloš Železný, and Alexey Karpov
Faculty of Applied Sciences, Department of Cybernetics, UWB, Pilsen, Czech Republic; Faculty of Applied Sciences, NTIS, UWB, Pilsen, Czech Republic; ITMO University, St. Petersburg, Russia; SPIIRAS, St. Petersburg, Russia
{mhlavac,zelezny}@kky.zcu.cz, [email protected]
Abstract. This paper proposes a method, inspired by i-vectors, for improving visual speech recognition in a way similar to how i-vectors are used to improve the recognition rate of audio speech recognition. A neural network for feature extraction is presented together with its training parameters and evaluation. The network is trained as a classifier for a closed set of 64 speakers from the UWB-HSCAVC dataset, and the last softmax fully connected layer is then removed to obtain a feature vector of size 256. The network is provided with sequences of 15 frames and outputs a softmax classification over 64 classes. The training data consist of approximately 20000 sequences of grayscale images from the first 50 sentences, which are common to every speaker. The network is then evaluated on 60000 sequences created from the 150 remaining sentences of each speaker, which differ from speaker to speaker.
Keywords: Visual speech · Neural network · 3D convolution · Deep features

1 Introduction
The field of visual speech recognition lags behind the field of audio speech recognition in terms of recognition accuracy. Current methods [5,6] usually employ end-to-end systems based on neural networks. The networks use joint learning of audio and video inputs to gain as much information as possible and improve the recognition rate. The learning process is based on an analysis of video sequences employing either Long Short-Term Memory (LSTM) [8] or 3D convolutions [10] to learn the dynamic features of visual speech. Connectionist temporal classification [9] is then used as the output and loss function of the neural network [2].
These networks achieve accuracy of around 60% for the visual speech recognition on an open set of words. There is still a lot of space for improvement of these results by providing additional information in the training process. In the field of audio speech recognition, various methods are utilized to improve the results of the automatic speech recognition(ASR) algorithms. One of the methods named iVectors [11] originally developed for speaker identification proved to be a useful additional information source for adaptation of the audio speech recognition to different speakers. The problem of classification is well documented in neural networks and thus our idea is to adapt the iVectors method to provide additional information about the speaker to improve the recognition rate of visual speech. This paper is proposing a method for obtaining the deep features from the input sequences of visual speech by employing a neural network composed of 3D Convolutional layers trained for the task of classification of the speakers. We have named our method LipsID because it is based on speaker identification based on lips images and the networks are trained in the task of classification. The paper is organized as follows: in Sect. 2 we introduce used dataset and processing method; in Sect. 3 we describe the experiment, specify implementation details, show obtained results and compare them with a chosen baseline approach; and finally in Sect. 4 we make a conclusion and outline our future research.
2 Methods and Datasets
In this section, we present the data used for training our neural networks and also the background for sequence analysis with neural networks. The first part discusses the dataset we have used to create the training data for our experiments and describes exactly how they were created. The second part is focused on the tested sequence analysis approaches.

2.1 UWB-HSCAVC Dataset
The UWB-HSCAVC [7] dataset was created at the University of West Bohemia in Pilsen to provide a speech recognition dataset for the Czech language. It contains both visual and audio data of 100 different speakers (39 males, 61 females). It was recorded in a laboratory environment under controlled light conditions. Each speaker recorded 200 sentences, with 50 common to everyone, while the remaining 150 differ from speaker to speaker. A clapperboard was used as a synchronization mechanism for audio and video. The sentences were chosen with care to provide an equal representation of the phonemes occurring in the Czech language. The videos are recorded at a resolution of 720 × 576 with 25 fps. The dataset is also preprocessed by providing manual speech transcriptions, speaker head detection, and lip-corner position detection, and it provides skin texture samples for the regions of the nose and cheeks. A sample texture of both eyes of every speaker is also provided (Fig. 1).
Fig. 1. Recording conditions of the UWB-HSCAVC dataset [7].
2.2 Training Data
The training data were created from the UWB-HSCAVC dataset as follows. Only the data of 64 speakers were available. The videos were first processed with the Chehra [3] tracker to detect facial keypoints. Then the regions around the lip keypoints were extracted and processed to provide an image 40 × 60 pixels in size. The order of the frames in each sentence was preserved to provide a suitable source for the creation of visual speech sequences. We chose the length of a sequence to be 15 frames. This number was chosen based on the size of the data and the hardware available for training the networks. The sentences were then cut into sequences of the chosen length without overlap. The training dataset was created from the first fifty sentences, which are common to each speaker; this produced 20740 training sequences for 64 speakers. The remaining 150 sentences from each speaker were then used as testing data, producing a testing set of 61709 sequences. All of the images were converted to grayscale for the purpose of this work. An example of the training data is shown in Fig. 2. For the initial experiments, the frames were randomly shuffled and provided with one-hot labels per frame. Then, for sequence classification, the sequences were shuffled and provided with a single one-hot label for the whole sequence.
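A minimal sketch of the sequence-building step described above (frame extraction and lip tracking are assumed to have happened already; the label handling is an illustrative assumption):

```python
import numpy as np

def make_sequences(frames, speaker_id, seq_len=15):
    """Cut the ordered lip-region frames of one sentence (each a 40x60
    grayscale array) into non-overlapping sequences of `seq_len` frames,
    each labelled with the speaker identity."""
    sequences, labels = [], []
    for start in range(0, len(frames) - seq_len + 1, seq_len):
        sequences.append(np.stack(frames[start:start + seq_len]))
        labels.append(speaker_id)
    return np.asarray(sequences), np.asarray(labels)

# Example: an 80-frame sentence yields 5 sequences of 15 frames (5 frames unused).
frames = [np.zeros((40, 60), dtype=np.uint8) for _ in range(80)]
X, y = make_sequences(frames, speaker_id=3)
print(X.shape)   # (5, 15, 40, 60)
```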
2.3 Sequence Analysis
We have chosen two approaches to analyze the speaker data from the input sequences. At first, we tried to create a neural network based on LSTM [8], but we were unable to find a topology that gave good results; we would like to return to this problem in the future. After that, we created several testing topologies based on 3D convolutions [10], which after some adjustments provided the results discussed in the next section. These two approaches were mainly selected because current visual speech recognition systems also use this type of sequence analysis, which should make it easier to integrate our speaker adaptation into these systems.

Fig. 2. Example of the data used for training the networks.
3 Experiments
The experiments are composed of initial tests with single image classification on a closed set of speakers as our baseline approach, and of sequence classification using 3D convolutions, also on a closed set of speakers. The experiments were programmed in the Keras [4] neural network framework with a Tensorflow [1] backend in version 1.7.

3.1 Single Image Classification
The initial experiments were designed with 2D convolutions to test the recognition rate on the source dataset UWB-HSCAVC [7]. The training data were composed of 83327 images and the testing data of 245398 images. The experiment involved a VGG-like [12] CNN in the task of per-frame classification of the speakers included in the dataset. The neural network was trained for 15 epochs with mini-batch size 32 and an initial learning rate of 0.01. The recognition rate reached 99.1% on the test data. The topology is described in Table 1, where DO means dropout, FC means fully connected layer, and ReLU activation functions are applied if not specified otherwise. We used standard SGD as the optimizer with momentum = 0.9, weight decay = 1e−6, and categorical cross-entropy loss. The strides of the convolutional layers were set to one, while the strides of the maxpooling layers were set to two.

3.2 Sequence Classification
To further improve the recognition, we redesigned the topology with 3D convolutions [10]. The network takes sequences of frames as input and produces a single speaker classification for the whole sequence. The last but one fully connected layer produces a feature vector of size 256. This vector will serve as the LipsID features in our further experiments. The network was trained with the SGD optimizer (with the same parameters as in the single image classification) and the categorical cross-entropy loss. The neural network was again trained for 15 epochs with mini-batch size 32 and an initial learning rate of 0.01. The training finished with a 99.98% recognition rate on the training data and a 99.29% recognition rate on the test data. This is a significant improvement over the single image LipsID classification: we decrease the recognition error by 0.19%, which is a relative decrease of 21%. The stride of the 3D convolutions is set to one in every dimension and the stride of the 3D maxpooling is set to two. FC means fully connected layer (Table 2).

Table 1. LipsID - single image topology (each column is one block; blocks are applied left to right).

Block 1                Block 2                Block 3                Classifier
Conv2D(64, 3×3)        Conv2D(128, 3×3)       Conv2D(256, 3×3)       FC(4096)
Conv2D(64, 3×3)        Conv2D(128, 3×3)       Conv2D(256, 3×3)       Dropout(0.5)
Batch normalization    Batch normalization    Batch normalization    FC(4096)
Maxpooling(2×2)        Maxpooling(2×2)        Maxpooling(2×2)        FC(64, softmax)

Table 2. LipsID - sequences topology (each column is one block; blocks are applied left to right).

Block 1                    Block 2                    Block 3                     Classifier
Conv3D(32, 3×3×3, ReLU)    Conv3D(64, 3×3×3, ReLU)    Conv3D(128, 3×3×3, ReLU)    FC(256)
Conv3D(32, 3×3×3, ReLU)    Conv3D(64, 3×3×3, ReLU)    Conv3D(128, 3×3×3, ReLU)    FC(64, softmax)
Batch normalization        Batch normalization        Batch normalization
MaxPool3D(3×3×2)           MaxPool3D(3×3×2)
Conclusion and Future Work
This paper has presented a method for producing LipsID feature vector from sequences of visual speech. The method was tested on our own dataset UWBHSCAVC and produced good results in speaker classification based on lips only images. The sequence classification shows improvement over single frame classification. With the usage of sequence classification instead of single image classification, we reached relative decrease of recognition error by 21%. In the future, we would like to add training data from other datasets and also data captured in different light conditions. Then we will implement this method to existing visual speech recognition systems to assess the contribution of the LipsID features to visual speech recognition accuracy. We also would like to test LipsID detection with LSTM networks with which we hopefully reach similar or even better results. Acknowledgments. This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017. The work has been also supported by the grant of the University of West Bohemia, project No. SGS-2016-039. This work was supported by the Government of the Russian Federation (grant No. 08-08) and the Russian
214
M. Hlav´ aˇc et al.
Foundation for Basic Research (project No. 18-07-01407) too. Moreover, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.
References 1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). http://tensorflow.org/. software available from tensorflow.org 2. Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: Lipnet: Sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016) 3. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Incremental face alignment in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1859–1866 (2014) 4. Chollet, F., et al.: Keras: Deep learning library for theano and tensorflow, vol. 7, p. 8 (2015). https://keras.io/k 5. Chung, J.S., Senior, A.W., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. CoRR abs/1611.05358 (2016). http://arxiv.org/abs/1611.05358 6. Chung, J., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (2016) ˇ 7. C´ısaˇr, P., Zelezn` y, M., Krˇ noul, Z., Kanis, J., Zelinka, J., M¨ uller, L.: Design and recording of czech speech corpus for audio-visual continuous speech recognition. In: Proceedings of the Auditory-Visual Speech Processing International Conference 2005 (2005) 8. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999) 9. Graves, A., Fern´ andez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006) 10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013) 11. Saon, G., Soltau, H., Nahamoo, D., Picheny, M.: Speaker adaptation of neural network acoustic models using i-vectors. In: ASRU, pp. 55–59 (2013) 12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
From Kratzenstein to the Soviet Vocoder: Some Results of a Historic Research Project in Speech Technology R¨ udiger Hoffmann(B) , Peter Birkholz, Falk Gabriel, and Rainer J¨ ackel Institut f¨ ur Akustik und Sprachkommunikation, Technische Universit¨ at Dresden, Dresden, Germany {ruediger.hoffmann,peter.birkholz,falk.gabriel, rainer.jaeckel}@tu-dresden.de
Abstract. This paper demonstrates by means of an example, how historic collections of universities can be utilized in modern research and teaching. The project refers to the Historic Acoustic-phonetic Collection (HAPS) of the TU Dresden. Two “guiding fossils” from the history of speech technology are selected to present a selection of results. Keywords: History of speech communication research Mechanical speech synthesis · Vocoder
1
Introduction
Experimental phonetics and speech technology show continuing interest in their own history. Prominent examples in the literature date back to PanconcelliCalzia [1], Dudley and Tarnoczy [2], and Ohala et al. [3], followed by numerous other papers and the foundation of the Special Interest Group on the History of Speech Communication Sciences of the ISCA and IPA in 2011. The literature is supported and complemented by collections of historic items not only in scientific museums, but also in the different historic collections of the universities. University collections are scientifically important, but endangered, because they are no “real” museums. The best way to take care of a collection is to include it in the processes of research and teaching at the university. It was the aim of a call for proposals of the German Federal Ministry of Education and Research (BMBF) in 2015, to support the collections in this sometimes difficult process [4]. The TU Dresden was successful with the proposal “Sprechmaschine” (speaking machine), which aimed to investigate the exhibits on the history of speech synthesis in their Historic Acoustic-phonetic Collection (HAPS). Five research groups from the TU Dresden and one of the State Art Collections Dresden (as external partner) cooperate in the project. In this paper, we merely present two partial aspects of the ongoing research: the study of Kratzenstein’s “vowel organ” as the starting point of the mechanical speech synthesis (Sect. 3), and the investigation of the history of the vocoder as guiding fossil of the electronic speech synthesis (Sect. 4). c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 215–225, 2018. https://doi.org/10.1007/978-3-319-99579-3_23
216
2
R. Hoffmann et al.
History of the HAPS Collection
Research in electronic speech processing started at the TU Dresden with the development of a vocoder in the 1950s. Walter Tscheschner (1927–2004) started the work in speech synthesis and recognition, which continues until today. Many devices, which were developed during this long time span, were preserved and form the core of the historic collection. There was a close cooperation with the Institute of Phonetics of the Humboldt University in Berlin, which had its origin in the laboratory of the renowned speech therapists Hermann Gutzmann and Franz Wethlo. Dieter Mehnert, the last phonetician on this chair, collected numerous items which demonstrated the history of experimental phonetics in Berlin and at other places. When the institute was closed in 1996, this collection was transferred to Dresden. In this way, the development of experimental phonetics as well as electronic speech technology could be demonstrated as a whole. The fusion of the collections was completed in 1999, when the name HAPS was introduced. The most important place in the development of experimental phonetics in Germany was the Phonetic Laboratory of Giulio Panconcelli-Calzia (1878– 1966) in Hamburg, founded 1910 at the “Colonial Institute”, since 1919 at the Hamburg University. When the successional Institute of Phonetics was closed in 2006, the very important collection of devices from the era of Panconcelli-Calzia in Hamburg was united with the HAPS in Dresden, which is a really important special collection since that time [5]. The exploitation of the HAPS started with cataloging the exhibits from the field of experimental phonetics [6]. A second catalogue volume is planned with the title “Historic devices of speech acoustics”. The BMBF project “Sprechmaschine” (Speaking machine) requires the development of those parts of this catalogue, which are focused on the exhibits from the field of synthetic speech. The history of experimental phonetics starts at the end of the 19th century with the development of the colonial system. However, there are predecessors like the automata constructors of the late Baroque (Kratzenstein, Kempelen, Mical) and the great physiologists of the 19th century (M¨ uller, Ludwig, Helmholtz). Of course, the HAPS is not able to demonstrate original items from these periods, but there are some useful and rare replicas. In the following section, we will focus on one of them: Kratzenstein’s vowel organ.
3 3.1
Kratzenstein’s Vowel Organ – Guiding Fossil of Mechanical Speech Synthesis Kratzenstein’s Revival
Christian Gottlieb Kratzenstein (1723–1795) was the first, who experimentally demonstrated the source-filter theory of speech production. In his “vowel organ”, which he presented at the occasion of a contest of the Imperial Academy of Sciences in St. Petersburg in 1780, he applied a reed pipe as source and different
From Kratzenstein to the Soviet vocoder
217
Fig. 1. Left: Replicas of Kratzenstein’s vowel resonators, designed and manufactured by C. Korpiun. – Right: Replicas of the vowel resonators from Chiba und Kajiyama, c TU Dresden, HAPS. designed and manufactured by T. Arai. Photographs
resonators for the basic vowels as filters. Because the shape of the resonators was found in an empirical way, later scientists did not value his invention. This situation lasted until 2006 (!), when the German linguist Christian Korpiun (1948–2017) proved, that there is enough information in the work of Kratzenstein to make real replicas of his resonators (Fig. 1 left). The replicas are now in the HAPS as a gift of C. Korpiun. Furthermore, Korpiun published a commented German translation of Kratzenstein’s Tentamen [7], which will be complemented by an English version as soon as possible. In the succession of Kratzenstein’s source-filter idea, several improvements have been developed: – With regard to the source, the reed pipes have for example been replaced by arrangements similar to the vocal cords. The most successful attempt was the cushion pipe of Wethlo [8], published in 1913, which is contained as an original in the HAPS. – With regard to the filter, the empirically defined shapes of the vowel resonators have, for example, been replaced by straight tube models, which represent the human articulation tract more precisely. The measurement-based models of Chiba und Kajiyama [9] from 1941 formed the starting point of the contemporary acoustic phonetics. The HAPS owns a replica of these resonators as a gift of T. Arai (Fig. 1 right) [10]. Today, the models for the source and for the vocal tract can be improved even further by new measurement methods and/or new materials. The following subsections briefly sketch, how this is performed in the framework of the project “Sprechmaschine”. 3.2
Vocal Fold Models Using Modern Materials
One goal of the project “Sprechmaschine” is the development of synthetic physical vocal fold models with characteristics as similar as possible to human vocal folds. Human vocal folds have a layered structure: The outermost layer is a thin
218
R. Hoffmann et al.
Fig. 2. (a) 3D view of a vocal fold and its casing; (b) schematic view of the layered structure in the coronal plane; (c) screwed casing and its negative; (d) oblique view of a finished vocal fold model.
skin with a thickness of 0.05–0.1 mm (the epithelum), and the innermost layer is the vocalis muscle. Between the epithelium and the muscle is the lamina propria, a soft, water-like system of nonmuscular tissue. The challenge in the creation of synthetic vocal folds is the reproduction of this layered structure with appropriate materials such that the oscillations of the synthetic vocal folds become similar to those of real vocal folds. In our ongoing study, we use two-composite silicon with different amounts of added silicone oil to recreate the different physical properties of the three layers [11,12]. Figure 2a and b show the general layered geometry of our vocal fold models. The outer shape is based on the geometry by Scherer [13]. The production of the vocal folds is based on 3D-printed casings and moulds (Fig. 2c and d), somewhat similar to [14]. Recently, we investigated the acoustics and vibration patterns of different models to examine the dependencies between the behavior and the geometrical and mechanical properties of the vocal folds. To this end, the vocal fold thickness, the angle of the conus elasticus, and the stiffness of the vocalis muscle were varied. As an example, a “soft” and “hard” vocalis muscle was used and the thickness was varied between 2 mm (T2), 3 mm (T3), and 4 mm (T4) at a constant 40◦ angle of the conus elasticus, resulting in six different synthetic models (two stiffness values × 3 thickness values). There exist some characteristic parameters of the glottal area function (glottal area as a function of time during an oscillation cycle) to describe the vocal fold vibration pattern; cf. [15]. The glottal area function was measured by a high-speed camera during the vibration of the models. The parameters of the glottal area function were extracted and examined using appropriate software tools. As an example, Fig. 3a shows the maximum of the glottal area during an oscillation cycle as a function of the subglottal pressure. A typical acoustical property to make assertions of the voice is the difference between the first and second harmonic in the spectrum of the source signal; cf. [16]. A microphone in a distance of 20 cm a little diagonal to the vocal fold models measured the pressure variations and so the voice of the models. The harmonics were extracted with the tool “VoiceSauce”, and H1-H2 as a function of subglottal pressure is shown in Fig. 3b for the six models. Eventually, measurements like the examples in Fig. 3 need to be compared to real phonation to assess the suitability of certain synthetic model geometries and materials.
From Kratzenstein to the Soviet vocoder
219
Fig. 3. (a) Maximum glottal area of six different vocal fold models. – (b) Difference between the first and second harmonic of the voice spectra of the models.
3.3
Towards a Database of Physical Vocal Tract Models with Realistic Geometries
With recent advances of Magnetic Resonance Imaging (MRI), 3D scanning, and 3D-printing technology, it is now possible to create (static) physical models of the vocal tract with very realistic geometries. Such models have gained increasing interest as research tools in speech science and can be created as follows (also see Fig. 4): First, MRI is used to capture the complete 3D shape of the vocal tract of the speech sound(s) of interest in high detail. Because the scanning takes a few seconds per sound, only sustainable sounds can be captured in 3D (e.g. [17]). Furthermore, because teeth are not visible in MRI data, plaster models of the subject’s teeth must be made and scanned using a 3D scanner. The wireframe models of the teeth are then merged with the MRI data ([18]). Based on the merged data set, the vocal tract is segmented in terms of a triangle mesh that represents the inner vocal tract walls, using freely available software tools (e.g., ITK-SNAP [19]). This surface mesh is then extruded to obtain a vocal tract model that has a certain wall thickness and can be printed as a physical 3D object (Fig. 4). Compared to the vocal tracts of living humans, the 3D-printed counterparts have the main advantage that both their acoustic and aerodynamic properties can be precisely measured. For example, a method to measure the volume velocity transfer function between the glottis and the lips for such models was recently presented by Fleischer et al. [18]. The 3D-printed vocal tract models can be used for research in multiple ways: – The physical models, along with their measured transfer functions, can be used to validate computational models that simulate vocal tract acoustics in one, two, or three dimensions. For example, for a one-dimensional acoustic simulation based on a 2D or 3D vocal tract shape, the vocal tract area function needs to be estimated. Multiple methods have been proposed for this purpose, e.g. [20–22]. However, so far it is not clear, which of these methods generates
220
R. Hoffmann et al.
Fig. 4. Processing steps to obtain a 3D-printable physical model of the vocal tract from volumetric MRI data and 3D scans of plaster models of the upper and lower teeth.
–
–
Given this range of applications, we are currently preparing a database that contains the detailed MRI-based 3D geometries of the vocal tract (with inserted teeth) for 22 sustained German speech sounds, uttered by one male and one female speaker each. All of these 44 vocal tract shapes are printed using a 3D printer (type Ultimaker 3). For each vocal tract shape, the volume velocity transfer function is measured according to the method by Fleischer et al. [18], and the
radiated noise spectra are determined for a range of stationary airflows injected through the glottis. The transfer functions and the noise spectra will be provided in the database along with the corresponding 3D geometries and the files needed for 3D-printing the models.
4 The Vocoder – Guiding Fossil of Electronic Speech Synthesis

4.1 Problems of the History of the Early Vocoders
The vocoder was invented for bandwidth reduction in voice transmission at a time when its implementation was still very complex, so that its use was limited to a few cases. However, it has provided many new insights into the analysis and synthesis of the speech signal, making it the most important guiding fossil of electronic speech technology today. For this reason, the vocoder also plays an important role in our study on the history of speech synthesis [26]. Since the vocoder was also used in security-relevant applications, there are still gaps in the presentation of its history. Even with regard to Germany, where the first patent for an apparatus similar to the later American vocoder was granted, these gaps are only partially closed [27,28]. This finding applies in particular to the development in the Soviet Union, which outside the Russian-language literature has so far been perceived mainly through its fictional treatment in the novel "In the First Circle" (1968, uncut edition 1978 [29]) by Aleksandr I. Solzhenitsyn (1918–2008). After studying mathematics and physics, Solzhenitsyn was drafted into war service and from 1943 served as the commander of a sound-ranging battery. In 1945, he was sentenced to eight years in a detention camp for criticizing Stalin. He spent the period from 1948 to 1950 at a secret telecommunications institute in Marfino near Moscow. He described this time in his aforementioned novel, which also contains some details about the work carried out in Marfino on speech analysis and speech coding. This description served as the main source of information for the statements on the history of the Soviet vocoder in the monographs on the development of speech technology by Schroeder [30] and on the history of the vocoder by Tompkins [31]. After the end of the Cold War, the accessibility of many documents in the former Soviet Union improved, and some of the scientists involved have published their memoirs. That this material is still hardly known is probably due to the language barrier. We have therefore set ourselves the task of gaining a better overview in the context of a literature study, and we report here on the status achieved so far. Most important were the biographic notes on Kotel'nikov [32] and the history of the Marfino laboratory by one of its leading engineers, Kalachev [33], which in turn led us to numerous papers in Soviet journals of that time.
4.2 A Literature Review on the Soviet Vocoder
For our project, we studied a large number of Russian documents that are little known outside the former Soviet Union. A snapshot of this work was published in [34] and may be summarized in the following theses:
– The famous mathematician Vladimir A. Kotel'nikov (1908–2005), who first formulated the sampling theorem in an engineering context at the age of just 25, worked on various telecommunications projects, for which he developed solutions to the associated encryption tasks. The work on the encryption apparatus Sobol-P led him to a parametric speech coding approach analogous to the vocoder. In his memoirs, Kotel'nikov notes that in late 1940 he learned of the article by H. Dudley on the vocoder, which confirmed his approach. At the beginning of 1941, the first vocoder in the USSR began to work in his laboratory [32].
– At the same time, the acoustician Lev L. Myasnikov (1905–1972) worked in Leningrad. He is considered the inventor of the first "objective" recognition of speech sounds in 1937 [35]. He completed his habilitation on technical phonetics in 1942. A patent filed in 1940 describes a parallel filter bank of the kind that is also suitable for the analysis part of a vocoder.
– From 1943, a working group of the Ministry of State Security (MGB) under the direction of Andrey P. Peterson (*1915) dealt with the improvement of encryption technology. Following a memorandum, the above-mentioned laboratory in Marfino was founded in 1948; the most important specialists in telecommunications and cryptology were to be brought together in one facility. From the beginning, vocoder technology played the most important role. A first version of the vocoder-based encryption system M-803 was tested on the communication line Moscow–Kiev in November 1949, but with insufficient signal quality. As a remedy, A. P. Peterson proposed a new approach that integrated the concept of "clipped speech" and that of the vocoder. In April 1950, the improved speech encryption system M-803 was approved by an evaluation committee that included Kotel'nikov [33].
– The development of the Marfino vocoder yielded three remarkable results:
• The modification of the vocoder in which a part of the speech signal was left in the time domain, while the signal energy in the frequency bands was transmitted parametrically, was later known as the semi-vocoder or voice-excited vocoder [36], which accordingly was invented in Marfino.
• For the further improvement of the voice quality, several suggestions were examined in 1950/51, among them the variant M-803M by Anton M. Vassilyev (1899–1965). If we interpret his proposal correctly, the principle of the formant vocoder was suggested here in a form similar to that described by Munson and Montgomery in 1950 [37].
• Part of the assessment of the transmission system is that it was probably the world's first system for the digital transmission of encrypted vocoder signals.
– From the mid-1950s, open publications on speech compression and vocoder applications appeared, e.g. the remarkable textbook [38].
5 Conclusion
This paper describes selected results from the authors' work in the project "Sprechmaschine". Finally, it should be mentioned that other project groups (from linguistics, design, and computer science) are working on additional parts of the project, which will result in a "virtual collection" of typical instruments and devices from the history of synthetic speech.
Acknowledgments. Supported by the German Federal Ministry of Education and Research (BMBF) in the project "Sprechmaschine", FKZ 01UQ1601A.
References
1. Panconcelli-Calzia, G.: Geschichtszahlen der Phonetik (1941)/Quellenatlas der Phonetik (1940), New edition by K. Koerner. Benjamins, Amsterdam (1994)
2. Dudley, H., Tarnoczy, T.H.: The speaking machine of Wolfgang von Kempelen. JASA 22(2), 151–166 (1950)
3. Ohala, J.J. (ed.): A Guide to the History of the Phonetic Sciences in the United States. University of California, Berkeley (1999)
4. Bekanntmachung von Förderrichtlinien "Vernetzen - Erschließen - Forschen. Allianz für universitäre Sammlungen" (2015). BMBF Homepage https://www.bmbf.de/foerderungen/bekanntmachung-1029.html. Accessed 22 Apr 2018
5. Hoffmann, R., Mehnert, D.: Early experimental phonetics in Germany - historic traces in the collection of the TU Dresden. In: Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS 2007), Saarbrücken, pp. 881–884 (2007)
6. Mehnert, D.: Historische phonetische Geräte. Katalog der historischen akustisch-phonetischen Sammlung der TU Dresden, 1. Teil. TUDpress, Dresden (2012)
7. Kratzenstein, C.G.: Tentamen resolvendi problema, Petersburg 1781. Übersetzt und kommentiert von Christian Korpiun. TUDpress, Dresden (2016)
8. Wethlo, F.: Versuche mit Polsterpfeifen. Passow-Schaefers Beiträge für die gesamte Physiologie 6(3), 268–280 (1913)
9. Chiba, T., Kajiyama, M.: The Vowel: Its Nature and Structure. Tokyo-Kaiseikan Pub. Co., Tokyo (1941)
10. Arai, T.: Education in acoustics and speech science using vocal-tract models. JASA 131(3), 2444–2454 (2012)
11. Chhetri, D.K., Zhang, Z., Neubauer, J.: Measurement of Young's modulus of vocal folds by indentation. J. Voice 25(1), 1–7 (2011)
12. Alipour, F., Vigmostad, S.: Measurement of vocal folds elastic properties for continuum modeling. J. Voice 26, 816.e21–816.e29 (2012)
13. Scherer, R.C., et al.: Intraglottal pressure profiles for a symmetric and oblique glottis with a divergence angle of 10 degrees. JASA 109(4), 1616–1630 (2001)
14. Murray, P.R., Thomson, S.L.: Synthetic, multi-layer, self-oscillating vocal fold model fabrication. J. Vis. Exp. (JoVE) 58 (2011)
15. Chen, G., et al.: Development of a glottal area index that integrates glottal gap size and open quotient. JASA 133(3), 1656–1666 (2013)
16. Kreiman, J., et al.: Variability in the relationships among voice quality, harmonic amplitudes, open quotient, and glottal area waveform shape in sustained phonation. JASA 132(4), 2625–2632 (2012)
17. Stone, S., Marxen, M., Birkholz, P.: Construction and evaluation of a parametric one-dimensional vocal tract model. IEEE Trans. Audio Speech Lang. Process. 26(8), 1381–1392 (2018)
18. Fleischer, M., Mainka, A., Kürbis, S., Birkholz, P.: How to precisely measure the volume velocity transfer function of physical vocal tract models by external excitation. PLoS ONE 13(3), e0193708 (2018). https://doi.org/10.1371/journal.pone.0193708
19. Yushkevich, P.A., et al.: User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31(3), 1116–1128 (2006)
20. Birkholz, P.: Enhanced area functions for noise source modeling in the vocal tract. In: Proceedings of the 10th International Seminar on Speech Production (ISSP 2014), Cologne, pp. 37–40 (2014)
21. Beautemps, D., Badin, P., Bailly, G.: Linear degrees of freedom in speech production: analysis of cineradio- and labio-film data and articulatory-acoustic modeling. JASA 109(5), 2165–2180 (2001)
22. Laprie, Y., Loosvelt, M., Maeda, S., Sock, R., Hirsch, F.: Articulatory copy synthesis from cine X-ray films. In: Proceedings of the Interspeech, Lyon, France (2013)
23. Dang, J., Honda, K.: Acoustic characteristics of the piriform fossa in models and humans. JASA 101(1), 456–465 (1997)
24. Delvaux, B., Howard, D.: A new method to explore the spectral impact of the piriform fossae on the singing voice: benchmarking using MRI-based 3D-printed vocal tracts. PLOS ONE 9(7), e102680 (2014)
25. Echternach, M., et al.: Articulation and vocal tract acoustics at soprano subject's high fundamental frequencies. JASA 137(5), 2586–2595 (2015)
26. Hoffmann, R.: On the development of early vocoders. In: Proceedings of the 2nd IEEE Histelcon 2010, Madrid, pp. 359–364, 3–5 November 2010
27. Hoffmann, R.: Zur Entwicklung des Vocoders in Deutschland. In: 37. Jahrestagung für Akustik, DAGA 2011, Düsseldorf, pp. 149–150, 21–24 March 2011
28. Hoffmann, R., Gramm, G.: The Sennheiser vocoder goes digital: On a German R&D project in the 1970s. In: Proceedings of the 2nd International Workshop on the History of Speech Communication Research (HSCR 2017), Helsinki, 18–19 August 2017, pp. 35–44. TUDpress, Dresden (2017)
29. Solschenizyn, A.: Im ersten Kreis. Aus dem Russ. übersetzt und zusammengetragen von S. Geier. Vollständige Ausgabe der wiederhergestellten Urfassung. S. Fischer Verlag, Frankfurt am Main (1982)
30. Schroeder, M.R.: Computer Speech: Recognition, Compression, Synthesis. Springer Series in Information Sciences, vol. 35. Springer, Heidelberg (1999). https://doi.org/10.1007/978-3-662-06384-2
31. Tompkins, D.: How to Wreck a Nice Beach: The Vocoder from World War II to Hip-Hop. Melville House/Chicago: Stop Smiling Media, Brooklyn (2010)
32. Kotel'nikov, V.A.: Sud'ba, ochvativšaja vek. Tom 2: N. V. Kotel'nikova ob otce. Fizmatlit, Moskva (2011)
33. Kalačev, K.F.: V kruge tret'em. Vospominanija i razmyšlenija o rabote Marfinskoj laboratorii v 1948–1951 godach. Moskva (1999)
34. Hoffmann, R., Jäckel, R.: Zur Geschichte des Vocoders in der Sowjetunion. In: 44. Jahrestagung für Akustik, DAGA 2018, München, pp. 840–843, 19–22 March 2018
35. Mjasnikov, L.L.: Ob-ektivnoe raspoznavanie zvukov reči. Žurnal Techničeskoj Fiziki 13(3), 109–115 (1943)
36. Schroeder, M.R., David, E.E.: A vocoder for transmitting 10 kc/s speech over a 3.5 kc/s channel. Acustica 10, 35–43 (1960)
37. Munson, W.A., Montgomery, H.C.: A speech analyzer and synthesizer. JASA 22(5), 678 (1950)
38. Sapožkov, M.A.: Rečevoj signal v kibernetike i svjazi. Svjaz'izdat, Moskva (1963)
LSTM Neural Network for Speaker Change Detection in Telephone Conversations

Marek Hrúz1 and Miroslav Hlaváč1,2,3

1 Faculty of Applied Sciences, NTIS, UWB, Pilsen, Czech Republic
[email protected]
2 Faculty of Applied Sciences, Department of Cybernetics, UWB, Pilsen, Czech Republic
[email protected]
3 ITMO University, St. Petersburg, Russia
Abstract. In this paper, we analyze an approach to speaker change detection in telephone conversations based on recurrent Long Short-Term Memory Neural Networks. We compare this approach to speaker change detection via Convolutional Neural Networks. We show that by fine-tuning the architecture and using suitable input data in the form of spectrograms, we obtain results that are better by 2% relative. We have discovered that a smaller architecture performs better on unseen data. We also found that using stateful LSTM layers that try to remember whole conversations performs much worse than using recurrent networks that memorize only short sequences of speech.

Keywords: Speaker change · Diarization · Stateful LSTM

1 Introduction
Speaker change detection (SCD) is the task of finding boundaries of speech segments of two different speakers. Usually, the final goal is to use these segments for the task of diarization [4], where the segments are clustered and labelled, leading to the solution of the problem "who speaks when". In our previous paper [3], we extended the definition of SCD to the detection of time instances in an audio stream when a change of audio source occurs, which yields segments with constant audio sources. This is important for the real-world scenario where the speech is produced naturally, e.g. a telephone conversation. There are frequent speech overlaps, loud noise may be present, and silence also plays its role. If we look at SCD as a preliminary step to speaker diarization, it is reasonable to detect the time in the audio stream when a second speaker starts speaking into the speech of the first speaker, producing overlapped speech, and also the time when the overlap ends. Such a segment has a constant audio source and can be handled by the diarization system as an outlier. Another issue is silence. When a speaker speaks, makes a long pause, and then continues to speak, the boundaries of the silence should also be detected although no speaker change occurred. This is reasonable since the diarization system will model one segment as one speaker, and a long silence can affect the acoustic properties of the speaker, leading to a model that is not correct. One can argue that this can be handled by a voice activity detection system, but our definition explicitly handles this problem and we believe that the SCD system learns this naturally from the data. The open problem is the length of the silence – what is considered a long pause?
2 Related Work
SCD has been addressed by many researchers in the past. There are classical methods based on the comparison of neighbouring segments of audio – the Bayesian Information Criterion [5] and other distance measures. More recently, some papers have applied Deep Neural Networks (DNN) to the task of SCD: Convolutional Neural Networks (CNN) [3], standard fully connected DNNs [2], and recurrent DNNs with Long Short-Term Memory (LSTM) cells [6]. The biggest difference is that the CNN uses the spectrogram of the audio while the other approaches use hand-crafted features like MFCC. In this paper, we experimented with different architectures of LSTM networks to find out whether they outperform the baseline CNN approach we published earlier [3]. We also compared whether the DNNs perform better when hand-crafted features are presented to them or whether we can use the raw audio signal in the form of a spectrogram.
3 Dataset
For the training and testing purposes, we used a fraction of the telephone conversation data from the CallHome [1] corpus. The data are sampled at 8 kHz and are in English. We consider only the conversations where two speakers are present. In total, we obtain 109 conversations, each approximately 10 min long. We used 71 conversations for training and 38 conversations for testing. If spectrograms are used, we compute each sample using a Hamming window of length 64 ms with a stride of 10 ms. Each sample represents 256 frequencies of the speech micro-segment. When we use MFCCs, the setup is as follows: we use a Hamming window of length 32 ms and shift it by 16 ms. We use 25 triangular filter banks that are spread non-linearly (Mel) and extract 11 cepstral coefficients. We use the deltas and delta-deltas of the coefficients and, furthermore, the deltas and delta-deltas of the signal energy. This setup follows the setup in [6].
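As an illustration of the spectrogram setup described above, the following Python sketch frames an 8 kHz signal with a 64 ms Hamming window and a 10 ms stride and keeps 256 frequency bins per frame. The function and variable names are our own, and the snippet is only an approximation of the authors' pipeline.

```python
import numpy as np

def spectrogram_frames(signal, fs=8000, win_ms=64, hop_ms=10, n_bins=256):
    """Log-magnitude spectrogram: one row per 10 ms step, n_bins frequencies."""
    win = int(fs * win_ms / 1000)          # 512 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)          # 80 samples at 8 kHz
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.empty((n_frames, n_bins), dtype=np.float32)
    for i in range(n_frames):
        chunk = signal[i * hop:i * hop + win] * window
        spec = np.abs(np.fft.rfft(chunk))[:n_bins]   # keep the first 256 bins
        frames[i] = np.log(spec + 1e-8)
    return frames

# usage (placeholder signal):
# audio = np.random.randn(8000 * 10)      # 10 s at 8 kHz
# feats = spectrogram_frames(audio)       # shape: (n_frames, 256)
```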
4 Network Architectures
In this paper, we use the CNN published earlier [3] as a baseline and compare the performance of different LSTM DNN architectures with it. The CNN architecture is summarized in Table 1. The convolutional layers use ReLU activation functions and the fully connected dense layers use sigmoid activation functions.
Table 1. Summary of the architecture of the CNN.

Layer        | Kernels | Size   | Shift
Convolution  | 50      | 16 × 8 | 2 × 2
Max pooling  |         | 2 × 2  | 2 × 2
Batch norm   |         |        |
Convolution  | 200     | 4 × 4  | 1 × 1
Max pooling  |         | 2 × 2  | 2 × 2
Batch norm   |         |        |
Convolution  | 300     | 3 × 3  | 1 × 1
Max pooling  |         | 2 × 2  | 2 × 2
Batch norm   |         |        |
Dense        | 4000    |        |
Dense        | 1       |        |
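A possible Keras rendering of this CNN is sketched below (our own illustration based on Table 1; the input size, padding, and optimizer are assumptions not stated in the paper).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline_cnn(input_shape=(256, 141, 1)):
    # assumed input: 256 frequency bins x ~141 frames (1.4 s at a 10 ms step)
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(50, (16, 8), strides=(2, 2), activation="relu"),
        layers.MaxPooling2D((2, 2), strides=(2, 2)),
        layers.BatchNormalization(),
        layers.Conv2D(200, (4, 4), strides=(1, 1), activation="relu"),
        layers.MaxPooling2D((2, 2), strides=(2, 2)),
        layers.BatchNormalization(),
        layers.Conv2D(300, (3, 3), strides=(1, 1), activation="relu"),
        layers.MaxPooling2D((2, 2), strides=(2, 2)),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(4000, activation="sigmoid"),
        layers.Dense(1, activation="sigmoid"),   # speaker-change probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```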
The LSTMs were used in two different ways. First, we trained stateful LSTMs, where we train the models on whole conversations. This means that the net remembers what was said, how it was said, and who was speaking. Second, we trained the LSTMs on short sequences, and after each sequence the state of the network was reset. This means that the memory of the net is limited to the length of the sequence. Both approaches share the same architectures, but the style of training is changed. The architectures are summarized in Tables 2 and 3.

Table 2. Summary of the architectures of the LSTMs. Each layer is listed as Layer (Cells, Activation).

Architecture 1: LSTM (512, tanh), LSTM (512, tanh), LSTM (512, tanh), Dense (1024, relu), Dense (512, relu), Dense (256, relu), Dense (1, sigmoid)
Architecture 2: Dense (1024, linear), LSTM (512, tanh), LSTM (512, tanh), LSTM (512, tanh), Dense (1024, relu), Dense (512, relu), Dense (256, relu), Dense (1, sigmoid)
Architecture 3: LSTM (512, tanh), LSTM (512, tanh), LSTM (512, tanh), Dense (1024, tanh), Dense (512, tanh), Dense (256, tanh), Dense (1, sigmoid)
The first set of architectures in Table 2 uses a larger number of parameters (circa 7–8 million trainable parameters). The architecture in Table 3 follows the work of Yin et al. [6] and is more lightweight (circa 100k trainable parameters). Some architectures were also tested in a bi-directional scenario, in which the LSTM layers observe the sequences from both directions. This effectively doubles the number of parameters of the LSTM layers. The last layer in all networks consists of only one neuron with a sigmoid activation function; it represents the probability for a given input. In the case of the CNN, the input is a spectrogram of 1.4 s of audio signal and the output is the probability of a speaker change in the middle of the signal. For the LSTMs, the input is either a sequence of spectrogram samples or a sequence of MFCCs.

Table 3. Summary of the architecture of the lightweight LSTMs. Each layer is listed as Layer (Cells, Activation).

Architecture 4: LSTM (32, tanh), LSTM (40, tanh), Dense (40, tanh), Dense (10, tanh), Dense (1, sigmoid)
Architecture 5: LSTM (64, tanh), LSTM (64, tanh), LSTM (128, tanh), Dense (128, tanh), Dense (64, tanh), Dense (1, sigmoid)
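For concreteness, a possible realization of the lightweight Architecture 4 in Keras is sketched below (our own assumption of how such a model could be built; the paper does not give training code). It returns a per-frame speaker-change probability for an input sequence of feature vectors.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_architecture_4(n_features):
    # sequence-to-sequence model: one change probability per input frame
    model = models.Sequential([
        layers.Input(shape=(None, n_features)),
        layers.LSTM(32, activation="tanh", return_sequences=True),
        layers.LSTM(40, activation="tanh", return_sequences=True),
        layers.TimeDistributed(layers.Dense(40, activation="tanh")),
        layers.TimeDistributed(layers.Dense(10, activation="tanh")),
        layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# e.g. 256 spectrogram bins per frame, as in Sect. 3
model = build_architecture_4(n_features=256)
model.summary()
```

Wrapping the LSTM layers in layers.Bidirectional(...) would give the bi-directional variants mentioned above.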
5 Experiments
All the networks were trained to minimize the loss between the predicted probability signal and the labelling signal. The labelling signal is created according to the work [4]. It is a linear fuzzy labelling, where the probability of the speaker change is given by a normalized distance from an annotated value as in Eq. 1:

$L(t) = \max\left(0,\; 1 - \frac{\min_i(|t - s_i|)}{\tau}\right)$,  (1)

where $s_i$ is the time of the $i$-th annotated speaker change and $\tau = 0.6$ is the tolerance. The annotated speaker changes were filtered so that a pause in one speaker's utterance is limited to 0.5 s. The loss function is defined as the binary cross-entropy function. For all setups, optimal hyperparameters have been found and the networks are trained until convergence of the loss function. The results are presented in the form of coverage-purity curves (Eq. 2) and equal coverage-purity values:

$\mathrm{coverage}(R, H) = \frac{\sum_{r \in R} \max_{h \in H} |r \cap h|}{\sum_{r \in R} |r|}$,  (2)

where $R$ are the reference speaker segments, $H$ are the predicted speaker segments, $|s|$ is the duration of segment $s$, and $r \cap h$ is the intersection of segments $r$ and $h$. Purity is the dual metric where the roles of $R$ and $H$ are interchanged. The coverage represents how well we divided the audio signal according to the annotations. Low coverage means that the signal was oversegmented; on the other hand, undersegmentation results in high coverage. That is why the overall quality of the segmentation has to be analyzed dually by the purity measurement. Low purity means that the signal was undersegmented, while oversegmentation results in high purity. That is why the best result is achieved when both coverage and purity are high.
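A minimal Python sketch of the coverage computation in Eq. 2 follows; purity is obtained by swapping the roles of the reference and hypothesis segments. Segments are represented here as (start, end) tuples in seconds, which is our own convention.

```python
def overlap(a, b):
    """Duration of the intersection of two (start, end) segments."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def coverage(reference, hypothesis):
    """Eq. 2: best-matching overlap per reference segment, normalized
    by the total reference duration."""
    num = sum(max((overlap(r, h) for h in hypothesis), default=0.0)
              for r in reference)
    den = sum(r[1] - r[0] for r in reference)
    return num / den if den > 0 else 0.0

def purity(reference, hypothesis):
    # dual metric: roles of reference and hypothesis are interchanged
    return coverage(hypothesis, reference)

# toy example with hypothetical segment boundaries
ref = [(0.0, 4.0), (4.0, 9.0)]
hyp = [(0.0, 3.5), (3.5, 9.0)]
print(coverage(ref, hyp), purity(ref, hyp))
```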
5.1 Baseline CNN
The system is trained according to the work [4]. Each spectrogram representing 1.4 s of the audio signal is regressed into the probability value of a speaker change in the middle of the audio segment. The training samples are randomly selected from the training set. The testing audio signals are analyzed window by window with a shift of 0.1 s.

5.2 Stateful LSTMs
Stateful LSTMs are trained on sequences covering whole conversations. The LSTMs are trained on batches of shorter sequences, but the internal states of the networks are remembered across the whole conversation. The experiment should show whether the networks are able to model the speakers present in the conversation and/or whether they are able to learn from generally longer sequences. The LSTMs return sequences, which allows us to predict frame-based probabilities of the speaker change. In the case of spectrograms on the input of the network, the frame is 0.01 s long; in the case of MFCCs, the frame is 0.016 s long. The testing audio signals are analyzed conversation by conversation. After each testing conversation, the internal states of the network are reset.

5.3 Short Sequence LSTMs
These networks were trained with "forgetting" after each short sequence. According to [6], we used sequences 3.2 s long. The networks were presented with batches of randomly selected sequences from the training set. The testing sequences were obtained from individual conversations and shifted by 0.8 s, resulting in overlapping probability signals. The resulting probability for a given time was computed as the average of the overlapping values at that time.
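The local averaging of overlapping per-frame predictions described above can be sketched as follows (our own illustration; the frame rate, window length, and shift follow the values given in the text):

```python
import numpy as np

def average_overlapping(prob_windows, frame_hop_s=0.01, win_s=3.2, shift_s=0.8):
    """Average per-frame probabilities of 3.2 s windows shifted by 0.8 s.
    prob_windows: list of 1-D arrays, one per window, at a 100 Hz frame rate."""
    frames_per_win = int(round(win_s / frame_hop_s))
    frames_per_shift = int(round(shift_s / frame_hop_s))
    total = frames_per_shift * (len(prob_windows) - 1) + frames_per_win
    acc = np.zeros(total)
    cnt = np.zeros(total)
    for i, p in enumerate(prob_windows):
        start = i * frames_per_shift
        acc[start:start + frames_per_win] += p[:frames_per_win]
        cnt[start:start + frames_per_win] += 1
    return acc / np.maximum(cnt, 1)
```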
6 Results
In Fig. 1, we show the coverage-purity curves for the stateful LSTM DNNs. The different networks are summarized in Table 4. The best result, with an equal coverage-purity (ECP) of 0.7257, was achieved by the network that uses the raw spectrogram as input. The hand-crafted MFCC features worsen the results. The network denoted net03 uses a dense layer to simulate a feature extraction step; this supersedes the MFCC features but is still not as good as when the raw spectrogram is used. Still, the best result is far behind the ECP of the CNN, which is equal to 0.7955. This indicates that there are not enough data to train the stateful LSTMs to obtain a model that is general and works well on unseen testing data. We tried lowering the number of trainable parameters to address this issue by using the lightweight architecture summarized in Table 3. Our experiments showed that using the bidirectional LSTM layers is beneficial by a small margin. The ECP achieved by the network denoted net04 in Fig. 2 was equal to 0.7675, which is
Fig. 1. Coverage-purity curves for stateful large LSTMs.
Fig. 2. Coverage-purity curves for lightweight LSTMs.
much better than the larger architecture but still not as good as the CNN. This result supports our theory of not having enough data to train the stateful LSTMs. There may be other reasons, but we were not able to achieve better results. This conclusion led us to new experiments with shorter sequences and "forgetful" LSTM DNNs. During these experiments, we also found that using the hyperbolic tangent as the activation function of the dense layers yields better results than the ReLU function, hence Architecture 3. With this setup, we achieved an ECP of 0.7639, which is again worse than the lightweight architecture. This means that the large architectures are inadequately large for the shorter sequences. When we use the lightweight architecture in this scenario, we achieve an ECP of 0.7807 when using MFCC features and finally an ECP of 0.8121 when using the raw spectrogram. This result is better than the results of the CNN. One last test was with a network with Architecture 5 (Table 3), which has more parameters. The resulting ECP of 0.8027 shows that the smaller architecture performs better on unseen data.

Table 4. Different types of LSTM networks. Spectro indicates that the spectrogram was on the input, while MFCC means MFCC features. Bi means that the LSTM layers were bidirectional and Arch is the type of architecture used.

Name   | Stateful | Spectro | MFCC | Bi | Arch | ECP
net01  | ×        |         | ×    |    | 1    | 0.7160
net02  | ×        | ×       |      |    | 1    | 0.7257
net03  | ×        | ×       |      |    | 2    | 0.7189
net04  | ×        |         | ×    | ×  | 4    | 0.7639
net05  |          |         | ×    | ×  | 4    | 0.7807
net06  |          | ×       |      | ×  | 4    | 0.8121

7 Conclusion
We have conducted experiments with recurrent neural networks for the task of speaker change detection in telephone conversations. We have shown that the recurrent LSTM DNNs are able to outperform the CNN approach of [4] when proper care is put into the selection of the architecture and the form of the input audio data. Smaller architectures have a better chance to generalize the problem and perform well on unseen data. With a larger architecture, much more data would be needed to train good models, even though the problem of speaker change detection does not seem to be a very difficult one when handled with a machine learning approach. Another important conclusion is that using the raw spectrogram input is much better than using hand-crafted MFCC features. This has been observed in many other applications of neural networks, particularly CNNs in computer vision. With the best setup of LSTM DNNs,
we achieve a result of 0.8121, outperforming the baseline CNN by 2% relative. When comparing both approaches as a whole, one has to consider their usage. The CNN is able to decide about any 1.4 s long segment independently of other segments, but it generally has many more parameters than the lightweight LSTM DNN. On the other hand, our LSTM DNN needs to observe 3.2 s long segments and uses local averaging of the prediction probability function. This requires more forward passes through the network, but we obtain a frame-level decision about the speaker change. Given the smaller number of parameters in the network, this might not be an issue.
Acknowledgments. This paper was supported by the project no. P103/12/G084 of the Grant Agency of the Czech Republic. The work has also been supported by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References
1. Canavan, A., Graff, D., Zipperlen, G.: CALLHOME American English speech, LDC97S42. In: LDC Catalog. Linguistic Data Consortium, Philadelphia (1997)
2. Gupta, V.: Speaker change point detection using deep neural nets. In: ICASSP, pp. 4420–4424. Brisbane (2015). https://doi.org/10.1109/ICASSP.2015.7178806
3. Hrúz, M., Kunešová, M.: Convolutional neural network in the task of speaker change detection. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 191–198. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_22
4. Hrúz, M., Zajíc, Z.: Convolutional neural network for speaker change detection in telephone speaker diarization system. In: ICASSP: 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4945–4949 (2017)
5. Shaobing, S., Gopalakrishnan, P.: Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, vol. 8, pp. 127–132 (1998)
6. Yin, R., Bredin, H., Barras, C.: Speaker change detection in broadcast TV using bidirectional long short-term memory networks. In: Interspeech 2017, Proceedings of the 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), ISCA, Stockholm, Sweden (2017). https://doi.org/10.21437/Interspeech.2017-65
Noise Suppression Method Based on Modulation Spectrum Analysis

Takuto Isoyama and Masashi Unoki

Japan Advanced Institute of Science and Technology, 1–1 Asahidai, Nomi, Ishikawa 923–1292, Japan
{isoyama-t,unoki}@jaist.ac.jp
Abstract. Conventional methods for noise suppression can successfully reduce stationary noise. However, non-stationary noise such as intermittent and impulsive noise cannot be sufficiently suppressed since these methods do not focus on temporal features of noise. This paper proposes a method for suppressing both stationary and non-stationary noise based on modulation spectrum analysis. Modulation spectra (MS) of the stationary, intermittent, and impulsive noise were investigated by using the time/frequency/modulation analysis techniques to characterize the MS features. These features were then used to suppress the stationary and non-stationary noise components from the observed signals. Using the proposed method, the direct-current components of the MS in the stationary noise, harmonicity of the MS in the intermittent noise, and higher modulation-frequency components of the MS in the impulsive noise were removed. The following advantages of the proposed method were confirmed: (1) sound pressure level of the noise was dramatically reduced, (2) signal-to-noise ratio of the noisy speech was improved, and (3) loudness, sharpness, and roughness of the restored speech were enhanced. These results indicate that the stationary as well as non-stationary noise can be successfully suppressed using the proposed method.

Keywords: Noise suppression · Modulation spectrum · Non-stationary noise · Gammatone filterbank · Psychoacoustical sound-quality index
1 Introduction
We perceive various types of sounds at various sound-pressure levels in our daily life. For example, speech and music are perceived as wanted sound, and background stationary and non-stationary noise as unwanted sound. Heavy noise not only dramatically reduces the intelligibility of speech but also induces hearing loss and hearing fatigue in the case of long-term exposure. Therefore, noise suppression is important for enhancing speech intelligibility as well as for protecting hearing ability. There are many kinds of noise suppression methods. The classical and most commonly used method for suppressing noise is Boll's spectral subtraction method
[1]. It can successfully suppress stationary components of background noise by subtracting the averaged amplitude spectrum from the noisy speech. However, this method cannot sufficiently suppress non-stationary noise such as impulsive noise and intermittent noise. Several methods have been proposed for suppressing non-stationary noise. One of them can sufficiently suppress impulsive noise by using a zero-phase signal [2]. However, the drawback of that method is that the impulsive components of the unvoiced signals (consonants) are also removed. Non-negative spectral decomposition was proposed to suppress both stationary and non-stationary noise [3]. However, this method trains noise properties using a preliminary learning technique, so noise reduction is limited to the training data. It is difficult to reduce both stationary and non-stationary noise simultaneously without prior knowledge of the noise types and preliminary learning. From the knowledge of human auditory perception [4], temporal modulation can be regarded as an important part of speech perception as well as of sound quality assessment. Therefore, our motivation is to mimic noise reduction based on auditory modulation perception. This paper proposes a method for suppressing both stationary and non-stationary noise based on modulation spectrum (MS) analysis. MS features are used to characterize speech signals as well as various types of noise. These features are then used to reduce stationary, impulsive, and intermittent noise components from noisy speech.
Fig. 1. Block diagram of method for suppressing stationary/non-stationary noise.
2 Noise Suppression Method

Fig. 1 shows a block diagram of the suppression method based on MS. First, the observed signal $s(t)$ is decomposed into its frequency components ($k$-channel signals) $x_k(t)$ by the gammatone filterbank [5]. Second, the temporal amplitude envelope $e_k(t)$ and carrier signal $c_k(t)$ are decomposed from $x_k(t)$. The temporal power envelope of the $k$-th channel, $e_k^2(t)$, can be derived by using the Hilbert transform as follows:

$e_k^2(t) = \mathrm{LPF}\left(|x_k(t) + j \cdot \mathrm{Hilbert}(x_k(t))|^2\right)$,  (1)

where $\mathrm{LPF}(\cdot)$ is the low-pass filter with the cut-off frequency of 64 Hz, $|\cdot|$ is the absolute value, and $\mathrm{Hilbert}(\cdot)$ is the Hilbert transform. Third, noise components
on the temporal power envelope of the $k$-th channel are removed by using the following steps: (1) removal of the stationary noise component, (2) removal of the intermittent noise component, and (3) removal of the impulsive noise component. This is referred to as a "noise-suppressed" power envelope. Fourth, the restored amplitude envelope is derived from the noise-suppressed power envelope by a square-root operation, and the stored carrier is multiplied to resynthesize the noise-suppressed channel signal. Finally, the noise-suppressed signal, $y(t)$, is obtained by using the inverse gammatone filterbank. The blocks in Fig. 1 indicate the following: $(\cdot)^2$ is the square operation, $\mathrm{Mean}(\cdot)$ is a mean operation in the time domain, $\mathrm{HWR}(\cdot)$ is a half-wave rectification, $\mathrm{BSF}(\cdot)$ is a band-stop filter, and $\mathrm{LPF}_S(\cdot)$ is a low-pass filter whose pass-band approximates the MS shape of speech. Figure 2 shows an example of how the power envelope of an observed signal is processed to suppress stationary and non-stationary noise components. Figure 2(a) shows the temporal power envelope derived from the outputs of the gammatone filterbank. Figure 2(b) shows the power envelope after removing the direct-current (DC) component of the MS in Fig. 2(a). Figure 2(c) shows the power envelope after removing the harmonics in the MS in Fig. 2(b). Figure 2(d) shows the power envelope after removing the higher modulation-frequency components of the MS in Fig. 2(c). The temporal power envelopes in Figs. 2(a)–(d) are obtained from the outputs of the corresponding processing blocks in Figs. 1(a)–(d).
Fig. 2. Examples of noise suppression by the proposed method: (a) power envelope of the observed noisy signal; (b) suppressed power envelope by DC-removal from (a); (c) suppressed power envelope by band-stop filtering of (b), and (d) suppressed power envelope by low-pass filtering of (c).
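The derivation of the power envelope in Eq. 1 can be sketched in Python as follows; the MS of Eq. 2 in the next section is then simply the DFT magnitude of this envelope. The 64 Hz cut-off is taken from the text, while the filter order, function names, and the assumption that the gammatone filterbank stage has already been applied are our own.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def power_envelope(x_k, fs, cutoff_hz=64.0, order=4):
    """Eq. 1: temporal power envelope of one gammatone channel x_k(t)."""
    analytic = hilbert(x_k)                    # x_k(t) + j*Hilbert(x_k(t))
    squared = np.abs(analytic) ** 2
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")   # 64 Hz low-pass
    return filtfilt(b, a, squared)

def modulation_spectrum(env2):
    """Eq. 2 (Sect. 3): magnitude of the DFT of the power envelope."""
    return np.abs(np.fft.rfft(env2))
```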
3 Modulation Spectrum Analysis

Modulation spectrum analysis (MSA) is used in Fig. 1(a) to investigate the modulation features of various types of sounds. The MS, $E_k(f_m)$, can then be derived
from the power envelope, $e_k^2(t)$, by using the discrete Fourier transform (DFT) as follows:

$E_k(f_m) = |\mathrm{DFT}(e_k^2(t))|$,  (2)

where $f_m$ is the modulation frequency in Hz.

3.1 Database
Table 1 shows the database information of the sound sources in detail. Here, $f_s$ is the sampling frequency in Hz. This frequency has different values in each dataset, so the sampling frequency used in the proposed method and the MSA is set to 44.1 kHz by using a resampling technique. Speech stimuli, including male and female speech signals with four-mora words from the familiarity-controlled word lists (FW07) [6], were used to analyze the MS features of speech signals. Noise stimuli, including stationary noise (white noise, pink noise, and babble noise) and non-stationary noise (machine-gun noise as intermittent noise and impulses as impulsive noise) from NOISEX-92 [7], were used to analyze the MS features of stationary and non-stationary noise.
Table 1. Sound sources used for modulation spectrum analysis.
Sound source       | fs [Hz] | Number of stimuli | Duration [sec]
White noise        | 19,980  | 1                 | 235
Pink noise         | 19,980  | 1                 | 235
Babble noise       | 19,980  | 1                 | 235
Machine-gun noise  | 19,980  | 1                 | 235
Impulse noise      | 19,980  | 4                 | 1
Male speech        | 48,000  | 400               | 1
Female speech      | 48,000  | 400               | 1

3.2 Feature Analysis and Results
The MSA was used to analyze all stimuli of the various sound sources shown in Table 1. Figure 3 shows the results of the MSA for the various sound sources. The horizontal axis indicates the modulation frequency in Hz and the vertical axis indicates the normalized MS in dB. Here, normalization was done such that the level of the MS at 0 Hz is 0 dB. It was reconfirmed that the modulation spectra of the speech stimuli have a unique peak around the modulation frequency of 4 Hz, as shown in Fig. 3 [8]. It was found that the modulation spectra of stationary noise such as white noise, pink noise, and babble noise appear in the lower modulation frequencies. It was also found that the MS of machine-gun noise as intermittent noise appears as harmonics. From the analyses of the datasets, it was found that the fundamental
Fig. 3. Modulation spectra Ek (fm ) of various signals in the case of (a) k=11, (b) k=16, (c) k=21, and (d) k=27.
modulation frequency of the machine-gun noise was 8 Hz, while the MS of the impulsive noise appears as a flat shape with a dynamic range of 5 dB over all modulation frequencies. These values of $f_m$ and the dynamic range depend on the datasets, so they should be automatically determined by the auto-correlation technique. From these findings, the MS features of the various types of noise may be used to suppress both stationary and non-stationary noise simultaneously in the MS domain.
4 Algorithms of Suppression Processing

This section presents the three algorithms of noise suppression processing in Fig. 1 used to remove stationary and non-stationary noise components.

4.1 Removal of Stationary Noise Component

From the results in Sect. 3, it is found that the MS of stationary noise appears in the lower modulation frequencies. Thus, to obtain the power envelope $q_k^2(t)$
with the stationary noise removed, the DC component of the MS in Fig. 1(a) was cancelled out by using the following processing:

$q_k^2(t) = \begin{cases} e_k^2(t) - \mu_k & \text{if } e_k^2(t) \ge \mu_k \\ 0 & \text{otherwise} \end{cases}$,  (3)

$\mu_k = \frac{1}{T_N} \int_0^{T_N} e_k^2(t)\,dt$,  (4)

where $T_N$ is the time length of the non-speech section. In this paper, the speech and non-speech sections were determined by the voice activity detection (VAD) method [9].

4.2 Removal of Intermittent Noise Component

From the results in Sect. 3, it is found that the MS of intermittent noise appears as harmonics with a fundamental modulation frequency of 8 Hz. Thus, to remove the intermittent noise component, these harmonics of the MS in Fig. 1(b) are cancelled out by the following finite impulse response (FIR) band-stop filtering:

$H(z) = b_0 - r^L z^{-L}$,  (5)

where $b_0 = 1$, $r = 0.995$, $f_c$ is the fundamental modulation frequency, and $L = \mathrm{round}(f_s / f_c)$. In this paper, $f_c$ was determined from the rectified signal in Fig. 1(b) by using the auto-correlation method. Figure 4(a) shows an example of a band-stop filter (BSF) with $f_c = 8$ Hz.

4.3 Removal of Impulsive Noise Component

From the results in Sect. 3, it is found that the MS of impulsive noise appears over the entire modulation frequency domain as a flat shape. Thus, to remove the impulsive noise component, the MS shape in Fig. 1(c) is attenuated by using the following low-pass filter ($\mathrm{LPF}_S$):

$H(z) = \frac{b_0 + b_1 z^{-1}}{1 + a_1 z^{-1}}$,  (6)

where $H(z)$ was designed as an infinite impulse response (IIR) Butterworth filter. For example, when the cut-off frequency is 5 Hz, the coefficients are $b_0 = 0.07$, $b_1 = 0.07$, and $a_1 = 0.85$. Figure 4(b) shows an example of an $\mathrm{LPF}_S$ with a cut-off frequency of 5 Hz.
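A possible Python sketch of the three removal steps in Eqs. (3)–(6) is given below. The comb band-stop filter follows Eq. (5) with b0 = 1 and r = 0.995; the low-pass stage is designed here with scipy's Butterworth routine instead of the hard-coded coefficients; and the non-speech mask is assumed to come from an external VAD. Function names and the envelope sampling rate are our own assumptions.

```python
import numpy as np
from scipy.signal import lfilter, butter

def remove_stationary(env2, nonspeech_mask):
    """Eqs. (3)-(4): subtract the mean of the non-speech power envelope."""
    mu = env2[nonspeech_mask].mean()
    return np.where(env2 >= mu, env2 - mu, 0.0)

def remove_intermittent(env2, fs_env, fc, r=0.995):
    """Eq. (5): FIR comb band-stop H(z) = 1 - r^L z^-L at the fundamental
    modulation frequency fc (e.g. 8 Hz for the machine-gun noise)."""
    L = int(round(fs_env / fc))
    b = np.zeros(L + 1)
    b[0], b[L] = 1.0, -(r ** L)
    return lfilter(b, [1.0], env2)

def remove_impulsive(env2, fs_env, cutoff_hz=5.0, order=1):
    """Eq. (6): low-order IIR Butterworth low-pass over the power envelope."""
    b, a = butter(order, cutoff_hz / (fs_env / 2.0), btype="low")
    return lfilter(b, a, env2)
```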
5 Evaluations

5.1 Evaluation Measures
Five types of objective measures were used to evaluate the proposed method.
Fig. 4. Frequency responses of (a) band-stop filter and (b) low-pass filter.
The first two measures evaluated the efficiency of the proposed method in suppressing the noise components. One of them was used to evaluate the suppression level ($S_L$) with regard to the noise itself, defined as:

$S_L = 10 \log_{10} \frac{\int_0^T s^2(t)\,dt}{\int_0^T y^2(t)\,dt}$,  (7)

where $s(t)$ is the observed signal before suppression and $y(t)$ is the noise signal after suppression. Another measure was used to evaluate the relative suppression level with regard to noisy speech, defined as:

$N_S = \mathrm{SNR} - 10 \log_{10} \frac{\int_0^T s_s^2(t)\,dt}{\int_0^T (s_s(t) - y(t))^2\,dt}$,  (8)

where SNR is the signal-to-noise ratio with regard to the noise condition, $s_s(t)$ is the original speech, $y(t)$ is the noisy speech, and $T$ is the signal duration. The last three measures were the psychoacoustical sound-quality indices loudness, sharpness, and roughness [10]. These measures were used to objectively evaluate the sound quality after noise suppression. Loudness indicates the attribute of a sound that determines the magnitude of the auditory sensation produced. Sharpness and roughness indicate complex effects that quantify the subjective perception of rapid and sharp sound. Thus, heavy noise, in general, induces increasing loudness, increasing sharpness, and increasing roughness.
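The two suppression measures in Eqs. (7) and (8) reduce to ratios of signal energies and can be sketched as follows (a discrete-time approximation of the integrals; variable names are our own):

```python
import numpy as np

def suppression_level(s, y):
    """Eq. (7): S_L = 10*log10( sum(s^2) / sum(y^2) ) for the noise
    before (s) and after (y) suppression."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(y ** 2))

def relative_suppression(snr_db, s_clean, y_processed):
    """Eq. (8): N_S = SNR - 10*log10( sum(s_s^2) / sum((s_s - y)^2) )."""
    num = np.sum(s_clean ** 2)
    den = np.sum((s_clean - y_processed) ** 2)
    return snr_db - 10.0 * np.log10(num / den)
```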
5.2 Results
The proposed method was evaluated by using five types of noise signals presented in Table 1 and five evaluation measures presented in Sect. 5.1 to confirm whether
the proposed method can sufficiently suppress stationary and non-stationary noise, as well as reduce the perceptual effects due to noise exposure. Figure 5 shows the results of the noise suppression level as a function of the sound pressure level (SPL) of the noise from 60 dB to 100 dB. These results were obtained for the five types of noise signals. They indicate that the proposed method can sufficiently suppress the noise level regardless of the SPL and noise type. They also indicate that the three noise-suppression algorithms together have a sufficient suppression effect in comparison with each algorithm applied individually. Figure 6 shows the results of the relative suppression level as a function of SNR at three specific SPLs of noise from 60 dB to 100 dB in heavy noise conditions. The relative suppression level was calculated for the five types of noise by using Eq. (8). It was found that the noise suppression level under speech presentation by the proposed method in the cases of SPLs of 100 dB and 80 dB exceeds 5 dB, while the suppression level decreases from 5 dB as the SNR increases in the case of an SPL of 60 dB.
Fig. 5. Noise suppression level by the proposed method for five types of noisy signals for (a) sound pressure level of 60 dB, (b) 80 dB, and (c) 100 dB.
Psychoacoustical sound-quality indices were calculated from the five types of noisy speech and the noise-suppressed speech signals. Figure 7 shows the relative improvement of these indices when using the proposed method. Figure 7(a) shows the improvement in loudness when using the proposed method, that is, the reduced loudness, $L_R$. This was calculated by $L_R = L_{org} - L_{sup}$, where $L_{org}$ is the loudness of the original noisy speech and $L_{sup}$ is the loudness of the noise-suppressed speech. It was found that when the SPL of noise is 100 dB, the $L_R$ of white, pink, and babble noise is 50 sone, while for the intermittent noise and impulsive noise it is 20 sone. In addition, it was found that the reduced loudness, $L_R$, increases as the SPL of noise increases. Figure 7(b) shows the reduced sharpness, $K_R$. This was calculated by $K_R = K_{org} - K_{sup}$, where $K_{org}$ is the sharpness of the original noisy speech and $K_{sup}$ is the sharpness of the noise-suppressed speech. It was found that when the SPL of noise is 100 dB, the $K_R$ of white, pink, and babble noise is 0.1 acum, while for the
Fig. 6. Relative suppression level for five types of noisy signals for (a) sound pressure level of 100 dB, (b) 80 dB, and (c) 60 dB.
Fig. 7. Evaluations by psychoacoustical sound-quality indices: (a) reduced loudness LR , (b) reduced sharpness KR , and (c) reduced roughness RR .
intermittent noise and impulsive noise it is 0 acum. In addition, it was found that the reduced sharpness, $K_R$, remains the same when the SPL of noise increases. Figure 7(c) shows the reduced roughness, $R_R$. This was calculated by $R_R = R_{org} - R_{sup}$, where $R_{org}$ is the roughness of the original noisy speech and $R_{sup}$ is the roughness of the noise-suppressed speech. It was found that when the SPL of noise is 100 dB, the $R_R$ of white, pink, and babble noise is 0.05 asper, that of intermittent noise is 0.73 asper, and that of impulsive noise is 0.25 asper. In addition, it was found that the reduced roughness is sensitive to the non-stationary temporal fluctuations in such non-stationary noise. All of the results confirmed that the proposed method can perceptually reduce the noise effects for speech enhancement, even if the SPL of noise is high.
6 Conclusion
This paper proposed a method for suppressing both stationary and non-stationary noise based on MSA. MSA was used to investigate the unique features of stationary and non-stationary noise in the MS domain and to derive methods for cancelling out these features. The proposed method was evaluated on various types of noisy speech signals by using five types of evaluations (two suppression levels and three psychoacoustical sound-quality indices) to verify whether or not the noise level can be sufficiently suppressed and the perceptual effects of stationary and non-stationary noise can be reduced. It was found that the proposed method can suppress stationary noise by 8 dB, intermittent noise by 6 dB, and impulsive noise by 8 dB in terms of the suppression level. It was also found that the proposed method can sufficiently suppress the noise effects from noisy speech by 8 dB at SNRs from −20 to −60 dB in terms of the relative suppression level. It was found that when the SPL of noise is 100 dB, the $L_R$ of stationary noise is 50 sone, while that of intermittent noise and impulsive noise is 20 sone. In addition, it was found that $L_R$ increases as the SPL of noise increases. It was found that when the SPL of noise is 100 dB, the $K_R$ of stationary noise is 0.1 acum while that of intermittent noise and impulsive noise is 0 acum. It was found that when the SPL of noise is 100 dB, the $R_R$ of stationary noise is 0.05 asper, that of intermittent noise is 0.73 asper, and that of impulsive noise is 0.25 asper. This confirms that the proposed method can not only sufficiently suppress stationary and non-stationary noise but can also reduce the perceptual effects due to noise exposure.
Acknowledgments. This work was supported by the Secom Science and Technology Foundation, by the Suzuki Foundation, and by a Grant-in-Aid for Innovative Areas (No. 16H01669 and 18H05004) from MEXT, Japan.
References
1. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27, 113–120 (1979)
2. Takehara, R., Kawamura, A., Iiguni, Y.: Impulsive noise suppression using interpolated zero phase signal. In: APSIPA2017, pp. 1382–1389 (2017)
3. Zhiyao, D., Gautham, J.M., Paris, S.: Speech enhancement by online non-negative spectrogram decomposition in non-stationary noise environments. In: Proceedings of Interspeech 2012, pp. 595–598 (2012)
4. Stephan, D.E., Torsten, D.: Characterizing frequency selectivity for envelope fluctuations. J. Acoust. Soc. Am. 108, 1181 (2000)
5. Patterson, R., Nimmo-Smith, L., Holdsworth, J., Rice, P.: An auditory filter bank based on the gammatone function. Paper Presented at a Meeting of the IOC Speech Group on Auditory Modelling at RSRE, pp. 14–15 (1987)
6. Kondo, T., Amano, S., Sakamoto, S., Susuki, Y.: Development of familiarity-controlled word-lists (FW07). IEICE Tech. Rep. 107(436), 43–48 (2008)
7. Varga, A., Steeneken, J.M.H.: Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(13), 247–251 (1993) 8. Atlas, L., Greenberg, S., Hermansky, H.: The Modulation Spectrum and Its Application to Speech Science and Technology. Interspeech Tutorial, Antwerp (2007) 9. Kanai, Y., Morita, S., Unoki, M.: Concurrent processing of voice activity detection and noise reduction using empirical mode decomposition and modulation spectrum analysis. In: Proceedings of INTERSPEECH, pp. 742–746 (2013) 10. Zwicker, F.: Psychoacoustics: Facts and Models. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-68888-4
Designing Advanced Geometric Features for Automatic Russian Visual Speech Recognition

Denis Ivanko1, Dmitry Ryumin1, Alexandr Axyonov1, and Miloš Železný2

1 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
[email protected], [email protected], [email protected]
2 University of West Bohemia, Pilsen, Czech Republic
[email protected]
Abstract. The use of video information plays an increasingly important role in automatic speech recognition. Nowadays, audio-only based systems have reached a certain accuracy threshold, and many researchers see a solution to the problem in the use of the visual modality to obtain better results. Despite the fact that the audio modality of speech is much more representative than the video, their proper fusion can improve both the quality and the robustness of the entire recognition system, which has been proved in practice by many researchers. However, no agreement between researchers on the optimal set of visual features has been reached. In this paper, we investigate this issue in more detail and propose advanced geometry-based visual features for an automatic Russian lip-reading system. The experiments were conducted using the collected HAVRUS audio-visual speech database. The average viseme recognition accuracy of our system trained on the entire corpus is 40.62%. We also tested the main state-of-the-art methods for visual speech recognition, applying them to continuous Russian speech with high-speed recordings (200 frames per second).

Keywords: Lip-reading · Automatic speech recognition · Visual speech decoding · Visual features · Geometric features · Russian speech
1 Introduction
Nowadays, automatic speech recognition is one of the most rapidly developing areas of computer science. This fact is confirmed by the large number of practical applications appearing almost every day. At the moment, the most popular of the existing applications are Google "Speech API", Apple "Siri", Microsoft "Cortana", Amazon "Alexa", and Yandex "Alisa" from the giants of the global IT industry, which have earned the recognition of millions of users. Along with them, there are thousands of practical applications that have spread into many areas of human life: in automatic processing of incoming calls in telephone call-centers, in voice control for home appliances and car navigation systems, in social services for people with disabilities, in healthcare, military, education, etc.
The idea of using voice input for converting speech into text or for organizing human-machine interaction (HMI) is very convenient and more natural for users compared to keyboard input. In recent years, the widespread use of machine learning techniques and artificial neural networks (ANN) has made it possible to raise the accuracy and reliability of speech recognition systems to a new level. Usually, the maximization of recognition accuracy is achieved through the use of cascades of ANNs trained on various combinations of acoustic features [1]. However, despite sometimes satisfactory results obtained with proper training of the system for solving a particular task, the recognition accuracy for spontaneous and continuous speech in real-life conditions is still far from human capabilities. At the same time, audio-based practical applications have a number of significant drawbacks. When acoustic noises occur, the recognition accuracy of such systems rapidly deteriorates. To date, the main approach to solving this problem is the use of certain pre-processing techniques for noise reduction in the incoming signal. However, in real-life conditions there are many different types of noise: from stationary wideband noise in the telephone channel to crowd noise ("cocktail party" noise) in a room along with reverberation. Thus, it is not always possible to conduct proper noise reduction, and this is still an open and currently unresolved issue for automatic speech recognition. Obviously, in noisy conditions the recognition accuracy of automatic systems is also far from human capabilities. On the other hand, it is worth taking into account that human speech is bimodal by its nature, and people themselves pay attention to the lip movements of the interlocutor during a conversation [2]. Therefore, it is not entirely correct to expect from an automatic speech recognition system the same high result as from a human if it receives significantly less information. Because of this, many researchers have begun to use visual information about speech in their studies. First, it allows creating more robust systems (since video information is invariant to acoustic noise). Second, the correct fusion of modalities makes it possible to obtain the advantages of both and, at the same time, eliminate their shortcomings, giving the best recognition results. In this paper, we focus on the study of visual Russian speech and present a method for extracting an advanced set of geometric features. We also present the results of experiments obtained with the developed lip-reading system. The remainder of this paper is organized as follows. A review of the state of the research field is presented in Sect. 2; in Sect. 3, we describe the basic methodology, including region of interest (ROI) localization and the proposed geometry-based visual features; in Sect. 4, we describe the setup and the results of the experiments; some conclusions are given in Sect. 5.
2 Related Work
One of the first works in which researchers tried to systematize the existing knowledge about audio-visual speech recognition was [3]. The paper showed that, despite the fact that the visual modality of natural speech is much less informative than the audio one, the information received from it is often enough to solve simple tasks (e.g.
isolated word recognition). A description of the current state of the field can be found in the works [4, 5] dedicated to visual-only speech recognition, and also in the works [6, 7] dedicated to audio-visual speech recognition. In the framework of the statistical approach to speech recognition, a representative database for model training is an indispensable element. For English speech, multiple databases are publicly available, such as AVICAR [8], AVLetters [9], CUAVE [10], AVTimit [11], IBMSR [12], the PRAV Corpus [13], etc. However, the situation with Russian speech is much more complicated, since there are very few existing Russian visual speech databases. In our work, we used our own database of continuous Russian speech with high-speed recordings – HAVRUS, collected in 2016–2017 at SPIIRAS [14]. The next important step in the construction of a lip-reading system is to locate the region of interest that contains the mouth motion relevant to speech. It is important since the quality of the ROI has a significant influence on the recognition accuracy. To extract ROIs, many researchers relied on the active appearance model (AAM) [15, 16], a Haar-like feature based boosted classification framework [17, 18], skin color thresholding [19], etc. Despite numerous studies, researchers have not been able to find a best feature set universally accepted for representing visual speech (e.g., in comparison with the well-known MFCC features for acoustic speech). To date, there are several basic types of features which can be found in the literature. The most frequently used of them are: pixel-based features [20] – raw pixel data used directly or after some image transformation; geometry-based features [21] – geometric information about the talking lips is extracted as features; motion-based features [22] – features designed to describe the motion; model-based features [23] – a model of the visible articulators is built and the compact model parameters are used as visual features; or a combination of the above-mentioned features [24, 25]. There are also several state-of-the-art methods for model training. Initially, the most widespread methods were based on the use of Hidden Markov Models (HMM) for lip-reading and their coupled or multistream versions for audio-visual speech recognition [26]. However, at present, approaches based on the use of neural networks of different architectures have become increasingly popular [27]. In this research, we used an AAM-based algorithm for ROI localization, the developed geometric features to extract information about the uttered speech, and a multilayer neural network for classification.
3 Methodology
3.1 Region of Interest Localization

Since the most valuable information about the pronounced speech is contained in the mouth area, the first important step is the preprocessing of raw video frames and ROI localization. For this purpose, we used an AAM-based algorithm implemented in the Dlib open source computer vision library [28]. The main idea of the algorithm is to match a statistical model of object shape and appearance, containing a set of facial landmarks, to a new image. The face detector we use is made using the classic Histogram of Oriented
Gradients (HOG) feature combined with a linear classifier, an image pyramid, and sliding window detection scheme. Figure 1 shows how to find facial landmarks in an image using this method. These are points on the face such as the corners of the mouth, along the eyebrows, on the eyes, and so forth.
Fig. 1. Full 2D face shape model used in [29] (left) and the face landmarks localization algorithm results (right).
Thus, on each frame where a face was found, we get the coordinates of the facial landmarks, 20 of which are located in the mouth region (12 on the external and 8 on the internal borders of the lips). This method works very well on frontal face images and, since the HAVRUS database contains frontal video recordings, we managed to obtain very precise coordinates of the lip landmarks.

3.2 Feature Extraction Method

The general structure of the proposed method for extracting geometric features is shown in Fig. 2 and includes the sequential execution of the following 5 steps (a code sketch of steps 2–4 is given after the list):
1. Load a frame from a video file.
2. Use the facial landmarks detection algorithm described in Sect. 3.1 to find 68 facial key points.
3. Normalize the coordinates of the obtained landmarks in order to bring the data to a single format, as it was done in the work [30].
4. Calculate a number of Euclidean distances [31] between landmarks in accordance with Table 1.
5. Save the feature vector.
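A minimal sketch of steps 2–4, assuming the dlib frontal-face detector with the standard 68-point landmark model; the model file name, the simple mean/std normalization and the LANDMARK_PAIRS list are illustrative assumptions rather than details taken from the paper:

import cv2
import dlib
import numpy as np

# The 24 landmark pairs from Table 1 (1-based numbering of the 68-point scheme).
LANDMARK_PAIRS = [(49, 61), (61, 60), (60, 68), (68, 59), (59, 67), (67, 58),
                  (67, 57), (57, 66), (66, 56), (56, 65), (65, 55), (65, 54),
                  (54, 64), (64, 53), (64, 52), (52, 63), (52, 62), (62, 51),
                  (62, 50), (50, 61), (62, 68), (63, 67), (64, 66), (61, 65)]

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def geometric_features(frame):
    """Return the 24-dimensional distance vector for one video frame, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)],
                   dtype=np.float64)
    # A simple shift-and-scale normalization, standing in for the scheme of [30].
    pts = (pts - pts.mean(axis=0)) / (pts.std(axis=0) + 1e-8)
    # Euclidean distances between the selected pairs (converted to 0-based indices).
    return np.array([np.linalg.norm(pts[a - 1] - pts[b - 1])
                     for a, b in LANDMARK_PAIRS])

The 24 returned distances correspond to the pairs listed in Table 1, with the 1-based landmark numbers converted to dlib's 0-based indexing.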
Fig. 2. General diagram of the feature extraction method.
Table 1. Relationships between facial landmarks used for feature extraction.

#    Distance between landmarks (№)      #    Distance between landmarks (№)
1    49–61                               13   54–64
2    61–60                               14   64–53
3    60–68                               15   64–52
4    68–59                               16   52–63
5    59–67                               17   52–62
6    67–58                               18   62–51
7    67–57                               19   62–50
8    57–66                               20   50–61
9    66–56                               21   62–68
10   56–65                               22   63–67
11   65–55                               23   64–66
12   65–54                               24   61–65
In this work, we attempted to determine an optimal set of geometric features to maximize the recognition accuracy. Figure 3 shows 24 pairs of key points (highlighted in green) that have been selected experimentally and convey the most valuable information about the uttered speech. Table 1 lists these 24 selected features. The columns indicate the landmark numbers, in accordance with the map (Fig. 1, left), between which the Euclidean distance was taken as a feature; e.g. feature #24 is the width of the internal borders of the lips (landmarks 61 to 65). The results given in the experimental section were obtained using this feature set.
Fig. 3. Examples of the detected ROIs with 20 landmarks in the mouth region. (Color figure online)
3.3 MLP Training

For viseme classification we used Multi-layer Perceptrons (MLPs) trained with the Scikit-learn free software machine learning library [32]. The MLP is a supervised learning algorithm that learns a function f(·): R^m → R^o by training on a dataset, where m is the number of dimensions of the input and o is the number of dimensions of the output. Given a set of features X = x1, x2, . . . , xm and a target y, it can learn a non-linear function approximator for either classification or regression. Figure 4 shows an MLP with one hidden layer and scalar output [32].
Fig. 4. MLP with one hidden layer [32].
The leftmost layer, known as the input layer, consists of a set of neurons xi | x1 , x2 , … , xm representing the input features (24 in our case). Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation 𝜔1 x1 + 𝜔2 x2 + … + 𝜔m xm, followed by a non-linear activation function g(⋅):R → R. The output layer receives the values from the last hidden layer and trans‐ forms them into output values. In this work, we used the following basic parameters of a neural network: as an activation function we used the rectified linear unit function, returns f (x) = max(0, x). The number of neurons in the hidden layer ranged from 1 to 100. Batch size was calcu‐ lated by the formula batch_size = min(200, n_samples). Maximum number of iterations was 200 with 1e–4 tolerance for the optimization.
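Assuming scikit-learn's MLPClassifier (the paper names the library but not the exact training script, so the snippet below is only a sketch with placeholder data X_train and y_train):

from sklearn.neural_network import MLPClassifier

# One hidden layer; the number of hidden units was varied from 1 to 100
# in the experiments (85 gave the best result reported in Sect. 4.2).
clf = MLPClassifier(hidden_layer_sizes=(85,),
                    activation="relu",   # f(x) = max(0, x)
                    batch_size="auto",   # min(200, n_samples)
                    max_iter=200,
                    tol=1e-4)
clf.fit(X_train, y_train)                # X_train: N x 24 geometric feature vectors
probabilities = clf.predict_proba(X_test)

As described in Sect. 4.1, 48 such MLPs are trained (one per viseme class), and at test time the class whose MLP returns the highest probability wins.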
4 Evaluation Experiments
4.1 Experimental Setup

The experiments were carried out using the HAVRUS corpus, consisting of high-speed (200 fps) video recordings of 20 speakers. Each of them uttered 200 Russian phrases taken from phonetically rich texts. The resolution of the video data is 640 × 480 pixels. The database also contains phoneme and viseme labeling. In this work, we solved the so-called phoneme/viseme recognition task (the input of the system is an image; the output is the recognized phoneme/viseme). According to the existing HAVRUS labeling, we divided the available data into 48 classes, according to the number of phonemes in the Russian language. After that, the data for each class was divided with a ratio of training data to test data of 75:25%. Then, 48 MLPs were trained (one for each viseme class) according to Sect. 3.3. Thus, when an input image from the test set is fed in, we get the probability of its belonging to a certain class from each MLP, and the MLP with the highest probability wins. Accuracy in this case means correct
recognition of visemes on the test set, in comparison with the viseme labeling in the speech database.

4.2 Experimental Results

The best recognition result of the system trained in this way is 40.62%, obtained with an 85-neuron MLP. Figure 5 shows the dependence of the viseme recognition accuracy on the number of hidden neurons. Of course, the main goal of this work was not finding the best configuration of the neural network. However, we can also observe a trend of increasing accuracy with an increasing number of neurons, up to a certain limit.
Fig. 5. Viseme recognition accuracy trained on the HAVRUS corpus.
The main task of this study was to find the optimal set of geometric features. We can say that the preliminary results of this work are a necessary intermediate step to improve the existing baseline of audio-visual Russian speech recognition [26, 33] and will be used for this purpose in our future research.
5 Conclusions and Future Work
In this paper, we present an advanced set of geometric features designed to improve the accuracy of a lip-reading system for Russian and also report our preliminary experimental results. The experiments were conducted using the developed MLP-based lip-reading system, trained with the Scikit-learn machine learning library. The average recognition accuracy of the system trained on the HAVRUS database reaches 40.62%. The results of this work will be used in future studies to improve the audio-visual baseline for continuous Russian speech recognition [26, 33]. The fusion of different types of visual features is also of great practical interest for future research.
Acknowledgments. This research is financially supported by the Ministry of Education and Science of the Russian Federation, agreement No. 14.616.21.0095 (reference RFMEFI616 18X0095) and by the Ministry of Education of the Czech Republic, project No. LTARF18017.
References 1. Yu, D., Deng, L.: Automatic Speech Recognition. SCT. Springer, London (2015). https:// doi.org/10.1007/978-1-4471-5779-3 2. McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264, 746–748 (1976) 3. Potamianos, G., Neti, C., Matthews, I.: Audio-visual automatic speech recognition: an overview. Issues Audio Vis. Speech Process. 22, 23 (2004) 4. Zhou, Z., Zhao, G., Hong, X., Pietikainen, M.: A review of recent advances in visual speech decoding. Image Vis. Comput. 32, 590–605 (2014) 5. Bowden, R., et al.: Recent developments in automated lip-reading. In: Proceedings of SPIE, Optics and Photonics for Counterterrorism, Crime Fighting and Defence IX, vol. 8901, p. 13 (2013) 6. Katsaggelos, K., Bahaadini, S., Molina, R.: Audiovisual fusion: challenges and new approaches. Proc. IEEE 103(9), 1635–1653 (2015) 7. Seong, T.W., Ibrahim, M.Z.: A review of audio-visual speech recognition. J. Telecommun. Electron. Comput. Eng. 10(1–4), 35–40 (2018) 8. Lee, B., et al.: AVICAR: audio-visual speech corpus in a car environment. In: Proceedings of Interspeech 2004, pp. 380–383 (2004) 9. Cox, S., Harvey, R., Lan, Y., Newmann, J., Theobald, B.: The challenge of multispeaker lipreading. In: Proceedings of the International Conference Auditory-Visual Speech Process (AVSP), pp. 179–184 (2008) 10. Patterson, E., Gurbuz, E., Tufekci, Z., Gowdy, J.: CUAVE: a new audio-visual database for multimodal human-computer interface research. In: Proceedings of the IEEE ICASSP 2002, vol. 2, pp. 2017–2020 (2002) 11. Hazen, T., Saenko, K., La, C., Glass, J.: A segment-base audio-visual speech recognizer: data collection, development, and initial experiments. In: Proceedings of the International Conference Multimodal Interfaces, pp. 235–242 (2004) 12. Lucey, P., Potaminanos, G., Sridharan, S.: Patch-based analysis of visual speech from multiple views. In: Proceedings of AVSP 2008, pp. 69–74 (2008) 13. Abhishek, N., Prasanta, K.G.: PRAV: a phonetically rich audio visual corpus. In: Proceedings of Interspeech 2017, pp. 3747–3751 (2017) 14. Verkhodanova, V., Ronzhin, A., Kipyatkova, I., Ivanko, D., Karpov, A., Železný, M.: HAVRUS corpus: high-speed recordings of audio-visual Russian speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 338–345. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_40 15. Newman, J., Cox, S.: Language identification using visual features. Proc. IEEE Audio Speech Lang. Process. 20(7), 1936–1947 (2012) 16. Lan, Y., Theobald, B., Harvey, R.: View independent computer lip-reading. In: Proceedings of the International Conference Multimedia Expo (ICME), pp. 432–437 (2012) 17. Zhou, Z., Hong, X., Zhao, G., Pietikainen, M.: A compact representation of visual speech data using latent variables. Proc. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 181–187 (2014) 18. Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. Proc. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)
19. Estellers, V., Thiran, J.: Multi-pose lipreading and audio-visual speech recognition. EURALISP J. Adv. Signal Process. 51 (2012) 20. Hong, X., Yao, H., Wan, Y., Chen, R.: A PCA Based visual DCT feature extraction method for lip-reading. In: Proceedings of the International Conference Intelligent Information Hiding Multimedia, Signal Process, pp. 321–326 (2006) 21. Cetingul, H., Yemez, Y., Erzin, E., Tekalp, A.: Discriminative analysis of lip motion features for speaker identification and speech-reading. Proc. IEEE Trans. Image Process. 15(10), 2879–2891 (2006) 22. Yoshinaga, T., Tamura, S., Iwano, K., Furui, S.: Audio-visual speech recognition using lipmovement extracted from side-face images. In: Proceedings of the International Conference Auditory-Visual Speech Processing (AVSP), pp. 117–120 (2003) 23. Lan, Y., Theobald, B., Harvey, R., Ong, E., Bowden, R.: Improving visual features for lipreading. In: Proceedings of the International Conference Auditory Visual Speech Processing (AVSP), pp. 142–147 (2010) 24. Radha, N., Shahina, A., Khan, A.: An improved visual speech recognition of isolated words using combined pixel and geometric features. Proc. J. Sci. Technol. 9(44), 7 (2016) 25. Rahmani, M.H., Alamsganj, F.: Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In: 3D International Conference on Pattern Recognition and Image Analysis, pp. 195–199 (2017) 26. Ivanko, D., et al.: Using a high-speed video camera for robust audio-visual speech recognition in acoustically noisy conditions. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 757–766. Springer, Cham (2017). https://doi.org/ 10.1007/978-3-319-66429-3_76 27. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 87–103. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-54184-6_6 28. Implementation of Computer Vision Library. https://github.com/davisking/dlib. Accessed 30 Apr 2018 29. Baltrusaitis, T., Deravi, F., Morency, L.: 3D constrained local model for rigid and non-rigid facial tracking. In: Computer Vision and Pattern Recognition (CVPR), pp. 2610–2617 (2012) 30. Howell, D., Cox, S., Theobald, B.: Visual units and confusion modelling for automatic lipreading. Image Vis. Comput. 51, 1–12 (2016) 31. Description of Euclidean Distance Calculation. https://en.wikipedia.org/wiki/Euclidean_distance. Accessed 30 Apr 2018 32. Machine Learning Toolkit. http://scikit-learn.org/stable/. Accessed 30 Apr 2018 33. Ivanko, D., et al.: Multimodal speech recognition: increasing accuracy using high-speed video data. J. Multimodal User Interfaces (JMUI) (2018, in press)
On the Comparison of Different Phrase Boundary Detection Approaches Trained on Czech TTS Speech Corpora Markéta Jůzová, Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic, [email protected]
Abstract. Phrasing is a very important issue in the process of speech synthesis since it ensures higher naturalness and intelligibility of synthesized sentences. There are many different approaches to phrase boundary detection, including simple classification-based, HMM-based and CRF-based approaches; however, different types of neural networks are used for this task as well. The paper compares representative methods for the phrasing of Czech sentences using large-scale TTS speech corpora as training data, taking only the speaker-dependent phrasing issue into consideration. Keywords: Phrase boundary · Speech corpus · Classification · Conditional random fields · Neural networks
1 Introduction
The natural splitting of a sentence during speech into smaller parts, audibly separated usually by a pause, is called "phrasing" [22]. The main reason for phrasing is definitely the better intelligibility of the message passed in speech. However, one of the other reasons why people divide a sentence into phrases lies in the need to take a breath. Thus, speech without any pauses sounds unnatural and robotic. In spite of the fact that TTS systems do not need to breathe, it is common to deal with the phrasing issue as a part of the text normalization sub-system. In general, it is not a simple task – the position of phrase breaks is not clearly defined in the speech, and the pauses highly depend on the particular speaker, the speech rate and the particular situation or purpose of the speech [24]. The phrase boundary detection task can be defined as a sequence-to-sequence problem [32]: a list of words (or tokens) w0, w1, . . . , wn should be assigned a list of juncture types j0, j1, . . . , jn, where ji = 1 if a phrase break follows the word wi and ji = 0 otherwise. There have been many different approaches to this natural language processing (NLP) task, usually reported for English. Besides deterministic approaches based on punctuation marks or function/content words [33], there are many classification-based approaches
[5,9,21,25] using different features. However, the main disadvantage of these algorithms is that the decision about each juncture type is made separately. The authors of [33] first used an HMM model for phrase break prediction; a similar approach was also presented e.g. in [31]. Other techniques used are conditional random fields (CRF) [3,10,13] and all kinds of neural networks, e.g. [2,30]. The reason for testing neural networks (and CRF and HMM) for the purpose of phrasing lies in the nature of the phrase break detection task – it is a sequence-to-sequence modelling problem, so it seems reasonable to use a sequence modelling framework, as these methods may be better suited for it compared to the "classical" classification-based approaches. Phrases, in general, usually contain only several words (see Fig. 1), the phrase boundaries are inserted at certain intervals, and they depend on the other breaks in the sentence as well – as described e.g. in [32,33].
Fig. 1. A histogram of phrase and sentence lengths in the Czech speech corpus [18] (marked as corpus1 in Sect. 3).
In the TTS system ARTIC [17,35], developed at the author's department, the appropriate phrasing of recorded and input text sentences has emerged as very important. That is why the phrase boundaries in the recorded sentences are thoroughly detected by the automatic segmentation process [14] and are still being corrected [4] to ensure the most accurate possible description of the speech data. On the other hand, the input text sentences (those to be synthesized) are still split into phrases by a simple algorithm based on commas. Note that in Czech texts the commas are much more frequent compared e.g. to English texts, so they are a good indicator for phrase boundary detection. However, phrasing based only on commas can produce extremely long phrases, for example in the case of a long compound sentence containing several simple
sentences joined with a coordinate conjunction (e.g. a, EN: and) where no comma is written in Czech. Afterwards, the created text phrases are synthesized using the formal prosody grammar features (so-called "prosodemes", see [8,26,27,34] for more details) to ensure (in unit selection) the selection of appropriate unit candidates using Viterbi search [36] – the symbolic prosody feature (prosodemes) ensures a correct behaviour of the F0 contour at phrase-final words to keep the required communication function; the prosodeme agreement is one of the components of the target cost in TTS ARTIC [15]. There have not been many other approaches to the phrasing of the text sentences for TTS ARTIC (except punctuation-based ones), e.g. [29]. However, recently, different classification-based approaches were compared in [7], and a CRF-based boundary detector was trained [6], which proved to be the best option among the tested methods. In the last decade, neural networks (NN) have been used more and more often for various NLP tasks, so it was decided to try new approaches to phrase boundary detection in Czech sentences. Their overall comparison is the main scope of the presented paper.
2 Data Acquisition
Gathering data for phrase boundary detection can be a demanding task – the annotator agreement (both for text and speech data) is quite low, as shown in [6,28]. And, as proved e.g. in [6,24] and mentioned in Sect. 1, phrase boundary detection is a speaker-dependent task. For these reasons, the author decided to use data from speech corpora recorded by professional speakers for the purposes of the Czech TTS ARTIC [17,35]. All the recorded sentences had been manually checked by human annotators and also automatically corrected [19] (to reveal slips of the tongue or swapped short words – the most frequent speaker errors), and then the automatic pitch-synchronous segmentation process [11,12,14,16] was performed. The resulting segmented speech corpora contain information about the positions of pauses and breaths in the speech, and this information (together with commas, see Sect. 1) is used for the presented experiment. As all the speakers were professionals, the speech breaks are expected to occur in reasonable places in the read sentences. The "true" phrase boundary is set (ji = 1) after every word wi which
– is followed by a comma in the text sentence, or
– is followed by a pause or a breath in the spoken sentence.
Note that it was decided to use both commas and speech pauses/breaths as phrase breaks in the presented study, contrary e.g. to [23], which considers only speech pauses to be phrase boundaries, since commas are, especially in the Czech language, good indicators of phrase breaks in speech. However, the breaths and speech pauses in the corpora (not associated with any comma) represent a "value added" and, hopefully, might ensure more accurate phrasing.
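A minimal sketch of this labeling rule (the variable names are illustrative, not taken from the paper):

def juncture_labels(words, has_comma, followed_by_pause_or_breath):
    """Return j_i = 1 if a phrase break follows word w_i, else 0."""
    return [1 if (has_comma[i] or followed_by_pause_or_breath[i]) else 0
            for i in range(len(words))]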
2.1 Features

The compared approaches (except LSTM emb and Bi emb; see Sect. 3) use the following set of features for a word wi, inspired e.g. by [5,31] and used also in the previous studies [6,7]:
– the word wi,
– whether the word wi has a comma (i.e. is followed by a comma),
– the following word wi+1,
– the morphological tag ti of the word wi,
– the morphological tag ti+1 of the word wi+1,
– the bigram ti + ti+1,
– the trigram ti−1 + ti + ti+1,
– the sentence length N,
– the position Ni of the word wi in the sentence,
– the distance from the preceding word followed by a comma, i − iLC (iLC ≤ i; iLC = 0 if none of the words w0 . . . wi−1 has a comma),
– the distance to the next word followed by a comma, iNC − i (iNC ≥ i; iNC = N − 1 if none of the words wi+1 . . . wN−1 has a comma).
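As an illustration, this feature set can be encoded as one dictionary per word, which is the input format of common CRF toolkits such as sklearn-crfsuite; the paper does not name the CRF implementation used, so the snippet below is only a sketch with placeholder training data X_train and y_train:

import sklearn_crfsuite

def word_features(words, tags, comma_after, i):
    """Feature dictionary for word i; comma_after[i] is True if a comma follows w_i."""
    n = len(words)
    i_lc = max((j for j in range(i) if comma_after[j]), default=0)
    i_nc = next((j for j in range(i, n) if comma_after[j]), n - 1)
    return {
        "word": words[i],
        "comma": comma_after[i],
        "next_word": words[i + 1] if i + 1 < n else "",
        "tag": tags[i],
        "next_tag": tags[i + 1] if i + 1 < n else "",
        "bigram": tags[i] + "+" + (tags[i + 1] if i + 1 < n else ""),
        "trigram": (tags[i - 1] if i > 0 else "") + "+" + tags[i] + "+"
                   + (tags[i + 1] if i + 1 < n else ""),
        "sent_len": n,
        "position": i,
        "dist_from_comma": i - i_lc,
        "dist_to_comma": i_nc - i,
    }

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)   # X_train: list of sentences, each a list of feature dicts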
Besides the features listed above, some of the presented methods use only word embeddings, as they have proved to be powerful input representations in many NLP tasks [1,37], including phrasing; these high-dimensional vectors, as shown e.g. in [20], are able to capture general syntactic and semantic properties of words, as well as their relations – so it is tested whether they are able to substitute for the list of features.
3 Compared Approaches
The scope of this paper is to compare different phrase boundary detection approaches. The author chose representatives among simple deterministic phrasing methods, classification-based approaches, and different types of neural networks. The full list of the methods used is given below:
– Comma – a simple approach which splits the given sentence after every comma (currently used in TTS ARTIC)
– LogReg – Logistic Regression classifier¹
– SVC – Support Vector Machines with a linear kernel¹
– CRF – Conditional Random Fields
– MLP – Multi-layer Perceptron (MLP) with a 30-dimensional input layer and a 100-dimensional hidden layer (all layers are fully connected), with dropout set to 0.2; trained for 100 epochs

¹ Note that no cross-validation results for the classifiers' parameters are presented in this paper since they were a part of the previous study in [7], and the parameters were set according to the best results shown in the aforementioned paper.
– LSTM – a neural network with two long short-term memory (LSTM) layers (each with 200 units), an output fully connected layer and dropout set to 0.2; trained for 100 epochs
– LSTM emb – equal to LSTM, but with an input embedding layer with 200 units
– Bi emb – a bidirectional neural network (with 100 LSTM units in each layer) with an input embedding layer (200 units), an output fully connected layer and dropout set to 0.1

Note that padding was performed on all sentences for all of the above approaches for a fair comparison, as some NN approaches require input data of the same length. The last two models do not use the set of features listed in Sect. 2.1, only word embeddings; a sketch of such an architecture is given below.
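The paper does not state which deep-learning toolkit was used for the NN models; purely as an illustration, the Bi emb architecture could be expressed in Keras roughly as follows (vocab_size, max_len and the padded arrays X_pad, y_pad are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     Dropout, TimeDistributed, Dense)

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=200, input_length=max_len),
    Bidirectional(LSTM(100, return_sequences=True)),   # 100 LSTM units per direction
    Dropout(0.1),
    TimeDistributed(Dense(2, activation="softmax")),   # break / no break per word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_pad, y_pad, epochs=100)   # padded word-index sequences and juncture labels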
Table 1. Comparison of speaker-dependent phrase boundary detection for 3 Czech speech corpora (the results of simple classification-based approaches could slightly differ from the results presented in [6] due to padding of corpus sentences to the same length).

Corpus          Classifier   tp     fn    fp    tn      A      R      P      F1
corpus1-male    Comma        2407   781   0     75472   0.990  0.755  1.000  0.860
                LogReg       2772   416   116   75356   0.993  0.870  0.960  0.912
                SVC-lin      2784   404   134   75338   0.993  0.873  0.954  0.912
                CRF          2857   331   170   75302   0.994  0.896  0.944  0.919
                MLP          2521   667   95    75377   0.990  0.791  0.964  0.869
                LSTM         2344   844   269   75203   0.986  0.735  0.897  0.808
                LSTM emb     2337   851   284   75188   0.986  0.733  0.892  0.805
                Bi emb       2365   823   246   75226   0.986  0.742  0.906  0.816
corpus2-male    Comma        2319   109   0     62534   0.998  0.955  1.000  0.977
                LogReg       2335   93    1     62533   0.999  0.962  1.000  0.980
                SVC-lin      2336   92    1     62533   0.999  0.962  1.000  0.980
                CRF          2343   85    3     62531   0.999  0.965  0.999  0.982
                MLP          2361   67    0     62534   0.999  0.972  1.000  0.986
                LSTM         2189   239   152   62382   0.994  0.902  0.935  0.918
                LSTM emb     2177   251   167   62367   0.994  0.897  0.929  0.912
                Bi emb       2305   123   85    62449   0.997  0.949  0.964  0.957
corpus3-female  Comma        2377   155   0     65403   0.998  0.939  1.000  0.968
                LogReg       2397   135   8     65395   0.998  0.947  0.997  0.971
                SVC-lin      2396   136   9     65394   0.998  0.946  0.996  0.971
                CRF          2407   123   16    65387   0.998  0.951  0.993  0.972
                MLP          2387   145   1     65403   0.998  0.943  1.000  0.970
                LSTM         2231   301   133   65270   0.994  0.881  0.944  0.911
                LSTM emb     1832   700   217   65186   0.987  0.724  0.894  0.800
                Bi emb       2237   295   97    65306   0.994  0.883  0.958  0.919
4 Results
The overall comparison of the tested approaches is shown in Table 1 using 4 standard evaluation measures – accuracy (A), recall (R), precision (P) and F1-score (F1) – and the numbers of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). All the results are calculated at the word level. The results show that the CRF-based approach outperformed the other methods for two of the tested corpora, even those using different neural networks. For the corpus2-male data, the MLP approach provided the best results. In the author's opinion, that is because of a slightly different nature of this corpus – the male speaker made almost all pauses/breaths at the comma punctuation, compared to the others. In any case, the results are quite surprising, as many studies on the phrasing of English sentences have indicated that LSTM or bidirectional networks (sometimes using embeddings instead of a set of features) achieve better results compared to other approaches, e.g. [30]. The highest results for our Czech data among the NN-based approaches were achieved by MLP; it was also shown that the bidirectional NNs are more powerful compared to plain LSTM. It is also obvious that the word embeddings are not able to fully substitute for the set of features. The problem might be the number of vector dimensions – the Czech language is much more complex compared to English, so smaller word embeddings may not be able to cover all the semantic and syntactic properties of words (to compare, only several tens of part-of-speech tags are defined for English but about 3000 for Czech).
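For reference, the four measures in Table 1 follow from the confusion counts in the standard way:

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# The CRF row of corpus1-male rounds to (0.994, 0.896, 0.944, 0.919):
print(metrics(2857, 331, 170, 75302))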
5 Conclusion
The testing and comparison of different speaker-dependent phrase boundary detection approaches on Czech speech corpora showed that, in general, the CRF model is able to outperform the others. However, the widespread use of neural networks motivates the author to test more NN approaches for the Czech phrasing task. Some results are promising, which indicates that more experiments (with different settings) should be performed to find the optimal solution. As future work, it is also planned to apply the presented methods to other large-scale corpora built for the purposes of TTS ARTIC – both Czech and English. Acknowledgments. The work has been supported by the grant of the University of West Bohemia, project No. SGS-2016-039, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References 1. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 999888, 2493–2537 (2011) 2. Fernandez, R., Rendel, A., Ramabhadran, B., Hoory, R.: Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In: Proceedings of Interspeech 2014, pp. 2268–2272. ISCA, September 2014 3. Gregory, M.L.: Using conditional random fields to predict pitch accents in conversational speech. In: Proceedings of ACL 2004. ACL, East Stroudsburg, pp. 677–684 (2004) 4. Hanzl´ıˇcek, Z.: Correction of prosodic phrases in large speech corpora. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 408–417. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-455105 47 5. Hirschberg, J., Prieto, P.: Training intonational phrasing rules automatically for English and Spanish text-to-speech. Speech Commun. 18(3), 281–290 (1996) 6. J˚ uzov´ a, M.: CRF-based phrase boundary detection trained on large-scale TTS speech corpora. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 272–281. Springer, Cham (2017). https://doi.org/ 10.1007/978-3-319-66429-3 26 7. J˚ uzov´ a, M.: Prosodic phrase boundary classification based on Czech speech corpora. In: Ekˇstein, K., Matouˇsek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 165–173. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-642062 19 8. J˚ uzov´ a, M., Tihelka, D., Vol´ın, J.: On the extension of the formal prosody model for TTS. In: Text, Speech and Dialogue. Lecture Notes in Computer Science, Springer, Heidelberg (2018) 9. Koehn, P., Abney, S., Hirschberg, J., Collins, M.: Improving intonational phrasing with syntactic information. In: Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1289–1290 (2000) 10. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001) 11. Leg´ at, M., Matouˇsek, J., Tihelka, D.: A robust multi-phase pitch-mark detection algorithm. Proc. Interspeech 2007, 1641–1644 (2007) 12. Leg´ at, M., Matouˇsek, J., Tihelka, D.: On the detection of pitch marks using a robust multi-phase algorithm. Speech Commun. 53(4), 552–566 (2011) 13. Louw, A., Moodley, A.: Speaker specific phrase break modeling with conditional random fields for text-to-speech. In: Proceedings of 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), pp. 1–6 (2016) 14. Matouˇsek, J., Romportl, J.: Automatic pitch-synchronous phonetic segmentation. In: Proceedings of Interspeech 2008, pp. 1626–1629. ISCA (2008) 15. Matouˇsek, J., Leg´ at, M.: Is unit selection aware of audible artifacts? SSW 2013. In: Proceedings of the 8th Speech Synthesis Workshop, pp. 267–271. ISCA, Barcelona (2013) 16. Matouˇsek, J., Tihelka, D.: Classification-based detection of glottal closure instants from speech signals. In: Proceedings of Interspeech 2017, pp. 3053–3057. ISCA (2017)
17. Matouˇsek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006). https://doi.org/10. 1007/11846406 55 18. Matouˇsek, J., Romportl, J.: Recording and annotation of speech corpus for Czech unit selection speech synthesis. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 326–333. Springer, Heidelberg (2007). https://doi. org/10.1007/978-3-540-74628-7 43 19. Matouˇsek, J., Tihelka, D.: Annotation errors detection in TTS corpora. In: Proceedings of Interspeech 2013, pp. 1511–1515. ISCA (2013) 20. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of 2013 NAACL HLT, pp. 746–751 (2013) 21. Mishra, T., Jun Kim, Y., Bangalore, S.: Intonational phrase break prediction for text-to-speech synthesis using dependency relations. In: Proceedings of ICASSP 2015, pp. 4919–4923 (2015) ˇ 22. Palkov´ a, Z.: Rytmick´ a v´ ystavba prozaick´eho textu. Studia CSAV; ˇcis. 13/1974, Academia (1974) 23. Parlikar, A., Black, A.W.: Data-driven phrasing for speech synthesis in low-resource languages. Proc. ICASSP 2012, 4013–4016 (2012) 24. Prahallad, K., Raghavendra, E.V., Black, A.W.: Learning speaker-specific phrase breaks for text-to-speech systems. In: The 7th ISCA Tutorial and Research Workshop on Speech Synthesis, pp. 162–166 (2010) 25. Read, I., Cox, S.: Stochastic and syntactic techniques for predicting phrase breaks. Comput. Speech Lang. 21(3), 519–542 (2007) 26. Romportl, J., Matouˇsek, J.: Formal prosodic structures and their application in NLP. In: Matouˇsek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 371–378. Springer, Heidelberg (2005). https://doi.org/10.1007/ 11551874 48 27. Romportl, J.: Structural data-driven prosody model for TTS synthesis. In: Proceedings of Speech Prosody 2006, pp. 549–552. TUDpress, Dresden (2006) 28. Romportl, J.: Automatic prosodic phrase annotation in a corpus for speech synthesis. In: Proceedings of Speech Prosody 2010. University of Illionois, Chicago (2010) 29. Romportl, J., Matouˇsek, J.: Several aspects of machine-driven phrasing in text-tospeech systems. Prague Bull. Math. Linguist. 95, 51–61 (2011) 30. Rosenberg, A., Fernandez, R., Ramabhadran, B.: Modeling phrasing and prominence using deep recurrent learning. In: Proceedings of Interspeech 2015, pp. 3066– 3070. ISCA (2015) 31. Sun, X., Applebaum, T.H.: Intonational phrase break prediction using decision tree and n-gram model. Proc. Eurospeech 2001, 3–7 (2001) 32. Taylor, P.: Text-to-Speech Synthesis, 1st edn. Cambridge University Press, New York (2009) 33. Taylor, P., Black, A.W.: Assigning phrase breaks from part-of-speech sequences. Comput. Speech Lang. 12(2), 99–117 (1998) 34. Tihelka, D.: Symbolic prosody driven unit selection for highly natural synthetic speech. In: Proceedings of Interspeech 2005 - Eurospeech, pp. 2525–2528. ISCA (2005) 35. Tihelka, D., Hanzl´ıˇcek, Z., J˚ uzov´ a, M., V´ıt, J., Matouˇsek, J., Gr˚ uber, M.: Current state of text-to-speech system ARTIC: A decade of research on the field of speech technologies. In: Text, Speech and Dialogue. Lecture Notes in Computer Science, Springer, Heidelberg (2018)
36. Tihelka, D., Kala, J., Matouˇsek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: Proceedings of Interspeech 2010, pp. 174–177. ISCA (2010) 37. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semisupervised learning. In: Proceedings of ACL 2010, pp. 384–394. ACL (2010)
Word-Initial Consonant Lengthening in Stressed and Unstressed Syllables in Russian Tatiana Kachkovskaia(B) and Mayya Nurislamova Saint Petersburg State University, St. Petersburg, Russia [email protected], [email protected]
Abstract. This paper deals with consonant lengthening effects caused by word-initial position in interaction with stress-induced lengthening. Experiment 1, based on a 30-h speech corpus, showed that in general word-initial lengthening is more pronounced in stressed syllables than in unstressed. The lengthening effect is also stronger for consonants in CV syllables compared with CCV syllables. Additionally, it was shown that consonant duration serves to signal word stress, and the reduction pattern for consonants is similar to that for vowels. Experiment 2, based on controlled laboratory data, showed that not all the speakers choose the strategy of signaling word boundaries and word stress with consonant lengthening; presumably, it depends on the speaking style. It was also shown that in CCV syllables the first consonant might be responsible for signaling word boundary, while the second–lexical stress. Keywords: Consonant duration · Word-initial lengthening Word stress · Prosodic boundaries · Russian
1 Introduction
Recent studies for various languages show that phrase boundaries are marked at both ends–initially and finally. In terms of tone, the beginning of an intonational phrase is often signalled by declination reset, while the end of an IP might be marked by a specific boundary tone, or contain a complex or wide melodic movement–in cases when the nucleus occurs phrase-finally. Similar effects are observed for duration, intensity and spectral characteristics, although the latter two are considered weaker cues [1–3]. Lengthening at ends of utterances and intonational phrases–final lengthening–is considered a universal phenomenon [4]. For Russian, however, it is known that the lengthening effect is highly dependent on whether the phrase is followed by a pause or not [5]. The phenomenon of lengthening at the other end–the beginning of a phrase–is called “initial strengthening” [8,9]. For IPs and utterances it is not yet considered universal, as very few languages have been analysed so far. For Russian it was documented in [6] for utterance-initial vowels. c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 264–273, 2018. https://doi.org/10.1007/978-3-319-99579-3_28
Similar effects for word boundaries have been found in a few languages as well [7]. The aim of this study was to find out whether this is also observed in Russian for word-initial consonants. When dealing with segmental duration in Russian, one should bear in mind the phenomenon of vowel reduction, as it has a great impact on vowel duration. We know of two factors that influence vowel reduction in Russian: the position of the syllable relative to the stressed syllable (1st pretonic vs. other pretonic and all post-tonic), and the position relative to the word boundary (absolute-initial vs. non-absolute-initial vowels)¹ [10,11]. Although we do not have much data on consonant duration in this respect, we might hypothesize that durational changes in vowels may be accompanied by consonant duration changes as well. In most papers on the temporal organization of speech much attention is paid to vowel duration. However, in terms of articulatory mechanisms, consonant lengthening is not at all impossible. Moreover, there is evidence for consonant lengthening in previous studies. For Russian this evidence includes (a) cases of contrastive stress [12] and (b) phrase-final lengthening [13]. Therefore, our hypothesis that in word-initial position we might expect some consonant lengthening is based on the following observations:
1. evidence from other languages;
2. evidence for utterance-initial vowel lengthening in Russian;
3. evidence for consonant lengthening in other prosodically strong contexts.
In order to test this hypothesis, we have taken two steps. First we performed a corpus-based experiment. Then, based on those results, we recorded laboratory material – a set of phrases designed to eliminate other prosodic factors that are beyond this study but may influence consonant duration. At this stage we confined ourselves to the syllables /sa/, /ta/ and /sta/ as the most frequent syllables with a fricative and a plosive, and the respective consonant cluster. The decision to include the syllable /sta/ was also motivated by our interest in the phenomenon of consonant compression in longer syllables. This syllable is easier to interpret on a purely acoustic basis since it does not contain an articulatory overlap, as opposed to syllables such as /pl/, /dr/ etc.
2 Experiment 1

2.1 Method
At this stage we used a 30-h segmented speech corpus, CORPRES [14]. The corpus contains fictional texts recorded from 8 speakers, all native Russians with standard pronunciation. The recordings are manually segmented into sounds and annotated prosodically. In order to obtain a general impression of the processes governing consonant duration, we chose all 3-syllable words² which were produced in a prosodically neutral context – i.e. not under nuclear stress and not in the initial or final position within the intonational phrase. This way we eliminated the influence of other processes: lengthening caused by prosodic prominence and IP-boundary effects, which are beyond the present study.
Absolute duration values are hard to compare across speakers and across different consonant types. This is why we calculated consonant duration in normalized form using the following formula [15]:

d̃(i) = (d(i) − μ_p) / σ_p

where d̃(i) is the normalized duration of segment i, d(i) is its absolute duration, and μ_p and σ_p are the mean and standard deviation of the duration of the corresponding phone p. The mean and standard deviation values were calculated over the whole corpus for each speaker separately.

¹ The case of word-final vowels is more complicated – for a detailed discussion see [11].
² Three syllables is the minimal word length which makes it possible to compare pretonic syllables in word-initial and word-medial position.
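A sketch of this per-speaker, per-phone z-score normalization (the data layout and field names below are illustrative assumptions):

import numpy as np

def normalize_durations(segments):
    """segments: list of dicts with keys 'speaker', 'phone', 'duration' (in seconds)."""
    groups = {}
    for s in segments:
        groups.setdefault((s["speaker"], s["phone"]), []).append(s["duration"])
    stats = {k: (np.mean(v), np.std(v)) for k, v in groups.items()}
    for s in segments:
        mu, sigma = stats[(s["speaker"], s["phone"])]
        # small epsilon guards against degenerate groups with zero variance
        s["norm_duration"] = (s["duration"] - mu) / (sigma + 1e-9)
    return segments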
2.2 Results and Discussion
Figure 1 shows normalized duration values for consonants in CV syllables in four possible conditions:
1. Stressed syllable
(a) in word-initial position (e.g. /p/ from /pa/ in the Russian word meaning "apiary")
(b) in word-medial position (e.g. /d/ from /da/ in the word meaning "task")
2. First pretonic syllable
(a) in word-initial position (e.g. /p/ from /pa/ in the word meaning "bucket hat")
(b) in word-medial position (e.g. /k/ from /ka/ in the word glossed as the imperative "show")
Fig. 1. Normalized consonant duration in CV syllables in 3-syllable words; data are averaged across the whole corpus.
Fig. 2. Normalized duration of the first consonant in CV and CCV syllables in 3-syllable words in four contexts (initial stressed, medial stressed, initial pretonic, medial pretonic); data are averaged across the whole corpus.
For stressed syllables (see left pane in Fig. 1) the average normalized duration values were 0.544 for word-initial syllables and 0.024 for word-medial syllables. The difference was statistically significant (Welch's t-test, p < 0.001), and the sample sizes were 1156 and 2568, respectively. For pretonic syllables (see right pane in Fig. 1) the average normalized duration values were −0.189 for word-initial syllables and −0.326 for word-medial syllables. The difference was statistically significant as well (Welch's t-test).
D>N C>D>N C>N>D C>D>N C>D>N
Note: C – comfort, N – neutral, D – discomfort.
Table 2. Confusion matrices for emotion recognition by listeners, with ASD and MR informants breakdown.

Listeners  Emotional state   ASD informants           MR informants
                             Comf    Neut    Disc     Comf    Neut    Disc
Native     Comfort           20      39      41       37.5    38.5    24
           Neutral           37      37      26       20      56      24
           Discomfort        7       40      53       8       41      51
Foreign    Comfort           23      13      64       43      31      26
           Neutral           35      26      39       20      56      24
           Discomfort        12      18      70       15      45      40
The second task for the listeners was to assign each speech sample, when listening to the speech test sequence, to one of the six basic emotional states "fear – anger – sadness – neutral – joy – surprise", with the additional option "difficult to answer" (Fig. 6). The listeners attributed the greatest number of speech samples of ASD informants to the state of sadness, while the speech of the informants with MR was most often attributed to the neutral state. They identified the states of fear and surprise equally often in the speech of ASD and MR informants. The neutral state of MR informants was recognized by listeners better than the other states, F(2,39) = 3.3487, p < 0.0455 (Wilks' Lambda = 0.85344). In the task of determining the age of the informants from the speech samples, the listeners correctly estimated the age of 42% of ASD and 47% of MR informants, who are in the age range of 17–28 years. Some listeners incorrectly estimated the informants' age as lower than 16 years (for ASD – 6%, for MR – 18%), or higher than 35 years.
Fig. 6. Percentage of listener’s answer to attributed speech samples of ASD (black color) and MR (gray color) informants to emotional six states “fear – anger – sadness – neutral – joy – surprise” and difficult to answer.
The most frequent words in the speech of all informants were words that reflected a positive emotional state: /love – 0.43/ in the speech of ASD informants, 0.21 in the speech of MR informants; /like – 0.8/ in the speech of MR informants. Words reflecting the negative emotional state were absent in the situation of dialogue in informants with ASD, in MR informants were represented by the words: /bad – 0.08/, /unfortunately – 0.08/, /unpleasantly – 0.04/, /fearfully – 0.04/. The most frequent words reflected emotional state in the picture description situation were: /kind – 0.27/, /kind-hearted – 0.27/ in informants with ASD and /fight – 0.22/, /regret - 0.17/, /angry - 0.08/, /love 0.08/ in MR informants speech.
4 Discussion
The results of the study showed a worse level of speech formation in adults with ASD, in comparison with MR ones. In studying the speech features of ASD children we showed their ability to clearly pronounce vowels in words [11] and increase the clarity of articulation in learning with the age of children [5]. The listeners recognized the emotional state of informants with ASD worse than emotional state of children on the base of listening speech samples [14]. It is found that individuals with severe or profound intellectual disabilities may exhibit more subtle facial expressions of internal states, which are poorly interpreted by adults if they do not have experience of caring for, or communicating with this population [15]. The adults with intellectual disabilities may be vulnerable to deficiencies in the awareness and understanding of their emotional experience, problems with adequate relaying this information to others [16]. Studies aimed at identifying the relationship between the state of persons at an early age and in transition to adulthood began to be investigated [17, 18]. In special study about young people with intellectual disability transitioning to adulthood it was found that people with Down syndrome experience less behavioral problems than people with intellectual disability of another cause across all subscales of emotional and behavioral problems,
except for communication disturbance [19]. The transition to adulthood is of greatest concern to the parents of children with autism and the least concern for parents of chil‐ dren with Down syndrome [17]. Our study is the first step towards investigating the problem of the transition from childhood to adulthood for Russian people with atypical development. The findings from this study provide valuable information for health and other professionals working with people with intellectual disability.
5 Conclusions
We revealed specific speech features in informants with ASD and MR. For ASD informants, replies in dialogues are simple, complex replies are absent, "yes–no" answers predominate, they do not use gestures to substitute for or complement verbal answers, and their replies are less adequate than those of MR informants. More phonetic disturbances at the level of the word and the phrase were described for ASD informants than for MR ones. The articulation of unstressed vowels of ASD informants is clearer than the articulation of the stressed vowels, which causes difficulties in recognizing the meaning of the speech samples. Attribution of the emotional speech of the ASD informants to the states "comfort – neutral – discomfort" is difficult for listeners. Acknowledgements. This study is financially supported by the Russian Science Foundation (project 18-18-00063).
References 1. Klein Tasman, B.P., van der Fluit, F., Mervis, C.B.: Autism spectrum symptomatology in children with Williams syndrome who have phrase speech or fluent language. J. Autism Dev. Disord. (2018). https://doi.org/10.1007/s10803-018-3555-4 2. Chen, X., et al.: Speech and language delay in a patient with WDR4 mutations. Eur. J. Med. Genet. (2018). https://doi.org/10.1016/j.ejmg.2018.03.007 3. Walton, K.M., Ingersoll, B.R.: The influence of maternal language responsiveness on the expressive speech production of children with autism spectrum disorders: a microanalysis of mother-child play interactions. Autism 19(4), 421–432 (2005) 4. Fusaroli, R., Lambrechts, A., Bang, D., Bowler, D.M., Gaigg, S.B.: Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis. Autism Res. 10(3), 384– 407 (2017). https://doi.org/10.1002/aur.1678 5. Lyakso, E., Frolova, O., Grigorev, A.: Perception and acoustic features of speech of children with autism spectrum disorders. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 602–612. Springer, Cham (2017). https://doi.org/ 10.1007/978-3-319-66429-3_60 6. Saad, A.G., Goldfeld, M.: Echolalia in the language development of autistic individuals: a bibliographical review. Pro. Fono. 21(3), 255–260 (2009) 7. Lee, L., Rianto, J., Raykar, V., Creasey, H., Waite, L., Berry, A., Xu, J., Chenoweth, B., Kavanagh, S., Naganathan, V.: Health and functional status of adults with intellectual disability referred to the specialist health care setting: a five-year experience. Int. J. Fam. Med., Article ID 312492, 9 (2011). https://doi.org/10.1155/2011/312492
8. Taylor, J.L., Mailick, M.R.: A longitudinal examination of 10-year change in vocational and educational activities for adults with autism spectrum disorders. Dev. Psychol. 50(3), 699– 708 (2014). https://doi.org/10.1037/a0034297. pmid:24001150 9. Autism Spectrum Australia. We Belong: Investigating the experiences, aspirations and needs of adults with Asperger’s disorder and high functioing autism (2012) 10. Lyakso, E., et al.: EmoChildRu: emotional child Russian speech corpus. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 144–152. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7_18 11. Lyakso, E., Frolova, O., Grigorev, A.: A comparison of acoustic features of speech of typically developing children and children with autism spectrum disorders. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 43–50. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_4 12. Lyakso, E.E., Grigor’ev, A.S.: Dynamics of the duration and frequency characteristics of vowels during the first seven years of life in children. Neurosci. Behav. Physiol. 45(5), 558– 567 (2015). https://doi.org/10.1007/s11055-015-0110-z 13. Roy, N., Nissen, S.L., Dromey, C., Sapir, S.: Articulatory changes in muscle tension dysphonia: evidence of vowel space expansion following manual circumlaryngeal therapy. J. Commun. Disord. 42(2), 124–135 (2009). https://doi.org/10.1016/j.jcomdis.2008.10.001 14. Kaya, H., Salah, A.A., Karpov, A., Frolova, O., Grigorev, A., Lyakso, E.: Emotion, age, and gender classification in children’s speech by humans and machines. Comput. Speech Lang. 46, 268–283 (2017). https://doi.org/10.1016/j.csl.2017.06.002 15. Adams, D., Oliver, Ch.: The expression and assessment of emotions and internal states in individuals with severe or profound intellectual disabilities. Clin. Psychol. Rev. 31, 293–306 (2011). https://doi.org/10.1016/j.cpr.2011.01.003 16. McClure, K.S., Halpern, J., Wolper, P.A., Donahue, J.J.: Emotion regulation and intellectual disability. J. Dev. Disabil. 15(2), 38–44 (2009) 17. Blacher, J., Kraemer, B.R., Howell, E.J.: Family expectations and transition experiences for young adults with severe disabilities: does syndrome matter? Adv. Mental Health Learn. Disabil. 4(1), 3–16 (2010). https://doi.org/10.5042/amhld.2010.0052 18. Thompson, C., Bölte, S., Falkmer, T., Girdler, S.: To be understood: transitioning to adult life for people with autism spectrum disorder. PLoS ONE 13(3), e0194758 (2018). https:// doi.org/10.1371/journal.pone.0194758 19. Foley, K.-R., Taffe, J., Bourke, J., Einfeld, S.L., Tonge, B.J., Trollor, J., Leonard, H.: Young people with intellectual disability transitioning to adulthood: do behaviour trajectories differ in those with and without down syndrome? PLoS ONE 11(7), e0157667 (2016). https:// doi.org/10.1371/journal.pone.0157667
Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise Thomas Manzini(B) and Alan Black(B) Carnegie Mellon University, Pittsburgh, PA 15213, USA {tmanzini,awb}@cs.cmu.edu
Abstract. This paper explores how different synthetic speech systems can be understood in a noisy environment that resembles radio noise. This work is motivated by a need for intelligible speech in noisy environments such as emergency response and disaster notification. We discuss prior work done on listening tasks as well as speech in noise. We analyze three different speech synthesizers in three different noise settings. We measure quantitatively the intelligibility of each synthesizer in each noise setting based on human performance on a listening task. Finally, treating the synthesizer and its generated audio as a black box, we present how word level and sentence level input choices can lead to increased or decreased listener error rates for synthesized speech. Keywords: Speech
· Synthesized · Noise · Radio · Intelligibility

1 Introduction
Synthetic speech systems have undergone a great deal of research in the past years. Other research efforts have attempted to predict the intelligibility of different synthesizers in different settings [16,17]. However, to the author’s best knowledge, all work in this area has been done from the perspective of improving synthesized audio [9,11], rather than the synthesizer inputs themselves. This paper aims to determine if intelligibility can be predicted from the content fed to the synthesizer. In this work we explore how to predict if certain words and sentences will be understood by users and how these predictions can be used to formulate or reformulate a sentence for speech in a noisy environment. This is done by treating the synthesizer as a black box and measuring only the inputs and the outputs. Our work is specifically motivated by automated disaster response. Much work has been done using artificial intelligence to handle emergency and disaster situations [7,8]. The integration of speech is a necessary and natural expansion of this research. We foresee speech systems needing to operate in noisy environments where synthetic speech may be broadcast over a radio frequency or near rescue equipment. Both present multiple different issues regarding types of noise. In this work, we use the noisy environment of a radio channel as a test bed for intelligibility. c Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 367–376, 2018. https://doi.org/10.1007/978-3-319-99579-3_39
2 Related Work
Several works have explored multiple different types of noise and multiple different types of speech [16,17]. There are two relevant concepts within the field that are at play in this work: the field of speech in noise and the work surrounding listening tests.

2.1 Speech in Noise
The intelligibility of speech in noise is the measure of how well audio - containing either natural or synthetic speech - can be understood in a noisy environment. A noisy environment can range from the chatter of a restaurant [16] to the sounds of helicopters and the battlefield [18]. In all of these environments a listener may confuse or misinterpret speech because of noise. In past works, authors have shown several key concepts. First, when measuring the kinds of errors listeners make, [12] has shown that while keyword error rate (KER) may be a more accurate measure, simple word error rate (WER) follows KER closely and is less time intensive to calculate. As such, we use WER for our measurements. At the same time, [19] has shown that there are instances where there are disparities between WER and other metrics, such as concept error rate. We see in our data that WER tends to follow concept error rate.
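For concreteness, the WER used throughout this paper can be computed as a word-level Levenshtein distance normalized by the reference length. The following is a small illustrative sketch, not the authors' actual scoring code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one word of five is missing from the transcription -> WER = 0.2.
print(word_error_rate("turn on the kitchen light", "turn on the light"))
```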
2.2 Listening Tests
Listening tests are a common way to evaluate the intelligibility of a voice [10,13,15]. Compared to automated methods and metrics, human evaluation is traditionally regarded as the most effective method for evaluation. As a result, listening tests have been used to evaluate synthesizers and the intelligibility of both synthetic and natural speech in noise.
3 Experimentation
We explore the effect of three different types of noise on three different synthesizers. This is an attempt to understand how humans understand different synthesizers generally, as opposed to possibly overfitting to one synthesizer or one noise setting.
3.1 Structure
We asked English-speaking listeners who were over eighteen years old to transcribe audio from a series of thirty different audio files. These audio files were generated by selecting random sentences from the Smart-Home dataset [14], having one of three different synthesizers generate an associated audio file, and then applying one of three different noise levels to the audio. The result was captured and stored for the listening task. This was done thirty times for each listening task,
resulting in thirty unique audio files for each listener per task. While our research is motivated by an emergency/disaster response use case, we selected this dataset because it has a demonstrably diverse vocabulary that would, in theory, lend itself well to determining the intelligibility of various words. During the initial stages of this research we did explore other datasets, including one of radio traffic between emergency medical services (EMS) personnel and their dispatch center, but found that it was not lexically diverse enough for the purposes of this research. Two different types of listening tasks were performed: one to generate training data and another for testing data. In the training task, forty-five listeners listened to 450 different audio files. Each file was labeled by three different listeners. In the testing task, fifty listeners listened to 150 different audio files. Each file was labeled by ten different listeners. These two different tasks were done so that the test data would be the most representative of the behavior of users in a noisy environment.
3.2 Synthesizers
We used three different synthesizers for our experimentation: the E-Speak Synthesizer [5], the Flite Synthesizer [1], and the Google Synthesizer [6]. All synthesizers used an English-speaking male voice, but these three synthesizers each have their own specific settings.

E-Speak. We used the E-Speak Synthesizer with primarily the default settings. We specified two unique settings when generating our sound files: the use of the en voice, which corresponds to an English-speaking male, and the use of a voice speed of level 120 (down from 175). This was done to better align the speeds of the voices of the different synthesizers.

Flite. We used the CMU Flite (Festival Lite) synthesizer with the default settings. We specified that the synthesizer must use the cmu_us_eey.flitevox voice that came prepackaged with the standard release of Flite.

Google. We used the Google Text to Speech system defined within the Python gTTS module. We specified that the synthesizer must use the en-us voice that came with the release of gTTS. All other settings were left at default values.
3.3 Noise
We used three different noise levels, each consisting of three different filters applied with different values. First we impose an ambient noise filter designed to replicate radio static. For this filter, we take the original sound and at each time step sample a value from a random normal distribution centered at the original sound. The standard deviation of this normal distribution was varied at each noise level. Next we perform a low pass filter with a variable threshold. Finally we perform a high pass filter with a variable threshold. Varying the parameters of these three different filters provides several knobs we can turn to increase or decrease the noise within the audio files. We make no claim about how well these different filters replicate the noise present on a radio channel, as that can vary based on the radio manufacturer, the frequency used, and the type of system in use. We only state that this noise is subjectively similar to that of an active radio channel. Further work would be required to determine the best noise filters needed to replicate each specific radio channel. We chose three different noise settings that would be presented to users. These noise settings were not intended to be ranked by difficulty, but were intended to represent three distinct kinds of noise that could cause a listener to make transcription errors. We believe that the reasons why certain noise settings are more likely to cause listening errors are out of the scope of this work and could be the subject of further research.

Noise Level 1. Random noise filter standard deviation: 0.3; low pass frequency cutoff: 300.0 Hz; high pass frequency cutoff: 2500.0 Hz.
Noise Level 2. Random noise filter standard deviation: 0.4; low pass frequency cutoff: 400.0 Hz; high pass frequency cutoff: 2000.0 Hz.
Noise Level 3. Random noise filter standard deviation: 0.5; low pass frequency cutoff: 500.0 Hz; high pass frequency cutoff: 1500.0 Hz.
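The exact filter implementations are not specified beyond these parameters, so the filter type (Butterworth), its order, and the way the two cutoffs are cascaded in the sketch below are our own assumptions; it is only meant to illustrate the structure of the noise chain:

```python
import numpy as np
from scipy.signal import butter, lfilter

def apply_radio_noise(audio, fs, noise_std, low_cut_hz, high_cut_hz, order=4):
    """Radio-like degradation as described above: Gaussian 'static' centered
    on the signal, followed by a low-pass and a high-pass filter."""
    # 1) At each sample, draw from a normal distribution centered at the original signal.
    noisy = np.random.normal(loc=audio, scale=noise_std)
    # 2) Low-pass filter with the given cutoff (assumed Butterworth).
    b, a = butter(order, low_cut_hz / (0.5 * fs), btype="low")
    noisy = lfilter(b, a, noisy)
    # 3) High-pass filter with the given cutoff.
    b, a = butter(order, high_cut_hz / (0.5 * fs), btype="high")
    return lfilter(b, a, noisy)

# Noise Level 1 from above: std 0.3, low-pass cutoff 300 Hz, high-pass cutoff 2500 Hz.
fs = 16000
audio = 0.1 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # placeholder 1-second waveform
degraded = apply_radio_noise(audio, fs, 0.3, 300.0, 2500.0)
```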
4 Listening Test Results
We presented users with different audio files and recorded their precision/word error rate. We include the complete breakdown of user performance below in Table 1. For all experiments we segment the data based on both synthesizer and noise level. We collected approximately fifty different sentences at each noise level and synthesizer combination for training and approximately sixteen different sentences at each noise level for testing.

Table 1. Precision (1.0 - WER) of word level transcription per noise and synthesizer on the training data.

Transcription precision score
          Noise level 1   Noise level 2   Noise level 3   Average
Espeak    0.227           0.196           0.242           0.222
Flite     0.346           0.375           0.343           0.355
Google    0.542           0.639           0.559           0.580
Average   0.372           0.403           0.381
4.1 Listening Test Results Discussion
Table 1 summarizes the transcription precision (1.0 - WER) of the different synthesizers and noise levels. We see a clear trend among the different synthesizers: the highest performing synthesizer was the Google synthesizer, followed by Flite, and finally by ESpeak. These results indicate that the quality of the synthesizer plays a role in its intelligibility in noise. Empirically, it appears that Noise Setting 2 is the most intelligible noise setting, followed by Setting 3, and then Setting 1. It should be noted that the results for settings 1 and 3 are very similar and differ only slightly. At the same time there does appear to be some variation between the different synthesizers. While Noise Setting 2 is the least intelligible for Espeak, it is the most intelligible for Google. From this observation, we can conclude that the intelligibility of a given synthesizer in a given noise setting depends primarily on the synthesizer and not on the noise setting. We make no claim regarding why certain noise settings are more intelligible than others; we believe this to be an avenue for further research.
5 Predictive Results
We make intelligibility predictions at the sentence level and the word level. At the sentence level, a model could estimate which paraphrasings are most likely to be understood by listeners. At the word level, a model could rank synonyms of specific words so that they are more likely to be understood. At both levels of granularity we explore the application of point-wise and pair-wise ranking for estimating intelligibility. While list-wise reranking is an obvious extension of this work, we do not have enough sentence-level or word-level data to make list-wise reranking models feasible. We present the results of this predictive exercise below. Work in this field often uses metrics such as the DAU metric [3] or the Glimpse proportion measure [2] to attempt to model intelligibility. These metrics are based on the audio features of the synthesizer. Since we attempt to predict intelligibility based on non-audio features, these metrics are out of the scope of this work.
5.1 Sentence Level Intelligibility Prediction
At the sentence level, we try to determine if one sentence is more intelligible than another. We explore this in two ways: first, we trained a machine learning model to estimate the average word error rate of a given sentence, and second, we trained a pair-wise reranking model to attempt to determine if one particular sentence is more intelligible than another. Sentence Level Word Error Rate Estimation. At the sentence level we attempt to train a machine learning model to predict the average word error rate of a given sentence. In order to do this, we construct a feature vector that
contains a number of sentence level features. We trained a simple linear model with sigmoid activation. We found that we achieved the best results when using simple models. We used several different features to estimate the word error rate of a sentence but few were effective given the low amount of data. Our features included average word rank, average word length, sentence length, word count, and percent of unique characters. We define word rank as the ranked position of a term, based on how frequently that term appears in the Corpus of Contemporary American English [4]. We define word length as the length of a particular word in characters. We define average word rank and average word length as the average of these respective values. Other features were explored but eventually discarded. The results of this model on the test set are presented in Table 2.

Table 2. Performance of our linear error estimator for sentence level error estimation (point-wise reranking - sentences).

Synthesizer   Noise level   MSE (Test)   Spearman's R (Test)
Espeak        1             0.0227        0.2258
Espeak        2             0.0256       -0.3728
Espeak        3             0.0311       -0.4650
Flite         1             0.0477       -0.1225
Flite         2             0.0451        0.4621
Flite         3             0.0587       -0.1863
Google        1             0.0505        0.1176
Google        2             0.0915       -0.2943
Google        3             0.0701        0.0662
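As an illustration only, the feature vector and point-wise estimator described above could be sketched as follows; the weights are untrained placeholders and the word-rank table is assumed to come from a frequency list such as COCA, so none of the numbers below are the paper's learned values:

```python
import numpy as np

def sentence_features(sentence, word_rank):
    """Features from Sect. 5.1: average word rank, average word length,
    sentence length (characters), word count, percent of unique characters."""
    words = sentence.lower().split()
    ranks = [word_rank.get(w, len(word_rank) + 1) for w in words]  # unseen words get a large rank
    return np.array([
        np.mean(ranks),
        np.mean([len(w) for w in words]),
        len(sentence),
        len(words),
        len(set(sentence)) / max(len(sentence), 1),
    ])

def predict_wer(features, weights, bias):
    """Linear model with sigmoid activation, keeping the WER estimate in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

# Hypothetical usage with placeholder parameters.
word_rank = {"turn": 512, "on": 23, "the": 1, "light": 1450}   # toy rank table
x = sentence_features("Turn on the light", word_rank)
w, b = np.zeros(5), 0.0                                        # would be learned from the training WERs
print(predict_wer(x, w, b))                                    # 0.5 with the zero placeholder weights
```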
Pair-Wise Sentence Reranking. We constructed a linear model with tanh activation to estimate which sentence is the more intelligible one. We do this by feeding one feature representation of each sentence into the linear model. The model then estimates whether the first sentence is more intelligible (labeled +1), the second sentence is more intelligible (labeled -1), or the intelligibility of the sentences is equal (labeled 0). We then train this linear model and evaluate it on the test set. The results of this evaluation are presented in Table 3.

Table 3. Performance of our linear pair-wise sentence level reranking model (pair-wise reranking - sentences).

Synthesizer   Noise level   MSE (Test)   MSE variance (Test)
Espeak        1             0.3441       0.1211
Espeak        2             0.5960       0.1217
Espeak        3             0.4641       0.2591
Flite         1             0.7074       0.1496
Flite         2             0.3979       0.351
Flite         3             0.6095       0.3688
Google        1             0.6679       0.1638
Google        2             0.4076       0.1440
Google        3             0.5080       0.1966

5.2 Word Level Intelligibility Prediction
At the word level we are attempting to estimate which words would be most intelligible, either on their own, or when compared to another word. For use in a real world setting, the models discussed here could be used to estimate the intelligibility of synonyms of different words in a sentence so as to maximize the intelligibility of the sentence overall. As a result, estimating which words are going to be the most intelligible is an obvious initial step to estimating the overall intelligibility of a sentence, phrase, or other unit of speech.

Word Error Rate Estimation. Working at the word level we have access to significantly more data. Here we trained a machine learning model to attempt to estimate the WER of a particular word in a given sentence. This is different from the sentence level task of the same name because we have features for the word, but also features for the context of the word (e.g. the surrounding words). We construct a linear model with sigmoid activation to attempt to estimate the error. We used several different word level features regarding the words themselves and their surrounding contexts. Our word level features included: word rank, percent of vowels in the word, percent of consonants in the word, length of the word, and the percent of unique characters in the word. We define word rank in the same manner described in Sect. 5.1. Our context level features included: the same word level features for both the previous and next word, the total number of words in the sentence, and the number of unique words in the sentence. The results of this evaluation can be found in Table 4.

Pair-Wise Word Reranking. To perform pairwise reranking, we changed the layout of our model slightly. We now pass two times the number of features to our model, one set for the first word and one for the second word. The word level features that are fed to the model are similar to those in the section above, but they have had the features regarding sentence context removed and contain only features regarding the neighboring words. Like the sentence pair-wise reranking schema, the model then has to estimate whether the first word is more intelligible (labeled +1), the second word is more intelligible (labeled -1), or the intelligibility of the two words is equal (labeled 0). We trained this linear model and evaluated it on the test set. The results are presented in Table 5.
Table 4. Performance of our linear error estimator for word level error estimation (point-wise reranking - words).

Synthesizer   Noise level   MSE (Test)   Spearman's R (Test)
Espeak        1             0.0429       -0.2516
Espeak        2             0.0426        0.0676
Espeak        3             0.0318       -0.1120
Flite         1             0.0932       -0.1612
Flite         2             0.1032        0.1346
Flite         3             0.0468       -0.1589
Google        1             0.0902        0.2173
Google        2             0.1182        0.3114
Google        3             0.0801        0.0178
Table 5. Performance of our linear pair-wise word level reranking model (pair-wise reranking - words).

Synthesizer   Noise level   MSE (Test)   MSE variance (Test)
Espeak        1             0.1946       0.1151
Espeak        2             0.1546       0.0901
Espeak        3             0.1610       0.0959
Flite         1             0.1896       0.0972
Flite         2             0.2022       0.1167
Flite         3             0.2009       0.1052
Google        1             0.1783       0.0941
Google        2             0.1794       0.1031
Google        3             0.1840       0.1076
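To make the pair-wise word setup concrete, one possible (hypothetical) realization concatenates two word-level feature vectors and pushes a tanh output towards +1, -1 or 0; the feature set is simplified here and the weights are placeholders rather than fitted values:

```python
import numpy as np

def word_features(word, word_rank):
    """Simplified word-level features: rank, vowel/consonant ratios, length, unique characters."""
    vowels = sum(c in "aeiou" for c in word)
    return np.array([
        word_rank.get(word, len(word_rank) + 1),
        vowels / len(word),
        (len(word) - vowels) / len(word),
        len(word),
        len(set(word)) / len(word),
    ])

def pairwise_score(w1, w2, weights, bias, word_rank):
    """Linear model with tanh activation over the concatenated pair of feature vectors;
    +1 means the first word is predicted to be more intelligible, -1 the second, ~0 equal."""
    x = np.concatenate([word_features(w1, word_rank), word_features(w2, word_rank)])
    return np.tanh(x @ weights + bias)

# Hypothetical usage; the weights would be fitted on pairs labelled from the observed WERs.
word_rank = {"sofa": 3200, "couch": 2100}
weights, bias = np.zeros(10), 0.0
print(pairwise_score("sofa", "couch", weights, bias, word_rank))  # 0.0 with placeholder weights
```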
6 Results Discussion
From our data, we can see that the sentence level error estimation and pair-wise reranking methods are ineffective at the current data scale. For the Spearman's correlation we can see that there is no consistent behavior between the different error models. In the case of the pair-wise reranking for sentences we still see poor performance. Not only is the MSE fairly high for a problem like this, but the variance of the MSE is also much larger than would be anticipated. We believe that these are problems that could be solved with additional data, but at the moment our models are not capable of performing this task at the sentence level. For the word level point-wise reranking we can estimate intelligibility for the Google synthesizer to some extent. This is indicated by the Spearman's correlation, which is either positive or near zero. However, this is not the case for the other synthesizers. The pair-wise word reranking is more stable than the sentence level reranking. For all synthesizers and for all noise levels we see model behavior indicative of estimating the correct word in a reranking context. At the same time, the variances of the MSE are within a reasonable bound and upon closer inspection we do not see many outliers that could skew these results. Based on our results and given the data that we have presented here, we find that we are able to rerank individual terms based on lexical features to estimate their intelligibility. Our results demonstrate that this methodology works best in the pairwise reranking context at this particular data scale. We believe that additional labeled data will improve performance.
Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise
375
level reranking. For all synthesizers and for all noise levels we see model behavior indicative of estimating the correct word in a reranking context. At the same time, the variances of the MSE are within a reasonable bound and upon closer inspection we do not see a many outliers that could skew these results. Based on our results and given the data that we have presented here, we find that we are positively able to rerank individual terms based on lexical features to estimate their intelligibility. Our results demonstrate that this methodology works best in the pairwise reranking context for this particular data scale. We believe that additional labeled data will improve performance.
7 Future Work
The most significant piece of future work is collecting more data with intelligibility labels for different noise settings and synthesizers. Additional evaluations of different features and different models for predicting intelligibility at the lexical level would also be useful.
8 Conclusion
This work has explored the intelligibility of three different synthesizers in three different noise settings. We have evaluated these synthesizers in these noise settings on a human listening task and we have measured performance along metrics that reflect intelligibility. Further, we have explored methods that show some predictive power regarding intelligibility at the lexical level. We show that even with limited data it is possible to rerank words and estimate which word will be more intelligible in a given context.

Acknowledgments. We would like to acknowledge several people for their help and support on this work, particularly Carolyn Penstein, Rajat Kulshreshtha, Abhilasha Ravichander, and the officers of CMU EMS, as well as the several people who helped edit this work, especially Elise Romberger. Finally, we thank the reviewers for reading and examining our experiments, methodology, and submission.
References 1. Black, A.W., Lenzo, K.A.: Flite: a small fast run-time synthesis engine. In: 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis (2001) 2. Cooke, M.: A glimpsing model of speech perception in noise. J. Acoust. Soc. Am. 119(3), 1562–1573 (2006) 3. Dau, T., P¨ uschel, D., Kohlrausch, A.: A quantitative model of the “effective” signal processing in the auditory system. i. model structure. J. Acoust. Soc. Am. 99(6), 3615–3622 (1996) 4. Davies, M.: The corpus of contemporary American English (Coca): 450 million words, 1990–2012. Brigham Young University (2002) 5. Duddington, J.: eSpeak text to speech (2012)
6. Durette, P.N.: gTTS: a python interface for google’s text to speech api (2017). https://github.com/pndurette/gTTS. Accessed 15 Apr 2018 7. Fiedrich, F., Burghardt, P.: Agent-based systems for disaster management. Commun. ACM 50(3), 41–42 (2007) 8. Imran, M., Castillo, C., Lucas, J., Meier, P., Vieweg, S.: AIDR: Artificial intelligence for disaster response. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 159–162. ACM (2014) 9. Kamath, S., Loizou, P.: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In: ICASSP, vol. 4, pp. 44164–44164. Citeseer (2002) 10. Killion, M.C., Niquette, P.A., Gudmundsen, G.I., Revit, L.J., Banerjee, S.: Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 116(4), 2395– 2405 (2004) 11. McAulay, R., Malpass, M.: Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process. 28(2), 137–145 (1980) 12. Park, Y., Patwardhan, S., Visweswariah, K., Gates, S.C.: An empirical analysis of word error rate and keyword error rate. In: Ninth Annual Conference of the International Speech Communication Association (2008) 13. Pichora-Fuller, M.K., Schneider, B.A., Daneman, M.: How young and old adults listen to and remember speech in noise. J. Acoust. Soc. Am. 97(1), 593–608 (1995) 14. Ravichander, A., Manzini, T., Grabmair, M., Neubig, G., Francis, J., Nyberg, E.: How would you say it? eliciting lexically diverse dialogue for supervised semantic parsing. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 374–383 (2017) 15. Schmidt-Nielsen, A.: Intelligibility and acceptability testing for speech technology. Technical report, Naval Research Lab, Washington DC (1992) 16. Valentini-Botinhao, C., Yamagishi, J., King, S.: Can objective measures predict the intelligibility of modified hmm-based synthetic speech in noise? In: Twelfth Annual Conference of the International Speech Communication Association (2011) 17. Valentini-Botinhao, C., Yamagishi, J., King, S.: Evaluation of objective measures for intelligibility prediction of hmm-based synthetic speech in noise. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5112–5115. IEEE (2011) 18. Varga, A., Steeneken, H.J.: Assessment for automatic speech recognition: Ii. noisex92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993) 19. Wang, Y.Y., Acero, A., Chelba, C.: Is word error rate a good indicator for spoken language understanding accuracy. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003, pp. 577–582. IEEE (2003)
End-to-End Speech Recognition in Russian

Nikita Markovnikov, Irina Kipyatkova, and Elena Lyakso
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), Saint-Petersburg, Russia
[email protected], [email protected], [email protected]
Abstract. End-to-end speech recognition systems incorporating deep neural networks (DNNs) have achieved good results. We propose applying CTC (Connectionist Temporal Classification) models and an attention-based encoder-decoder to automatic recognition of continuous Russian speech. We used different neural network models, such as Long Short-Term Memory (LSTM), bidirectional LSTM and Residual Networks, in our experiments. We obtained recognition accuracy slightly worse than hybrid models, but our models can work without a large language model and they showed better performance in terms of average decoding speed, which can be helpful in real systems. Experiments are performed with an extra-large vocabulary (more than 150K words) of Russian speech.

Keywords: End-to-end models · Deep learning · Russian speech · Speech recognition

1 Introduction
Automatic speech recognition (ASR) systems are traditionally built from an acoustic model (AM), implemented with hidden Markov models (HMM) and Gaussian mixture models (GMM), and a language model (LM). These models show good recognition accuracy, but they consist of multiple parts that are tuned independently, so errors in one part can cause errors in another. Also, standard recognition scenarios need a large amount of memory and computing capacity, which does not allow such systems to be used locally on devices and requires remote computation on servers. An end-to-end approach has recently been adopted using deep neural networks (DNN). This approach makes it possible to implement models easily, using only one neural network trained with gradient descent and a single loss function. End-to-end models often demonstrate better performance in terms of speed and accuracy. These models potentially require less memory, which allows them to be used locally on mobile devices. However, they need more training data to be trained properly. Our goal was to build end-to-end models for recognition of continuous Russian speech, to tune them and to compare them with hybrid baseline models in terms of recognition accuracy and computing characteristics such as training and decoding speed.
The performance of the models was evaluated in terms of word error rate (WER) and character error rate (CER). The rest of the paper is organized as follows. In Sect. 2, we survey related works. In Sect. 3, we describe the architectures of CTC-models and attention-based encoder-decoder models. In Sect. 4, we discuss our experimental setup and datasets. In Sect. 5, we describe the implementation details of the models. In Sect. 6 we present the results obtained with the trained models. In Sect. 7 we provide a short analysis of the results. Finally, we conclude and discuss future work in Sect. 8.
2 Related Work
Papers [21,23] present an end-to-end system created with the help of the Eesen toolkit [21], where decoding of CTC-models is performed using weighted finite-state transducers (WFST) [22]. The Eesen implementation provides an effective search using the OpenFST library [1]. In paper [23], Eesen bidirectional LSTM [12] (BLSTM) neural networks were used for speech recognition for Serbian, which belongs to the same language group as Russian. They achieved a WER of 14.68%, which is not the best result compared with hybrid systems, but the CER was 3.68%. In paper [26] an end-to-end model using CTC was described. It was shown that such a system is able to work well without a LM. The training dataset [19] was made up of audio tracks of YouTube videos with a duration of more than 650 h. The best WER without using a LM was 13.9%, and with a LM it was 13.4%. In paper [3] an attention-based model using a LM was proposed. A WFST was used to integrate the end-to-end model with the language model, and at the decoding step a beam search over the outputs was performed that combined the scores of the encoder-decoder model and the LM. They obtained a WER of 11.3% and a CER of 4.8%. Independently, a similar attention-based end-to-end model called Listen, Attend and Spell (LAS) was proposed in [4]. The encoder was a pyramidal BLSTM and the decoder used a stack of LSTMs. The model outputs were also rescored using a LM after the decoding step. On a Google Voice Search task, the WER was 10.3%. As we can see, end-to-end models can work well with or without a LM for languages with strict word order (e.g. English). The Russian language is characterized by a high degree of grammatical freedom and a complex mechanism of word formation, so we need to use external LM models to increase accuracy. To our knowledge, end-to-end models have not yet been applied to Russian, so we decided to develop them.
3 Model Architecture

3.1 CTC
CTC [9] is a loss function that allows recurrent neural networks to be trained without an initial alignment of the input and output sequences. The output layer contains one unit per output symbol plus one special blank symbol. The output vector w_m is normalized using the softmax function, which is interpreted as the probability of the k-th symbol appearing at time m:

$$P(k, m \mid x) = \frac{\exp(w_m^k)}{\sum_{k'} \exp(w_m^{k'})},$$

where x denotes the input feature sequence of length T and w_m^k denotes the k-th component of w_m. Let α denote a sequence of indices of blanks and symbols of length T corresponding to x. Then P(α|x) can be written as

$$P(\alpha \mid x) = \prod_{t=1}^{T} P(\alpha_t, t \mid x).$$

Let B be an operator that removes repetitions of symbols and then blanks. The probability of an output sequence w is then

$$P(w \mid x) = \sum_{\alpha \in B^{-1}(w)} P(\alpha \mid x).$$

This sum can be computed using dynamic programming, and the neural network is trained to minimize the CTC loss

$$CTC(x) = -\log P(w^* \mid x),$$

where w^* denotes the target sequence. In [9], a forward-backward algorithm for computing the gradient of the CTC function was also proposed. Decoding can be performed using the approximation

$$\operatorname*{arg\,max}_{w} P(w \mid x) \approx B(\alpha^*),$$

where α^* = arg max_α P(α|x). In [10], a decoding method based on a beam search algorithm that allows integrating a language model was proposed.
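The operator B and the greedy approximation above can be illustrated with a few lines of NumPy; this is a simplification (real decoders use the beam search of [10]) and the toy alphabet below is hypothetical:

```python
import numpy as np

BLANK = 0  # index of the special blank symbol

def B(path, blank=BLANK):
    """Collapse repeated symbols, then remove blanks (the operator B above)."""
    collapsed = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in collapsed if s != blank]

def greedy_ctc_decode(log_probs):
    """Approximate argmax_w P(w|x) by B(alpha*), where alpha* takes the most
    probable symbol at every frame. log_probs has shape (T, num_symbols)."""
    alpha_star = np.argmax(log_probs, axis=1)
    return B(alpha_star.tolist())

# Toy example: 6 frames, 4 symbols (0 = blank, 1..3 = characters).
log_probs = np.log(np.array([
    [0.6, 0.3, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.6, 0.1, 0.2, 0.1],
]))
print(greedy_ctc_decode(log_probs))  # [1, 3]
```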
3.2 Attention-Based Encoder-Decoder Model
Encoder-Decoder. Encoder-decoder networks are used for problems where the lengths of the input and output sequences are variable [6,27]. The encoder is a neural network that transforms the input x = (x_1, ..., x_L) into an intermediate state h = (h_1, ..., h_L) and extracts features. The decoder is usually a recurrent neural network (RNN) that uses the intermediate state for generating the output sequence. The encoder can be any neural network, such as a multilayer perceptron (MLP), LSTM, BLSTM, convolutional network (CNN) [17], etc.

RNN with Attention Mechanism. In paper [7], the use of an Attention-based Recurrent Sequence Generator (ARSG) as a decoder was proposed. An ARSG is an RNN that stochastically generates an output sequence (y_1, ..., y_T) using the input h of length L. It consists of an RNN and a subnetwork called the attention mechanism. The attention mechanism chooses a subsequence of the input and then uses it for updating the hidden states of the RNN and predicting the next output. At the i-th step, the ARSG generates an output y_i focusing on separate components of h:

$$\alpha_i = \mathrm{Attend}(s_{i-1}, \alpha_{i-1}, h),$$
$$g_i = \sum_{j=1}^{L} \alpha_{i,j} h_j,$$
$$y_i \sim \mathrm{Generate}(s_{i-1}, g_i),$$

where s_{i-1} is the (i-1)-th state of the RNN called the Generator, α_i ∈ R^L denotes the attention weights and the vector g_i is called a glimpse. The step finishes with the computation of a new generator state:

$$s_i = \mathrm{Recurrency}(s_{i-1}, g_i, y_i).$$
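As a sketch of what one such step computes, the snippet below uses purely content-based scoring (e_{i,j} = v·tanh(W s_{i-1} + U h_j)); the location-aware Attend of [7] additionally conditions on the previous weights α_{i-1}, which is omitted here, and all matrices are random placeholders rather than trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(s_prev, h, W, U, v):
    """Content-based attention: score every encoder state h_j against s_{i-1}."""
    scores = np.array([v @ np.tanh(W @ s_prev + U @ h_j) for h_j in h])
    return softmax(scores)          # alpha_i, shape (L,)

def attention_step(s_prev, h, W, U, v):
    alpha = attend(s_prev, h, W, U, v)
    g = alpha @ h                   # glimpse g_i = sum_j alpha_{i,j} h_j
    return alpha, g                 # Generate/Recurrency (an RNN cell) would then use s_prev and g

# Toy dimensions: L = 5 encoder states of size 8, decoder state of size 8, attention size 16.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
s_prev = rng.normal(size=8)
W, U, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=16)
alpha, g = attention_step(s_prev, h, W, U, v)
print(alpha.sum(), g.shape)         # 1.0 (the five weights sum to one), (8,)
```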
4 Experimental Setup

4.1 Dataset
In this work, we use the training speech corpus collected at SPIIRAS as in [16]. The corpus consists of three parts:

– recordings of 50 native Russian speakers, each of whom pronounced a set of 327 phrases;
– recordings of 55 native Russian speakers, each of whom pronounced 105 phrases;
– the audio part of the audio-visual speech corpus HAVRUS [28]; 20 native Russian speakers (10 male and 10 female) participated in the recordings, each of them pronouncing 200 Russian phrases.

The total duration of the entire speech corpus is more than 30 h. To test the system, we used a speech database of 500 phrases pronounced by 5 speakers. The phrases were taken from the materials of the Russian online newspaper "Fontanka.ru"1 and were not used in the training data. Our language model was trained on data from Russian news sites [15]. The dataset for training the language model contains approximately 300 million words. As a language model, an n-gram model with Kneser-Ney smoothing [5] with n = 2 and n = 3 was used. The vocabulary size was 150,000 collocations.
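For reference, the interpolated Kneser-Ney estimate for a bigram takes the following standard form (the exact variant and discount d used for our models may differ):

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(c(w_{i-1} w_i) - d,\ 0\right)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i),$$

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{w : c(w_{i-1} w) > 0\}\right|, \qquad P_{cont}(w_i) = \frac{\left|\{w : c(w\, w_i) > 0\}\right|}{\left|\{(w, w') : c(w\, w') > 0\}\right|},$$

where c(·) denotes training counts.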
4.2 Baseline
The baseline consists of hybrid DNN-HMM acoustic models implemented using the Kaldi [24] and CNTK2 toolkits, as in [20]. A bigram language model was used for decoding. The best results, shown by BLSTM, ResNet [11] and RCNN [18], are presented in Table 1.

1 http://www.fontanka.ru/
2 https://docs.microsoft.com/ru-ru/cognitive-toolkit/
Table 1. The best results of baseline models in terms of WER, average speed of training (features per second) and decoding (utterances per second).

Model                                        WER, %   Decode   Train
BLSTM                                        23.08    0.211    450.7
ResNet                                       22.17    0.105    121.4
RCNN                                         22.56    0.162    325.1
RCNN + residual unit + max-pooling + BLSTM   22.07    0.197    502.3
5 Implementation
Firstly, a simple speech recognition system using the CTC loss function was implemented using TensorFlow3. Code and details can be found in a repository4. Secondly, we used the Eesen [21] toolkit on its TensorFlow branch, where Kaldi methods use the TensorFlow neural network implementation. That system allows using language models in the Kaldi format without additional conversion. Thirdly, we used the Tensor2Tensor5 framework to conduct experiments with attention-based models. That framework provides a common approach to building sequence-to-sequence models, in particular speech recognition systems. All models were tuned on a modestly configured machine: experiments were run on an NVIDIA GeForce GT 730M with 2 GB of memory using the CUDA library, and the available CPU memory was 16 GB with 4 cores.
6 Results

6.1 Results Corresponding to CTC-models
Firstly, we will describe the results corresponding to the CTC loss function-based models; they are presented in Table 2. In the experiments we used two types of features (extracted from audio with 1 channel and a 16,000 Hz sampling frequency):

1. 13-dimensional Mel Frequency Cepstral Coefficient (MFCC) [8] features, extracted using a window length of 0.025 and a window step of 0.01, together with 3 additional features representing pitch with their first and second-order derivatives, normalized via mean subtraction and variance normalization;
2. 40-dimensional filterbank [25] features with the same properties.
3 https://www.tensorflow.org/
4 https://github.com/mikita95/asr-e2e
5 https://github.com/tensorflow/tensor2tensor
The whole training data were split into training (95%) and cross-validation (5%) sets. We used several neural network types, such as MLP, LSTM, bidirectional LSTM and ResNet. The settings of the neural networks that provided the best results without using any LM were as follows:

– MLP had 4 hidden layers with 512 nodes using the ReLU activation function, with an initial learning rate of 0.007 and a decay factor of 1.5.
– LSTM had 4 layers with 512 units each and dropout equal to 0.5, with an initial learning rate of 0.001 and a decay factor of 1.5.
– ConvLSTM used convolutional layers before the LSTM described above to simplify the input features. It has one 2D convolutional layer with 8 filters, a 2 × 2 kernel, no padding and the ReLU activation function, followed by dropout with keep probability 0.5; then an LSTM with 4 layers of 128 units each and dropout equal to 0.5.
– BLSTM used 4 layers with 512 units and dropout with keep probability 0.5.
– ResNet had the architecture presented in Fig. 1, with 9 residual blocks and batch normalization [13].
Fig. 1. ResNet.
The Momentum algorithm was used for the optimization with momentum equal to 0.9. The OpenFST library was used to create the WFST for decoding models in the Eesen toolkit. Similarly to [23], the system components, i.e. CTC labels (T), lexicon (L) and language model (G), were transformed into one search graph as follows:

$$TLG = T \circ \min(\mathrm{det}(L \circ G)),$$

where min denotes minimization, det denotes determinization and ◦ denotes composition. For the 3-gram LM it was difficult to perform the composition because of a lack of memory and long computation times, so we did not conduct these experiments with every model.
Table 2. CTC-models results.

Model                              CER, %   WER, %   Decode   Train
Models without using LM, MFCC features, implementation (1)
MLP                                55.42    71.64    0.252    96.7
LSTM                               38.58    52.47    0.266    304.9
Conv+LSTM (+L2-weight delay)       36.92    49.23    0.278    315.2
BLSTM                              36.73    48.86    0.282    391.7
ResNet                             35.69    48.24    0.267    142.6
Models using 2-gram LM, MFCC features, implementation (2)
MLP                                48.49    62.04    0.174    125.3
LSTM                               26.12    38.71    0.181    402.8
Conv+LSTM                          25.77    36.93    0.193    391.1
BLSTM                              22.98    35.21    0.102    407.2
ResNet                             22.35    34.96    0.173    293.2
Models using 3-gram LM, FBANK features, implementation (2)
MLP                                26.57    37.19    0.098    104.7
BLSTM                              15.79    26.17    0.026    381.5
ResNet                             14.96    25.53    0.083    201.9
6.2 Results Corresponding to Attention-Based Models
In this section we describe the results corresponding to the attention-based models that we used. To tune these models, we used the MFCC-type features that we extracted when working with the CTC models. We did not use any language model integration here. The results corresponding to the attention-based models are presented in Table 3. We performed experiments with LSTM and BLSTM models. Our model used 4 layers of 128 units with an initial decreasing dropout keep probability of 0.9 in the encoder. As a decoder we used an attention-based LSTM configured as in the encoder. We used a Bahdanau-style attention mechanism [2] with a hidden layer size of 128. The output at each time step was the attention value, so the attention tensor was propagated to the next time step via the state to the top LSTM output. The batch size was 36 and the learning rate 0.05. The Adam algorithm [14] was used for the optimization with β1 = 0.85, β2 = 0.997 and ε = 10^-6. We initialized the weights randomly from the uniform distribution on the interval [-1; 1] without scaling variance.
Table 3. Attention-based models' results.

Model    CER, %   WER, %   Decode   Train
LSTM     19.15    28.47    0.279    389.2
BLSTM    19.08    27.83    0.285    401.8

7 Discussion

As we expected, we found that the CTC model without a LM works only moderately well for the Russian language. The model makes mistakes in constructing words and sentences from the recognized characters, but the obtained phonemic transcription is quite
similar to the original. The best result is a CER of 14.96% and a WER of 25.53%, achieved with the ResNet model and an external LM. As we mentioned in the introduction, Russian is characterized by a high degree of grammatical freedom and a complex mechanism of word formation, so a LM is an important part of a recognizer for the Russian language. However, we found that using attention-based models for Russian without integrating a LM allows achieving good results: the BLSTM version of the attention-based model showed a CER of 19.08% and a WER of 27.83%. Still, we could not surpass our baseline, which uses hybrid DNN-HMM models with a LM. However, our models showed better performance in terms of decoding speed, which can be helpful in real systems: the attention-based model without a LM showed a decoding speed of 0.285 utterances per second, which is 19% better than the fastest baseline ResNet model.
8 Conclusion
In this work, we considered the task of Russian speech recognition using end-to-end models such as CTC models and an attention-based encoder-decoder. We used various neural network architectures: multilayer perceptrons, LSTMs and their modifications, and residual convolutional networks. The best result was shown by residual convolutional networks. We achieved recognition accuracy slightly worse than the baseline hybrid models, but we showed that end-to-end models can work well for Russian speech without a language model and that they show better performance in terms of average decoding speed. In the future we will experiment with other types of features, integrate a language model into the attention-based system and perform experiments with the transfer learning technique.

Acknowledgments. This research is supported by the Russian Science Foundation (project No. 18-11-00145).
References 1. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: Implementation and Application of Automata, pp. 11–23 (2007) 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409. 0473
3. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945–4949. IEEE (2016) 4. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE (2016) 5. Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. 13(4), 359–394 (1999). https://doi. org/10.1006/csla.1999.0128. http://www.sciencedirect.com/science/article/pii/ S0885230899901286 6. Cho, K., van Merrienboer, B., G¨ ul¸cehre, C ¸ ., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014). http://arxiv.org/abs/1406.1078 7. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015) 8. Ganchev, T., Fakotakis, N., Kokkinakis, G.: Comparative evaluation of various MFCC implementations on the speaker verification task. Proc. SPECOM 1, 191– 194 (2005) 9. Graves, A., Fern´ andez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning, pp. 369– 376. ACM (2006) 10. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1764–1772 (2014) 11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016) 12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 13. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167 (2015). http://arxiv.org/ abs/1502.03167 14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980 15. Kipyatkova, I., Karpov, A.: Lexicon size and language model order optimization ˇ for Russian LVCSR. In: Zelezn´ y, M., Habernal, I., Ronzhin, A. (eds.) Speech and Computer, pp. 219–226. Springer, Cham (2013) 16. Kipyatkova, I., Karpov, A.: DNN-based acoustic modeling for Russian speech recognition using Kaldi. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 246–253. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7 29 17. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Netw. 3361(10), 1995 (1995) 18. Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3367–3375, June 2015. https://doi.org/10.1109/CVPR.2015.7298958
19. Liao, H., McDermott, E., Senior, A.: Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 368–373. IEEE (2013) 20. Markovnikov, N., Kipyatkova, I., Karpov, A., Filchenkov, A.: Deep neural networks ˇ zka, J. (eds.) in Russian speech recognition. In: Filchenkov, A., Pivovarova, L., Ziˇ AINL 2017. CCIS, vol. 789, pp. 54–67. Springer, Cham (2018). https://doi.org/10. 1007/978-3-319-71746-3 5 21. Miao, Y., Gowayyed, M., Metze, F.: EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 167–174. IEEE (2015) 22. Mohri, M., Pereira, F., Riley, M.: Weighted finite-state transducers in speech recognition. Comput. Speech Lang. 16(1), 69–88 (2002) 23. Popovi´c, B., Pakoci, E., Pekar, D.: End-to-End large vocabulary speech recognition for the Serbian language. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 343–352. Springer, Cham (2017). https://doi. org/10.1007/978-3-319-66429-3 33 24. Povey, D., et al.: The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, No. EPFL-CONF-192584, IEEE Signal Processing Society (2011) 25. Ravindran, S., Demirogulu, C., Anderson, D.V.: Speech recognition using filterbank features. In: The Thrity-Seventh Asilomar Conference on Signals, Systems Computers, vol. 2, pp. 1900–1903, November 2003. https://doi.org/10.1109/ ACSSC.2003.1292312 26. Soltau, H., Liao, H., Sak, H.: Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition (2016) 27. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp. 3104–3112 (2014) ˇ 28. Verkhodanova, V., Ronzhin, A., Kipyatkova, I., Ivanko, D., Karpov, A., Zelezn´ y, M.: HAVRUS corpus: high-speed recordings of audio-visual Russian speech. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 338–345. Springer, Cham (2016). https://doi.org/10.1007/978-3-31943958-7 40
Correction of Formal Prosodic Structures in Czech Corpora Using Legendre Polynomials

Martin Matura and Markéta Jůzová
Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
{mate221,juzova}@kky.zcu.cz
Abstract. Naturalness is a very important aspect of speech synthesis that is necessary for pleasant and undemanding listening and understanding of synthesized speech. However, in unit selection, unexpected changes in F0 caused by unit transitions can lead to inconsistent prosody. This paper proposes a two-phase classification-based method that improves the overall prosody by correcting the formal prosodic description of speech corpora. For speech data representation, the authors decided to use Legendre polynomials.

Keywords: Anomaly detection · One-class SVM · Multiclass SVC · Formal prosodic grammar · Prosodemes · Unit selection speech synthesis

1 Introduction
In human speech, the fundamental frequency varies within a sentence. The F0 contour, in general, is closely related to the position of stressed syllables and also to the phrasing of the sentence. The F0 movements (increases/decreases), especially at the phrase-final position, have a communication function in the particular language – a mismatch in these movements can cause misunderstanding of the sentence's meaning [15,24]. Therefore, it is evident that a correct prosodic description of speech corpora is one of the crucial issues in text-to-speech synthesis. In general, in the unit selection method, the join and target costs are computed to ensure that the optimal sequence of units is selected. These costs control the smoothness of the concatenated neighbouring units, as well as the unit's suitability for the required position in the synthesized sentence. In our TTS ARTIC [11,20], besides concatenation smoothness, symbolic prosody features, called prosodemes (Sect. 3, [17,18]), are used in the target cost to ensure the synthesized speech keeps the required communication function (i.e. listeners are able to distinguish declarative sentences from questions) [10]. However, due to some inaccuracies in the formal prosodic description of speech data, speech
units are sometimes used in a different context than the one in which they were pronounced by the speaker and to which they belong. This may be manifested in the synthetic speech e.g. by an unnaturally dynamic melody or by inappropriate stress perception. The presented paper focuses on the symbolic prosodic labels in our speech corpora and, using the powerful Legendre polynomials (Sect. 2), offers a two-phase algorithm for their correction. The initial experiments were carried out in [14] and showed that a description of an F0 contour based on Legendre polynomials is sufficient for classification-based approaches.
2 Legendre Polynomials
To describe the F0 contours, the authors used Legendre polynomials [9] – contrary, e.g., to the usage of Gaussian mixture models by the author of [7], or the HMM models used in [5,6] for the correction of wrongly labelled formal prosodic structures in speech corpora. These polynomials are frequently encountered in physics and other technical fields. Legendre polynomials are defined by Eq. 1,

$$L_n(x) = 2^n \sum_{k=0}^{n} x^k \binom{n}{k} \binom{\frac{n+k-1}{2}}{n}, \qquad (1)$$
and they form an orthogonal (i.e., non-correlated) basis suitable for modelling F0 contours [4,23]. An F0 contour is described by coefficients as a linear combination of these polynomials. Because of the orthogonality, the coefficients can be estimated using cross-correlation at a time lag of 0 (i.e., the mutual energy of the F0 contour and the Legendre polynomial). The first four polynomials L0(x), L1(x), L2(x) and L3(x) (see Fig. 1a) have a linguistic interpretation: L0(x) corresponds to the mean value of the pitch, L1(x) to a rise or fall depending on the positive or negative sign of the coefficient (the slope is determined by its absolute value), L2(x) to a peak or valley, and L3(x) to a wave shape of the F0 contour. For the purposes of the presented experiment, the authors used the mPraat toolbox for Matlab [1]; for each F0 contour, the frequency values were transferred to the semitone scale, the contour was interpolated in 1,000 equidistant points, and the first four Legendre coefficients were estimated (for example, see Fig. 1b; the coefficients are 10.7407 (mean value), -2.6880 (falling slope), -1.5522 (valley shape) and 0.1685 (only a slight wave curvature)).
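The same processing chain can be approximated in Python; the sketch below replaces the mPraat calls with NumPy, assumes a 1 Hz semitone reference, and uses a least-squares fit from NumPy's Legendre module instead of the cross-correlation estimate (on a dense grid over an orthogonal basis the two give essentially the same coefficients):

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coefficients(f0_hz, n_points=1000, n_coefs=4, ref_hz=1.0):
    """Describe an F0 contour by its first four Legendre coefficients
    (mean level, slope, peak/valley, wave), as in Sect. 2."""
    f0_st = 12.0 * np.log2(np.asarray(f0_hz) / ref_hz)        # semitone scale
    t_old = np.linspace(-1.0, 1.0, len(f0_st))
    t_new = np.linspace(-1.0, 1.0, n_points)                   # 1,000 equidistant points
    contour = np.interp(t_new, t_old, f0_st)                   # linear interpolation
    return legendre.legfit(t_new, contour, deg=n_coefs - 1)    # c0..c3

# Toy falling contour from 180 Hz to 120 Hz: c0 ~ mean level, c1 < 0 (falling slope).
f0 = np.linspace(180.0, 120.0, 50)
print(legendre_coefficients(f0))
```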
3 Symbolic Prosody Features in Speech Corpora
Fig. 1. Setup of the experiment.

The authors of [17,18] introduced a new formal prosodic model to be used in text-to-speech systems to control the appropriate usage of intonation schemes within the synthesized sentence; the original idea was based on the Czech classical phonetic view described in [15]. This grammar parses the given text sentence into a derivation tree, and each prosodic word (PW, i.e. a group of words with only one word stress) is assigned an abstract prosodic unit, a prosodeme, marked as PX. The former grammar was focused mainly on the differentiation of phrase-final and other PWs in the sentence, since phrase-final words are, in general, characterized by a distinct increase/decrease of F0 and have a certain communication function. However, as shown in [8], the phrase-initial PWs also differ from the following words, especially by an increase of F0 [24]. Recently, based on these observations, the grammar was extended to describe the phrase-initial PWs by a new prosodeme type (P0.1, see below). In our TTS ARTIC [11,20], we distinguish the following prosodeme types assigned to each PW (see also Fig. 2):

– P1 – prosodeme terminating satisfactorily (the last PWs of declarative sentences)
– P2 – prosodeme terminating unsatisfactorily (the last PWs of questions)
– P3 – prosodeme non-terminating (the last PWs in intra-sentence phrases)
– P0 – null prosodeme (assigned otherwise)
– P0.1 – special type of null prosodeme (assigned to the first PW in phrases)
Fig. 2. The illustration of the tree built using the extended prosodic grammar [8,18] for the Czech sentence "It will get colder and it will snow heavily, so he did not come."

The prosodeme types are used in speech synthesis to ensure the required communication function on the phrase level of synthesized sentences [10,22] – the usage of a correct prosodeme type is controlled by the target cost computation in the unit selection method. Unfortunately, even though professional speakers recorded the speech corpora for the purposes of TTS, the prosodic description of the recorded sentences (based on the formal prosody grammar applied to the texts of segmented sentences) sometimes does not correspond to the real F0 contours. The problems mainly appear within the null prosodeme, where "neutral" speech is expected, but the speaker could pronounce a word in an unexpected way regarding prosody. This inaccurate description (and thus the wrong usage of some speech units in the synthesis itself) may lead to an unnatural excessive increase or decrease of the F0 contour in a non-phrase-final prosodic word with
the null prosodeme which could be manifested by an inappropriate stress or an unnatural melody or, eventually, it may result in a misunderstanding due to not keeping the required communication function. In the presented paper, the experiments are carried out on two large speech corpora – AJ and MR [12,20]. The male synthetic voice, built from AJ corpus, is widely used in commercial products for its high naturalness. On the other hand, the female synthetic voice, built from MR corpus, is not very consistent in prosody (her prosody is very dynamic) – given the original prosodic description baseline, synthesized sentences quite often contain an unnatural intonation pattern (especially in the null prosodeme). The complete statistics of the corpora are listed in Table 1.

Table 1. Number of prosodic words labelled by a specific prosodeme type.

Corpus   No. of sentences   No. of PWs   P0       P1       P2    P3       P0.1
AJ       12,277             84,733       35,781   9,850    922   12,141   26,039
MR       12,308             83,486       41,728   11,017   905   7,953    21,883
4 Correction Process
The basic idea behind the correction process is simple. With inconsistent prosody, the speech created by the unit selection does not sound natural and it is unpleasant to listen to due to speech artefacts. If we were able to correct wrongly marked prosodic words, we might achieve more fluent and consistent prosody, which would lead to a better synthesis. The correction process has two
phases, and the choice of a suitable data description is a principal issue. Although prosodemes (Sect. 3) are the only symbolic prosody features, each prosodeme type corresponds to specific changes in the F0 contour – these can be modelled by the Legendre polynomials (Sect. 2), whose first four coefficients are used as the only representation of our data in the presented experiment. In the first phase, anomalies among the null prosodemes are detected (Sect. 4.1). In the second phase, the detected outliers are classified by a multiclass classifier that gives them new labels (Sect. 4.2). Both phases are described below in detail. After the correction, an evaluation by listening tests was performed (see Sect. 5).
4.1 Phase One: Anomaly Detection
Anomaly (or novelty) detection [2,13] is a well-known approach used to find items that do not have the same or similar properties as other items in a dataset. Our previous study [14] showed that, among other classification methods, the One-class Support Vector Machine (OCSVM) is the most suitable for this experiment. We are using the implementation of OCSVM from scikit-learn [16], which is based on libsvm [3], with a radial basis function as the kernel and γ = 0.1. The parameter ν, which influences an upper bound on the fraction of training errors, was set to 10% – this value is the authors' estimation of the proportion of possibly wrongly labelled PWs in the corpora. Since we are looking for anomalies only in our closed dataset, we can afford to train the OCSVM model on the whole dataset to get the best decision function possible. We trained two OCSVM models. The first one was trained using the 35,781 P0 prosodemes from the AJ corpus and the second one using the 41,728 P0 prosodemes from the MR corpus. After training the models, we tested how these models react to the different types of prosodemes and also to the training data. We detected anomalies in each group of prosodemes using the OCSVM model to obtain the number of outliers for each group. Since the model was trained with P0 prosodemes, where we supposed 10% of anomalies, we expected the number of outliers to be about 10% for P0 and significantly higher for the other groups. The results shown in Table 2 confirm our assumption – most of the P1 prosodemes were correctly detected as anomalous by the OCSVM model trained on P0. All the results are described in [14].
Table 2. Number of anomalies detected by OCSVM.

Corpus   P0              P1
AJ       3,578 (10.0%)   8,508 (86.4%)
MR       4,174 (10.0%)   10,317 (93.6%)
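A minimal sketch of this phase-one setup with scikit-learn; the feature matrix below is a random placeholder standing in for the four Legendre coefficients of the P0 prosodic words:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder for the real feature matrix: one row of four Legendre
# coefficients per P0 prosodic word (35,781 rows for the AJ corpus).
rng = np.random.default_rng(0)
X_p0 = rng.normal(size=(1000, 4))

# RBF kernel, gamma = 0.1, nu = 0.1 (the assumed 10% of mislabelled PWs).
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
ocsvm.fit(X_p0)

labels = ocsvm.predict(X_p0)      # +1 = consistent with P0, -1 = outlier
print((labels == -1).sum(), "of", len(X_p0), "P0 prosodic words flagged as anomalous")
```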
4.2 Phase Two: Outliers Classification
By detecting the anomalies in the P0 prosodemes, we obtained a group of outliers whose F0 does not have a "neutral" contour. These outliers can be either strongly
penalized to exclude them from the speech synthesis process, as described in Sect. 5.1 (see [14]), or classified into another prosodeme class – as mentioned in Sect. 3, apart from P0, we distinguish 4 other prosodeme types: P1, P2, P3 and P0.1. To perform the multi-class classification of the P0 outliers, we had to train an appropriate model for each corpus. We collected all available prosodeme data from one corpus to cover all the prosodeme types and then trained a Support Vector Classifier (SVC) from scikit-learn as our multi-class model. SVC uses a one-vs-all decision function, and since our data are not evenly distributed among all types of prosodemes, we set the class weight parameter to "balanced", which means the weight of each class is adjusted inversely proportionally to the class frequencies in the input data. As in the previous case, we were working on the closed dataset and therefore we could use all data to train the classification model. The classification and relabelling of the P0 outliers was done again for both corpora. We classified 3,578 outliers in the AJ corpus and 4,174 outliers in the MR corpus; the classification results are listed in Table 3.
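Phase two can be sketched in the same spirit; again the data below are random placeholders standing in for the Legendre features and prosodeme labels of one corpus:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: four Legendre coefficients and prosodeme labels for all
# prosodic words of one corpus, plus the P0 outliers found in phase one.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(200, 4))
y_all = rng.choice(["P0", "P1", "P2", "P3", "P0.1"], size=200)
X_outliers = rng.normal(size=(10, 4))

# Balanced class weights compensate for the uneven prosodeme distribution (Table 1).
svc = SVC(class_weight="balanced")
svc.fit(X_all, y_all)
print(svc.predict(X_outliers))    # proposed prosodeme type for each P0 outlier
```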
3,578
MR
4,174
P1
P2
P3
1,559 (43.6%) 189 (5.3%) 328 (9.2%) 328 (9.2%)
P0.1 1,174 (32.8%)
988 (23.7%) 385 (9.2%) 145 (3.5%) 817 (19.6%) 1,839 (44.1%)
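A minimal sketch of this classification step is shown below; the toy data and class proportions are ours, while the "balanced" class-weight setting follows the description above.

```python
import numpy as np
from sklearn.svm import SVC

# X: Legendre-coefficient vectors for all prosodeme types in one corpus,
# y: their prosodeme labels. Random toy data only.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = rng.choice(["P0", "P1", "P2", "P3", "P0.1"],
               size=2000, p=[0.7, 0.1, 0.05, 0.05, 0.1])

# "balanced" re-weights each class inversely to its frequency, which
# compensates for the dominance of P0 in the training data.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)

# Relabel the P0 outliers found by the OCSVM (here: toy vectors).
X_outliers = rng.normal(size=(5, 4))
print(clf.predict(X_outliers))
```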
It is obvious that most of the P0 outliers (76.3%) from the MR corpus were labelled as a different type of prosodeme. However, 23.7% of them were given the P0 label again. These outliers were picked by the OCSVM model as anomalies because their properties were somehow different from the other P0 data. Nevertheless, the properties of these outliers are still more similar to the P0 prosodeme than to another prosodeme type, hence the SVC labelled them as P0. The situation for the AJ corpus is analogous, with the difference that even more outliers were labelled back as P0. This is probably caused by the different prosody consistency of each corpus. The intonation of the AJ speaker was more consistent and precise compared to the MR speaker, and therefore the classifier assigned the P0 type back more often than in the case of the MR corpus. The evaluation of the prosodeme corrections is further described in Sect. 5.2.
5 Evaluation
To evaluate the process proposed in Sect. 4, we carried out two listening tests (see Sects. 5.1 and 5.2) in our new listening test framework. Both tests had the same structure; both were 3-scale preference listening tests. The listeners compared sentences generated by our baseline TTS system ARTIC (with the original corpora, TTS-base) and those generated by a modified system TTS-new
built on the fixed corpora (based on the classification described in Sects. 4.1 and 4.2, respectively). They were instructed to use earphones and to compare the overall quality of samples A and B in each pair by selecting one of these options:
– Sentence A sounds better.
– I cannot decide.
– Sentence B sounds better.
The answers were normalized for each listener and pair of samples in the listening test to p = 1 where the TTS-new output was preferred, to p = −1 where the TTS-base output was preferred and p = 0 otherwise. These values were used for the final computation of the listening test score s, defined by Eq. 2,

s = \frac{\sum_{p \in T} p}{|T|},    (2)

where T is the set of all answers from all listeners. A positive value of s indicates an improvement of the overall quality when using TTS-new.
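As an illustration, the score s of Eq. 2 can be computed from the already normalized answers as follows; the function name is ours.

```python
def listening_test_score(answers):
    """Compute the preference score s from normalized listener answers.

    answers: iterable of values in {+1, 0, -1}, where +1 means the
    TTS-new output was preferred, -1 the TTS-base output, 0 no preference.
    """
    answers = list(answers)
    return sum(answers) / len(answers)

# Toy example: 6 preferences for TTS-new, 2 for TTS-base, 2 undecided.
print(listening_test_score([1] * 6 + [-1] * 2 + [0] * 2))  # 0.4
```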
5.1 Evaluation of Phase One
First, we evaluated phase one, the anomaly detection using OCSVM described in Sect. 4.1, directly in the unit selection speech synthesis itself [14]. In this evaluation, the modified TTS-new represents a system which highly penalizes units originating from anomalous PWs (those detected by OCSVM) during the Viterbi search [21]. This "ban" should ensure that these "strange" (anomalous) units are not used in the synthesis, which may, hopefully, increase the naturalness of the synthetic speech. On the other hand, about 10% of all P0 units are dropped by this approach – this should, however, not be a big problem since the corpora are quite large and were carefully designed [12] to cover all the different units sufficiently. In any case, this approach results in a different sequence of units compared to that generated by TTS-base. To select the sentences for the listening test, we synthesized 6,000 sentences with TTS-base and TTS-new and randomly selected 20 sentences for each voice so that they fulfilled the criterion of containing 8 or more anomalous units (similarly to [19], but with the number of anomalous unit occurrences in the TTS-base sentences as the selection criterion in this experiment). Thus, the whole listening test contained 40 pairs of synthesized sentences; each pair included two variants of the same sentence – one generated by TTS-base and one generated by the modified system TTS-new. The results of the listening test, obtained from 16 listeners (5 of them being speech experts), are presented in Table 4. TTS-new was preferred for both voice corpora, and the results are statistically significant (as shown in [14]). The positive score values s indicate that penalizing outlier speech units (those originating from PW outliers detected by OCSVM using Legendre polynomial coefficients) leads to more natural synthetic speech.
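The actual target-cost implementation of ARTIC is not part of this excerpt; the following is only a simplified sketch of the "ban" idea, in which candidate units originating from anomalous prosodic words receive a prohibitive extra cost. The attribute and function names are hypothetical.

```python
ANOMALY_PENALTY = 1e6  # prohibitively large compared to regular costs

def penalized_target_cost(unit, target_spec, base_target_cost, anomalous_pw_ids):
    """Add a prohibitive penalty for candidate units from anomalous prosodic words.

    unit: candidate unit with a .pw_id attribute identifying its source
          prosodic word (hypothetical attribute name).
    base_target_cost: the original target-cost function of the system.
    anomalous_pw_ids: set of prosodic-word ids flagged by the OCSVM.
    """
    cost = base_target_cost(unit, target_spec)
    if unit.pw_id in anomalous_pw_ids:
        cost += ANOMALY_PENALTY  # effectively bans the unit from selection
    return cost
```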
Table 4. Results of the first listening test.

Corpus  TTS-base better  Same quality  TTS-new better  Score s
AJ      62 (19.4%)       76 (23.7%)    182 (56.9%)     0.375
MR      104 (32.5%)      79 (24.7%)    137 (42.8%)     0.103
Total   166 (25.9%)      155 (24.2%)   319 (49.9%)     0.239

5.2 Evaluation of Phase Two
The results presented in the previous section indicate an improvement in the quality of speech synthesis when the units originating from P0 words detected as outliers by OCSVM are penalized. However, in phase two the outliers were relabelled by a multi-class SVM classifier (described in Sect. 4.2), so they can be used in the synthesis with the newly assigned labels. In this case, TTS-new uses the same penalization of a prosodeme-type mismatch in the target cost computation as the baseline TTS-base; the only difference between the two systems is the data with prosodeme labels – TTS-new uses the relabelled speech corpora, TTS-base the original speech corpora presented in Sect. 3. Again, when designing sentences for the second listening test, we followed the methodology described in [19], with the selection criterion based on the number of relabelled unit occurrences in the TTS-new sentences. By this procedure, we randomly selected 10 sentences for each non-null prosodeme type for both voices (80 sentences in total) to find out how the relabelled units perform in new prosodic contexts. This listening test was completed by 16 listeners, 6 of them being speech synthesis experts. The results listed in Table 5 show that the relabelled prosodemes did not cause any serious problem in the synthesized sentences; the outputs of TTS-new were sometimes even rated much better by the listeners than the TTS-base outputs.
6 Conclusion and Future Work
In the presented paper, we examined the usage of the Legendre polynomials for the correction of a formal prosody grammar. The corpora we have been working with contained inconsistencies in the prosody description – some prosodic words were labelled as "neutral" (P0) in terms of prosody even though their F0 did not have a neutral contour. Therefore, we proposed a two-phase correction method to correct these wrongly labelled prosodemes. To represent our data, we took only the first four coefficients of the Legendre polynomials and then trained a One-Class Support Vector Machine (OCSVM) detector and a multi-class Support Vector Classifier (SVC). In the first phase, outliers among the P0 prosodemes were detected by the OCSVM and then, in the second phase, we classified them with the multi-class SVC to obtain new labels for the P0 outliers. Afterwards, we conducted
Table 5. Results of the second listening test.

Corpus             Prosodeme  TTS-base better  Same quality  TTS-new better  Score s
AJ                 P0.1       18 (11.3%)       111 (69.4%)   31 (19.4%)      0.081
                   P1         25 (15.6%)       103 (64.4%)   32 (20.0%)      0.044
                   P2         24 (15.0%)       46 (28.8%)    90 (56.3%)      0.413
                   P3         38 (23.8%)       65 (40.6%)    57 (35.6%)      0.119
MR                 P0.1       17 (10.6%)       124 (77.5%)   19 (11.9%)      0.013
                   P1         44 (27.5%)       68 (42.5%)    48 (30.0%)      0.025
                   P2         26 (16.3%)       72 (45.0%)    62 (38.8%)      0.225
                   P3         49 (30.6%)       47 (29.4%)    64 (40.0%)      0.094
AJ corpus - total             105 (16.4%)      325 (50.8%)   210 (32.8%)     0.164
MR corpus - total             136 (21.5%)      311 (48.6%)   193 (30.2%)     0.089
Total                         241 (18.8%)      636 (49.7%)   403 (31.5%)     0.127
two listening tests to evaluate the benefit of this approach. In the first test, we verified that the synthetic speech sounds better if the anomalous P0 prosodemes are not used. In the second test, we found out that if we relabel the anomalies to a different prosodeme type, we can still use them and the quality of speech does not decrease. Hence, we do not need to penalize the anomalies or throw them away, which would be a waste of data. Furthermore, in some cases the synthesized speech even gets better with these relabelled prosodemes. As future work, we would like to test this method on our other corpora (Czech, English, Russian, etc.) and we also want to compare the quality of speech synthesized without all the anomalies with that using the relabelled variants of them.

Acknowledgements. The work has been supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References 1. Boˇril, T., Skarnitzl, R.: Tools rPraat and mPraat. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 367–374. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5 42 2. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009) 3. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). Software http://www.csie.ntu. edu.tw/∼cjlin/libsvm
4. Grabe, E., Kochanski, G., Coleman, J.: Connecting intonation labels to mathematical descriptions of fundamental frequency. Lang. Speech 50(Pt 3), 281–310 (2007) 5. Hanzl´ıˇcek, Z.: Classification of prosodic phrases by using HMMs. In: Kr´ al, P., Matouˇsek, V. (eds.) TSD 2015. LNCS (LNAI), vol. 9302, pp. 497–505. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24033-6 56 6. Hanzl´ıˇcek, Z.: Correction of prosodic phrases in large speech corpora. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 408–417. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5 47 7. Hanzl´ıˇcek, Z., Gr˚ uber, M.: Initial experiments on automatic correction of prosodic annotation of large speech corpora. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2014. LNCS (LNAI), vol. 8655, pp. 481–488. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10816-2 58 8. J˚ uzov´ a, M., Tihelka, D., Vol´ın, J.: On the extension of the formal prosody model for TTS. In: TSD. Lecture Notes in Computer Science. Springer (2018) 9. Legendre, A.M.: Recherches sur l’attraction des sph´ero¨ıdes homog`enes. In: M´emoires de math´ematique et de physique, present´es ` a l’Acad´emie royale des sciences, par divers s¸cavans & lˆ us dans ses assembl´ees, Paris, pp. 411–435 (1785) 10. Matouˇsek, J., Leg´ at, M.: Is unit selection aware of audible artifacts? In: SSW 2013. Proceedings of the 8th Speech Synthesis Workshop, pp. 267–271. ISCA, Barcelona, Spain (2013) 11. Matouˇsek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006). https://doi.org/10. 1007/11846406 55 12. Matouˇsek, J., Tihelka, D., Romportl, J.: Building of a speech corpus optimised for unit selection TTS synthesis. In: LREC 2008, pp. 1296–1299. ELRA, Marrakech, Morocco (2008) 13. Matouˇsek, J., Tihelka, D.: Anomaly-based annotation errors detection in tts corpora. In: INTERSPEECH, pp. 314–318. ISCA, Dresden, Germany (2015) 14. Matura, M., J˚ uzov´ a, M.: Using anomaly detection for fine tuning of formal prosodic structures in speech synthesis. In: TSD. Lecture Notes in Computer Science, Springer (2018) ˇ 15. Palkov´ a, Z.: Rytmick´ a, v´ ystavba prozaick´eho textu. Studia CSAV; ˇc´ıs. 13/1974. Academia (1974) 16. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011) 17. Romportl, J.: Structural data-driven prosody model for TTS synthesis. In: Proceedings of the Speech Prosody 2006 Conference, pp. 549–552. TUDpress, Dresden (2006) 18. Romportl, J., Matouˇsek, J.: Formal prosodic structures and their application in NLP. In: Matouˇsek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 371–378. Springer, Heidelberg (2005). https://doi.org/10. 1007/11551874 48 19. Tihelka, D., Gr˚ uber, M., Hanzl´ıˇcek, Z.: Robust methodology for TTS enhancement evaluation. In: Habernal, I., Matouˇsek, V. (eds.) TSD 2013. LNCS (LNAI), vol. 8082, pp. 442–449. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3642-40585-3 56 20. Tihelka, D., Hanzl´ıˇcek, Z., J˚ uzov´ a, M., V´ıt, J., Matouˇsek, J., Gr˚ uber, M.: Current state of text-to-speech system ARTIC: A decade of research on the field of speech technologies. In: TSD. Lecture Notes in Computer Science (2018)
21. Tihelka, D., Kala, J., Matouˇsek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: INTERSPEECH, pp. 174–177. ISCA, Makuhari, Japan (2010) 22. Tihelka, D., Matouˇsek, J.: Unit selection and its relation to symbolic prosody: a new approach. In: INTERSPEECH, vol. 1, pp. 2042–2045. ISCA, Bonn (2006) 23. Vol´ın, J., Tykalov´ a, T., Boˇril, T.: Stability of prosodic characteristics across age and gender groups. In: INTERSPEECH, pp. 3902–3906. ISCA, Stockholm, Sweden (2017) 24. Vol´ın, J.: Extrakce z´ akladn´ı hlasov´e frekvence a intonaˇcn´ı gravitace v ˇceˇstinˇe. Naˇse ˇreˇc 92(5), 227–239 (2009)
On the Contribution of Articulatory Features to Speech Synthesis

Martin Matura(B), Markéta Jůzová, and Jindřich Matoušek

Department of Cybernetics and New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
{mate221,juzova,jmatouse}@kky.zcu.cz

Abstract. There are several features that are used for unit selection speech synthesis. Among those most used for computing a concatenation cost are energy, F0 and Mel-frequency cepstral coefficients (MFCC), which usually give a good description of a speech signal. In our work, we focus on the usage of articulatory features. We want to determine whether they are correlated with MFCC and, in that case, whether they can replace MFCC or bring new information into the process of speech synthesis. To obtain the articulatory data, we used the electromagnetic articulograph AG501 and then examined the correlation of two sequences of join costs, each described by different features.

Keywords: Articulatory features · Electromagnetic articulograph · Join cost · Correlation

1 Introduction
In unit selection speech synthesis, a good description of speech units is crucial for a high quality of the resulting synthesized speech. The process of choosing the best unit sequence is controlled by the Viterbi search [18], a searching algorithm based on finding the lowest-cost path through a graph with concatenation costs (join cost, JC) on edges and target costs (TC) on nodes. Fundamental frequency, energy and Mel-frequency cepstral coefficients (MFCC) are the most common features for the join cost computation and indicate how well the neighbouring units can be joined together [2,9] – in other words, they ensure a smooth transition between units regarding prosodic and acoustic features. On the other hand, the target cost ensures the selection of an appropriate unit for the required position, regarding also prosodic and phonetic contexts. Speech itself, when created by a human, is basically a result of appropriate movements of the human articulators (lips, tongue, palate, etc.) in the form of an airflow. The airflow is shaped by the articulators according to the sounds present in the produced speech. The articulatory data, obtained by an electromagnetic articulograph, represent the movement and changes in the position of the human articulators; hence, they are promising candidates for additional features describing a speech unit. Apart from MFCC, which describe a frequency spectrum that
is closely related to the speech but can be inaccurate due to masking effects, the articulatory data capture directly the changes between the articulators which create the speech, and that is why they could contribute to the selection of better-related speech units. There are not many studies reporting the usage of articulatory features in unit selection speech synthesis. Nevertheless, in the recent study [15], the quality of speech synthesis was tested with different types of features – articulatory, acoustic and articulatory-acoustic – and the study shows that articulatory features have the potential to become another reasonable set of features used for the join cost computation. Articulatory data have also been used for the speech recognition task [22], and there are many studies concerning acoustic-to-articulatory inversion mapping, e.g. [8,19,20]. Unfortunately, based on our experience, obtaining a set of articulatory features is a quite demanding task (Sect. 2). Therefore, in the first instance, we want to find out whether there is a dependence between articulatory features (AF) and MFCC and whether AF can bring new information into the process of speech synthesis of the Czech language. Once we prove the contribution of AF (new information compared to MFCC), the usage of articulatory features will be tested in our TTS system ARTIC [11,17], because we expect it may improve the selected speech unit sequence, as reported in [15]. For this purpose, an electromagnetic articulograph was used to record our own dataset, since there is no publicly available articulatory corpus for the Czech language, like MOCHA-TIMIT [21] or mngu0 [14] for English or MSPKA [1] for Italian. The scope of this paper covers the testing of the dependence of MFCC and AF. Since these features are hardly comparable (AF represent the real movements of human articulators during speech, MFCC are computed from the frequency spectrum), they are used for the join cost computation of the sequence of units during speech synthesis. The resulting sequences of the two different join costs are used for the correlation coefficient computation.

1.1 Join Cost in Unit Selection
As described in [9], the join cost in unit selection speech synthesis, which ensures a smooth transition of speech units during the Viterbi search [18], consists of three sub-components – the difference in energy (E), the difference in fundamental frequency F0 (together constituting the prosodic component of the join cost) and the Euclidean distance of 12 MFCC, the acoustic component of JC. The values of all features are calculated pitch-synchronously [6,7,10] and the total join cost is calculated as an average of the values of the sub-components. For two unit candidates c_{i-1,j} and c_{i,k} (for units u_{i-1} and u_i), the join cost is defined as follows (Eq. 1):

JC(c_{i-1,j}, c_{i,k}) = w_{F0} \cdot JC_{F0}(c_{i-1,j}, c_{i,k}) + w_{E} \cdot JC_{E}(c_{i-1,j}, c_{i,k}) + w_{MFCC} \cdot JC_{MFCC}(c_{i-1,j}, c_{i,k}),    (1)
where c_{i-1,j} is the j-th candidate for the unit u_{i-1} in the synthesized sentence and c_{i,k} is the k-th candidate for the unit u_i. For two unit candidates c_{i-1,j}, c_{i,k}, the MFCC join cost component JC_{MFCC} is defined by Eq. 2 as the Euclidean distance of the 12-dimensional MFCC vectors:

JC_{MFCC}(c_{i-1,j}, c_{i,k}) = \sqrt{\sum_{n=1}^{12} \left(c_{i-1,j}(n) - c_{i,k}(n)\right)^{2}}    (2)
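For illustration, a minimal sketch of Eqs. 1 and 2 is given below; the exact form of the F0 and energy sub-costs is not specified in the text, so simple absolute differences are assumed here, and the weights are placeholders.

```python
import numpy as np

def jc_mfcc(mfcc_prev, mfcc_next):
    """Euclidean distance between the 12-dimensional MFCC vectors of two candidates (Eq. 2)."""
    return float(np.linalg.norm(np.asarray(mfcc_prev) - np.asarray(mfcc_next)))

def join_cost(cand_prev, cand_next, w_f0=1.0, w_e=1.0, w_mfcc=1.0):
    """Weighted join cost of Eq. 1 for two unit candidates.

    Each candidate is assumed to be a dict with 'f0', 'energy' and 'mfcc'
    entries at the join boundary; the absolute-difference sub-costs and the
    weights are assumptions made for this sketch.
    """
    jc_f0 = abs(cand_prev["f0"] - cand_next["f0"])
    jc_e = abs(cand_prev["energy"] - cand_next["energy"])
    return (w_f0 * jc_f0
            + w_e * jc_e
            + w_mfcc * jc_mfcc(cand_prev["mfcc"], cand_next["mfcc"]))
```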
When a synthesized sentence is itself included in the speech corpus, the unit candidates originating from that recorded sentence must be selected, because the best possible quality of synthesized speech is obtained by simply playing back the original speech as long as the target specification also matches – this is the basic (and the most obvious) requirement for the selection algorithm. Hence, the costs have to be defined to equal 0 for neighbouring units. The energy and F0 values are continuous in continuous speech (except for unvoiced segments) and the Euclidean distance of the MFCC vectors of two neighbouring units is also zero. Naturally, the same principle holds true for the articulatory features (the courses of the sensor coordinates) – so the distance of the coordinates could, without any doubt, be used in the JC computation as a new component JC_{AF}.
2 Data Acquisition
The electromagnetic articulograph allows a digital recording and representation of articulatory movements over time during the process of speech creation. It uses induction coils above a speaker's head that produce an electromagnetic field. This field induces a current in tiny coils (sensors) in the mouth of the speaker, which allows us to determine the location of the sensors; this issue is described in more detail in [5]. It is important for the speaker to keep the head inside the spherical measuring area under the induction coils, otherwise the results can be distorted. To obtain the articulatory trajectories for our research, we use the 3D electromagnetic articulograph AG501 (EMA), which is more precise than the AG500 [16], with seven sensors attached to the speaker by a physiological adhesive and a sampling frequency of 250 Hz. Three sensors are used as reference points and four sensors measure the trajectories of the articulators. The reference sensors are glued to places on the speaker's head which do not move while he/she is speaking (upper incisors, temporal bone behind the ears). They are used to capture the head movements and afterwards for a subsequent post-processing of the articulation data, since it is necessary to perform head-correction calculations to eliminate the head movements from the articulatory trajectories. The remaining four sensors measure the articulatory trajectories and are attached to the lower incisors (LI), tongue tip (TT), tongue body (TB) and tongue dorsum (TD), as shown in Fig. 1. Articulation data are usually also obtained from sensors placed on the lower and the upper lip; unfortunately, we did not have enough sensors to measure the lip positions.
Fig. 1. Midsagittal view of a human mouth with the placement of the EMA sensors which capture the articulators' trajectories.
As also reported e.g. in [1,13], the process of recording with EMA is not a simple task for speakers. The main problem lies in the detachment of the measuring sensors from the articulators – over the recording time, the physiological adhesive starts to peel away from the soft tissue due to the constant movement and friction inside the mouth. Once a sensor falls off, it is practically impossible to place it back on the exact same spot as before and the recording has to be stopped. Because of this issue, it is important to properly select the sentences for the recording, since only hundreds of sentences have to sufficiently cover all required speech units. The authors decided to use a high-coverage multi-level Czech text corpus designed for a voice banking process of laryngectomized patients [4]; the text corpus building process is described in detail in [3]. The primary requirement for that set of sentences was to maximize the coverage of appropriate speech units, no matter how many sentences would finally be recorded, since there was a limited time for recording these patients. The building of synthetic voices for the patients (lasting for several years at the authors' department) has proved that the unit selection method can be used with only approximately four or five hundred recorded sentences (depending on the patient's voice quality). In any case, the main idea of the text corpus building perfectly matches the issue of articulatory data recording – nobody knows in advance how long the speaker will be able to record with all the sensors attached. The msak0 speaker in MOCHA-TIMIT uttered 460 sentences, but during the session some sensors had to be re-attached. Thus, to ensure the longest possible recording time, it is also very important to carefully prepare the speaker's articulators before attaching the sensors to them. The pilot (female) professional speaker, whose data were used for the presented experiment, was asked to brush her teeth and tongue first, then we dry-cleaned the tongue and glued the sensors to the desired positions shown in Fig. 1. Nevertheless, in spite of our careful preparation, we were not able to record more than 380 sentences (35 min of speech data) in one continuous recording session without one of
the sensors coming off. As reported in [14], over 1300 sentences (67 min of speech data) were recorded there in one recording session – therefore, we are now working on some improvements of the recording and pre-recording process to be able to obtain more speech data for our future experiments. After the recording, a database of speech units with articulatory features was created. We removed noise from the recordings, applied the head-correction post-processing to the articulatory data and then assigned the data to the corresponding speech units. As the articulatory features, we selected only the X and Y coordinates (Fig. 2) of all four sensors (LI, TT, TB, TD). Note that the rotations of the sensor coils were not considered and we decided to leave out the Z coordinates since they did not show much movement – side-to-side movement of the articulators is not very common during speech production. We also
Fig. 2. Trajectories of X and Y coordinates of 4 sensors – LI, TT, TB, TD.
performed an automatic segmentation of the recorded sentences [12] and generated the unit selection features. The prepared database was used in the experiment described in the following sections. Figure 2 shows 224 ms of the X and Y coordinate contours of the articulator sensors (56 values of the EMA sensors at the sampling frequency of 250 Hz). The presented speech segment corresponds to two phonemes ([i], [s]); the vertical line represents the boundary determined by the automatic pitch-synchronous segmentation of the recorded data [12].
3 Experiment
In recent years, articulatory features have started to be used both for acoustic-to-articulatory mapping (whose results are subsequently used in speech synthesis) [8,19,20] and for the unit selection itself as new features (to replace or extend MFCC) [15]. However, to the best of the authors' knowledge, there is no reported study concerning the correlation of articulatory features (AF) and MFCC – whether the AF really bring new information into the process of speech synthesis and thus whether it is really worth using them. Hence, we decided to test the contribution of AF and performed a correlation comparison to find differences or similarities in MFCC and AF behaviour. Due to the difference between these two features, it makes little sense to just compare their contours in the recorded sentences – the AF represent the real movements of human articulators, while MFCC are computed from the speech frequency spectrum by applying the mel triangular filters. Moreover, we want to compare their contribution in the speech synthesis itself, so we decided to compute the correlation coefficient of the sequences of join costs JC representing a synthesized sentence – first with the join cost computed using MFCC and then using AF. Since the prosodic components JC_{F0} and JC_{energy} of the join cost are not related to the third one (and we want to omit their influence from the total JC computation), we focused only on one join cost sub-component – JC_{MFCC} and JC_{AF}, respectively. For the two unit candidates c_{i-1,j} and c_{i,k}, JC_{MFCC} was defined in Sect. 1.1 by Eq. 2, and the concatenation cost component characterized by AF was calculated as the mean of the Euclidean distances of the corresponding X and Y coordinates (dimension n is 2) of the 4 articulatory features (dimension m is 4):

JC_{AF}(c_{i-1,j}, c_{i,k}) = \frac{1}{4} \sum_{m=1}^{4} \sqrt{\sum_{n=1}^{2} \left(c_{i-1,j}(m,n) - c_{i,k}(m,n)\right)^{2}}    (3)

To be able to compute the correlation of JC_{MFCC} and JC_{AF}, it was necessary to have sequences of units of the synthesized sentences described by both JC_{MFCC} and JC_{AF}. To do that, we first synthesized (in a scripting interface of our TTS system ARTIC [17]) a set of randomly selected text sentences using only the acoustic component for the join cost computation:

JC := JC_{MFCC}.    (4)
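A minimal sketch of the JC_{AF} component of Eq. 3 is shown below; it assumes the X and Y coordinates of the four sensors are available as 4×2 arrays at the join boundary, with made-up toy values.

```python
import numpy as np

def jc_af(coords_prev, coords_next):
    """Mean Euclidean distance over the four EMA sensors (Eq. 3).

    coords_prev, coords_next: arrays of shape (4, 2) holding the X and Y
    coordinates of the LI, TT, TB and TD sensors at the join boundary.
    """
    a = np.asarray(coords_prev, dtype=float)
    b = np.asarray(coords_next, dtype=float)
    # Euclidean distance per sensor (over X, Y), then mean over the 4 sensors.
    return float(np.mean(np.linalg.norm(a - b, axis=1)))

# Toy example with made-up sensor positions (in mm).
prev = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
nxt  = [[0.5, 1.0], [2.0, 3.5], [4.0, 5.0], [6.0, 7.0]]
print(jc_af(prev, nxt))  # 0.25
```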
Note that the handling of the target cost TC was the same as in the "raw" unit selection in TTS ARTIC, i.e. it is composed of prosodic word position features, phonetic context features and symbolic prosodic features [9]. Then we computed the JC_{AF} costs for the fixed unit sequences from the previous step and the mean values mean_{JC_{AF}} and mean_{JC_{MFCC}} for both sequences. For the correlation of the obtained sequences of JC_{MFCC} and JC_{AF}, Pearson's coefficient r, the most commonly used linear correlation coefficient, defined by Eq. 5, was used:

r = \frac{\sum_{i=1}^{m} (JC_{MFCC,i} - mean_{JC_{MFCC}}) (JC_{AF,i} - mean_{JC_{AF}})}{\sqrt{\sum_{i=1}^{m} (JC_{MFCC,i} - mean_{JC_{MFCC}})^{2}} \sqrt{\sum_{i=1}^{m} (JC_{AF,i} - mean_{JC_{AF}})^{2}}},    (5)

where m is the number of unit concatenations. The correlation coefficient r takes values between +1 and −1, where 1 represents a total positive linear correlation, 0 means no linear correlation, and −1 a total negative linear correlation.
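For illustration, both the plain Pearson correlation of Eq. 5 and the variant without the zero-cost pairs (used later in Sect. 4) can be computed with NumPy as sketched below; np.corrcoef implements the same formula.

```python
import numpy as np

def pearson_r(jc_mfcc_seq, jc_af_seq):
    """Pearson correlation coefficient of two join-cost sequences (Eq. 5)."""
    x = np.asarray(jc_mfcc_seq, dtype=float)
    y = np.asarray(jc_af_seq, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def pearson_r_without_zero_pairs(jc_mfcc_seq, jc_af_seq):
    """Same, but drop positions where both costs are zero (originally neighbouring units)."""
    x = np.asarray(jc_mfcc_seq, dtype=float)
    y = np.asarray(jc_af_seq, dtype=float)
    keep = ~((x == 0) & (y == 0))
    return float(np.corrcoef(x[keep], y[keep])[0, 1])
```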
4 Results
The correlation was tested on ten and on one hundred randomly selected sentences. The resulting correlation coefficients are listed in Table 1.

Table 1. Correlation coefficients of JC_{MFCC} and JC_{AF} sequences (Mean, σ, Minimum and Maximum refer to the correlation coefficient).

Number of sentences  Average length in phonemes  Mean    σ       Minimum  Maximum  Zeros (mean)
10                   42                          0.8524  0.0619  0.7384   0.9115   24
100                  41                          0.8109  0.0648  0.6403   0.9204   22
The high mean values of the correlation coefficient r and the small standard deviations in the table show quite a large dependency between the sequences of JC_{MFCC} and JC_{AF}. At first, this was quite surprising, because we expected articulatory features to carry different information than MFCC, but the high values suggested otherwise. However, the reason for this large dependence is hidden in the basic principle of unit selection – the zero cost values for neighbouring units (see Sect. 1.1). The unit selection algorithm tries to select the most suitable units to be concatenated, and the best units are those which were originally neighbours – such units have a join cost equal to zero no matter what features are used for the join cost computation. Those units can noticeably distort the result of the correlation coefficient computation – if two zeros are compared, the correlation is always equal to 1. We found out that the sequences of concatenation costs consist of more than 50% zeros (due to the concatenation of neighbouring units), as shown
in Table 1. It is obvious that this huge amount of zeros could noticeably increase the correlation of the two sequences, so we removed the corresponding zeros from both sequences and recalculated the correlation coefficients – the results are presented in Table 2. Now the values in the table are close to 0, which indicates that AF and MFCC are not correlated.

Table 2. Correlation coefficients of JC_{MFCC} and JC_{AF} sequences without zeros.

Number of sentences  Mean     σ       Minimum  Maximum
10                   0.1755   0.3158  −0.3047  0.5400
100                  −0.0068  0.2497  −0.4872  0.5399
The sequences of JC_{AF} and JC_{MFCC} for two selected sentences are also drawn in Fig. 3 – one sentence with the maximal and the other with the minimal value of the correlation coefficient r. It can be clearly seen from the figures that the sequences of costs are not much correlated – some higher JC_{MFCC} values correspond to lower JC_{AF} values and vice versa (we intentionally connected the non-zero values in the graphs by a line so that the very small correlation of the data is obvious to the reader). These illustrations, together with the results listed in Table 2, show that the MFCC and AF features are not correlated and indicate that the articulatory data are able to bring new information into the speech synthesis process; their usage might therefore improve the overall quality of the synthesized sentences.
Fig. 3. Comparison of JC_{AF} and JC_{MFCC} sequences. The x-axis represents the index of the unit transition in the synthesized sentence.
5 Conclusion
The presented paper tries to answer the question whether real articulatory data bring any new information to the process of speech synthesis when compared to the acoustic MFCC features used in speech synthesis for decades. We have managed to show a certain independence of these two different features by Pearson's correlation coefficient values close to zero for randomly selected sentences. Now, as we have confirmed the usefulness of the articulatory features, we are working on the utilization of AF in our Czech TTS system ARTIC, both as a replacement of MFCC and as an enlargement of the feature space, similarly to [15]. However, the limited amount of recorded data does not allow us to perform acceptable experiments yet, so we are improving the recording and pre-recording process and planning to record more voices with the EMA sensors. Future work also includes experiments with acoustic-to-articulatory mapping to gain more data.

Acknowledgments. This research was supported by the Czech Science Foundation (GA CR), project No. GA16-04420S, and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
References 1. Canevari, C., Badino, L., Fadiga, L.: A new Italian dataset of parallel acoustic and articulatory data. In: INTERSPEECH. ISCA (2015) 2. Hunt, A.J., Black, A.W.: Unit selection in a concatenative speech synthesis system using a large speech database. In: ICASSP, vol. 1, pp. 373–376. IEEE (1996) 3. J˚ uzov´ a, M., Tihelka, D., Matouˇsek, J.: Designing high-coverage multi-level text corpus for non-professional-voice conservation. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 207–215. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7 24 4. J˚ uzov´ a, M., Tihelka, D., Matouˇsek, J., Hanzl´ıˇcek, Z.: Voice conservation and TTS system for people facing total laryngectomy. In: INTERSPEECH. ISCA (2017) 5. Kaburagi, T., Wakamiya, K., Honda, M.: Three-dimensional electromagnetic articulography: a measurement principle. J. Acoust. Soc. Am. 118(1), 428–443 (2005) 6. Leg´ at, M., Matouˇsek, J., Tihelka, D.: A robust multi-phase pitch-mark detection algorithm. INTERSPEECH 1, 1641–1644 (2007) 7. Leg´ at, M., Matouˇsek, J., Tihelka, D.: On the detection of pitch marks using a robust multi-phase algorithm. Speech Commun. 53(4), 552–566 (2011) 8. Liu, Z.C., Ling, Z.H., Dai, L.R.: Articulatory-to-acoustic conversion with cascaded prediction of spectral and excitation features using neural networks. In: INTERSPEECH, pp. 1502–1506. ISCA (2016) 9. Matouˇsek, J., Leg´ at, M.: Is unit selection aware of audible artifacts? In: SSW 2013, Proceedings of the 8th Speech Synthesis Workshop, pp. 267–271. ISCA, Barcelona (2013)
10. Matouˇsek, J., Tihelka, D.: Classification-based detection of glottal closure instants from speech signals. In: INTERSPEECH, pp. 3053–3057. ISCA (2017) 11. Matouˇsek, J., Tihelka, D., Romportl, J.: Current state of Czech text-to-speech system ARTIC. In: Sojka, P., Kopeˇcek, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 439–446. Springer, Heidelberg (2006). https://doi.org/10. 1007/11846406 55 12. Matouˇsek, J., Romportl, J.: Automatic pitch-synchronous phonetic segmentation. In: INTERSPEECH, pp. 1626–1629. ISCA (2008) 13. Richmond, K.: A multitask learning perspective on acoustic-articulatory inversion. In: INTERSPEECH, pp. 2465–2468. ISCA, August 2007 14. Richmond, K., Hoole, P., King, S.: Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In: INTERSPEECH. ISCA (2011) 15. Richmond, K., King, S.: Smooth talking: articulatory join costs for unit selection. In: ICASSP, pp. 5150–5154. IEEE (2016) 16. Stella, M., Stella, A., Sigona, F., Bernardini, P., Grimaldi, M., Fivela, B.G.: Electromagnetic articulography with AG500 and AG501. In: INTERSPEECH, pp. 1316–1320. ISCA (2013) 17. Tihelka, D., Hanzl´ıˇcek, Z., J˚ uzov´ a, M., V´ıt, J., Matouˇsek, J., Gr˚ uber, M.: Current state of text-to-speech system ARTIC: A decade of research on the field of speech technologies. In: TSD. Lecture Notes in Computer Science. Springer, Heidelberg (2018) 18. Tihelka, D., Kala, J., Matouˇsek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: INTERSPEECH, pp. 174–177. ISCA (2010) 19. Toda, T., Black, A., Tokuda, K.: Acoustic-to-articulatory inversion mapping with gaussian mixture model. In: INTERSPEECH. ISCA (2004) 20. Toutios, A., Margaritis, K.: Acoustic-to-articulatory inversion of speech: a review. In: Proceedings of the International 12th TAINN (2003) 21. Wrench, A.: The mocha-timit articulatory database (1999). database available at http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html 22. Wrench, A.A., Richmond, K.: Continuous speech recognition using articulatory data. In: INTERSPEECH, pp. 145–148. ISCA (2000)
QuARTCS: A Tool Enabling End-to-Any Speech Quality Assessment of WebRTC-Based Calls

Martin Meszaros1,2(B), Franziska Trojahn1,2(B), Michael Maruschke2, and Oliver Jokisch2

1 immmr GmbH, Winterfeldtstraße 21, 10781 Berlin, Germany
{martin.meszaros,franziska.trojahn}@immmr.com
www.immmr.com
2 Institute of Communications Engineering, Leipzig University of Telecommunications (HfTL), Gustav-Freytag-Straße 43-45, 04277 Leipzig, Germany
www.hft-leipzig.de

Abstract. Recently, the use of Web Real-Time Communication (WebRTC) technology in communication applications has been increasing significantly. The users of IP-based telephony require excellent audio quality. However, in WebRTC-based audio calls the audio assessment is challenging due to the specific functioning principles of WebRTC, such as security requirements, the diversity of the endpoints and varying client implementations. In this article, we illustrate the challenges of established methods of audio quality assessment with regard to WebRTC and discuss necessary modifications of the measurement technique. We present the Quality Analyzer for Real Time Communication Scenarios (QuARTCS) as a novel method to overcome the measurement shortcomings and demonstrate its basic functioning on preliminary call samples.

Keywords: WebRTC · Audio quality assessment · Opus codec · VoIP

1 Introduction
The popularity of Internet-based communication is steadily increasing. The demands with regard to quality, availability and type of service have adapted to the changes in daily lifestyle: multiple services have to be available on all devices, from any place and at any time. While voice-based telecommunication is no longer limited to telephones but also available on computers and tablets, it still needs to be easy to use for naive users and to provide interoperability with legacy solutions such as the Public Switched Telephone Network (PSTN). In particular, the prevalence of Voice over IP (VoIP) communication services based on WebRTC is rising significantly. Their success relies on good usability
and the highest possible quality. WebRTC enables Internet Protocol (IP) and web-browser-based real-time communication using audio, video and auxiliary data without additional plugins or software installation. By default, WebRTC utilizes the Opus codec, standardized by the Internet Engineering Task Force (IETF) in RFC 6716 [1]. The Opus codec offers Full High Definition (HD) audio coding by supporting a Fullband (FB) frequency range from 20 Hz to 20 kHz with low delays from 5 ms to 66.5 ms. However, the audio quality depends on several network-related parameters such as network bandwidth, packet loss, delay and jitter. Beyond the network-related parameters, WebRTC exhibits its own configuration of process variables, which may influence the call quality too. Consequently, the overall quality measurement, estimation and adjustment in the network are complex tasks. Therefore, the quality has to be monitored to guarantee a satisfying user experience, represented by e.g. the intelligibility of the call partner, the call continuity and the one-way delay. In this article, we compare several frequently used methods of audio quality assessment. Furthermore, we illustrate the challenges that arise for audio assessment in WebRTC-based communication and provide a novel solution approach for both developers and providers. As a result, application developers can identify the reasons for degraded quality by locating the network segments with the biggest influence instead of only detecting degradations in the overall audio quality. In Sect. 2, we summarize established methods of audio assessment within the described application environment. Section 3 is dedicated to the shortcomings in monitoring WebRTC-based calls. Subsequently, we present the QuARTCS method with its functioning principles for the acquisition of degraded audio signals from Secure Real-Time Transport Protocol (SRTP) streams – captured during an active WebRTC audio call at multiple measurement points – in Sect. 4, followed by preliminary results in Sect. 5 and some conclusions.
2 Methods of Speech Quality Assessment

2.1 Subjective Quality Assessment by Listeners
The ITU-Telecommunication Standardization Sector (ITU-T) recommendation P.800 describes several “methods for subjective determination of transmission quality” [3]. Absolute Category Rating (ACR) listening tests represent a commonly used method, in which the degraded audio signal is played to a group of probands, who rate the quality on a five-point opinion scale. The mean value of all individual ratings is called Mean Opinion Score (MOS)-ACR. Besides listening tests, several instrumental methods for the assessment of audio quality exist. Figure 1 illustrates common steps of two communicating VoIP endpoints (not depicted in the figure) as well as the general functioning principle of subjective and objective audio quality assessments. In contrast to the ACR listening test, where only the degraded audio signal is taken into account during the assessment, a reference-based objective assessment algorithm additionally requires the original reference audio sample.
Fig. 1. Principle of subjective and objective audio quality assessment (derived from Maruschke et al. [2]).
2.2 Objective, Instrumental Quality Assessment
The ITU-T standardized several objective assessment methods for audio quality which do not require a human rater, e.g. the well-known Perceptual Evaluation of Speech Quality (PESQ) algorithm [4], resulting in the measure Mean Opinion Score (MOS)-Listening Quality Objective - Narrowband (LQOn). However, this assessment method is limited to Narrowband (NB) speech with a frequency range from 300 Hz to 3.4 kHz (an extension for the assessment of Wideband (WB) speech with a frequency range from 50 Hz to 7 kHz exists with ITU-T recommendation P.862.2 [5]). Meanwhile, real-time audio codecs enable a frequency range up to FB (e.g. the Opus codec), which also led to advanced audio assessment algorithms such as Perceptual Objective Listening Quality Assessment (POLQA) [6]. The perceptual model of POLQA (defined in ITU-T P.863 version 2) supports Super-Wideband (SWB) speech with a frequency range from 50 Hz to 14 kHz, delivering a MOS-Listening Quality Objective - Super-Wideband (LQOsw) measure. However, studies show that POLQA can even be used for a FB assessment of music or voice signals under certain conditions [2,7]. Recently, an update of ITU-T P.863 was introduced with version 3, which supports speech with a frequency range from 20 Hz to 20 kHz [6]. Apart from that, single-ended assessment methods have been developed which do not require a reference sample and can therefore be utilized in a more flexible way, as they only require access to the receiving communication party. The ITU-T P.563 algorithm from 2004 is the first standardized method supporting a single-ended, objective assessment [8]. However, it allows speech quality assessments for NB telephony only. Beyond the chosen assessment method, VoIP calls pose a challenge, since the degraded audio samples have to be acquired after the network transmission.

2.3 Audio Injection and Recording Methods
To guarantee reproducibility and to minimize a possible influence of the characteristics of the transmitted speech material itself, it is advantageous to inject
prerecorded audio samples into the sending endpoint. Especially reference-based assessment methods require well-defined speech samples. According to the ITU-T recommendation P.863.1, an injection of reference samples in the sending endpoint can be done in three ways [9]:
– Acoustically, by an artificial mouth (from a head and torso simulator) connected to the client [10];
– Electrically, by connecting an audio cable from a playback device to a line input of the client;
– Digitally, by using Application Programming Interface (API) functions of the communication software (browser and web application) or methods provided by the operating system.
Additionally, the audio signal has to be recorded to acquire the degraded audio signal at the receiver's side after the network transmission – basically by utilizing one of the methods described for injection, with slight adaptations. However, performing the recording acoustically requires special equipment, and background noise has to be kept to a minimum to avoid additional distortion of the signal. Recording the degraded sample electrically requires an audio output at the receiving endpoint, for example a sound card with a 3.5 mm line output jack, and an external recorder has to be connected to the endpoint output. A drawback of this method lies in the additional Digital-to-Analog Conversion (DAC) at the receiving endpoint and Analog-to-Digital Conversion (ADC) at the recording device. As the connection between both devices is analog, the transmitted signal is prone to interference through radio waves or ground loops [11]. The digital approach of recording the audio is far less portable between different devices, since modifications of the VoIP endpoint might be necessary. However, the advantage of this method lies in the non-modified recording of the degraded sample, which eliminates the described potential signal distortions. For the digital recording of a VoIP call, one can use an alternative method: in general, the encoded voice is transmitted over the network within Real-Time Transport Protocol (RTP) packets. Thus, one can capture the network traffic with a packet sniffer like Wireshark [12]. To acquire the degraded audio signals, the RTP payload has to be extracted and subsequently decoded, as sketched below. This approach allows an audio recording that is independent of the receiving endpoint.
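As an illustration of the payload-extraction idea, the following sketch reads a capture file with scapy and collects the payloads of an unencrypted RTP stream on a known UDP port; the 12-byte fixed RTP header is parsed manually, CSRC lists, header extensions and sequence-number wrap-around are ignored, and the port number is an assumption. (The SRTP decryption performed by QuARTCS itself is not shown here.)

```python
import struct
from scapy.all import rdpcap, UDP  # pip install scapy

def extract_rtp_payloads(pcap_file, udp_port):
    """Collect RTP payloads (e.g. encoded audio frames) from a capture file.

    Only the 12-byte fixed RTP header is parsed for brevity. Returns a list
    of (sequence_number, ssrc, payload_bytes) tuples, sorted by sequence
    number (wrap-around not handled).
    """
    payloads = []
    for pkt in rdpcap(pcap_file):
        if UDP in pkt and pkt[UDP].dport == udp_port:
            data = bytes(pkt[UDP].payload)
            if len(data) < 12:
                continue
            # RTP fixed header: V/P/X/CC, M/PT, sequence number, timestamp, SSRC.
            _vpxcc, _mpt, seq, _ts, ssrc = struct.unpack("!BBHII", data[:12])
            payloads.append((seq, ssrc, data[12:]))
    payloads.sort(key=lambda item: item[0])
    return payloads

# Hypothetical usage: payloads = extract_rtp_payloads("call.pcap", udp_port=5004)
```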
3 Limitations of Call Assessments in WebRTC
WebRTC is standardized by two major standardization bodies, namely the World Wide Web Consortium (W3C), which is responsible for the JavaScript (JS) API, and the IETF for the corresponding protocols [14,15]. Merely a browser that follows the WebRTC protocol specifications and implements the JS API defined by the W3C [14] is necessary. In some cases, though, WebRTC native applications, so-called "non-browsers", are preferable over WebRTC browsers.
Fig. 2. WebRTC triangle architecture [13].
These WebRTC non-browsers do not require implementations of the JS API but must comply with the protocol specification [15]. A typical variant of the WebRTC architecture is depicted in Fig. 2. Two communication paths exist:
– The signaling path between each WebRTC client and the web/signaling server or servers. Each WebRTC client (in this example provided through web browsers) can also be represented by a non-browser.
– The media path between the communication parties.
The web and signaling servers provide the web application, which can be downloaded by the client, and also handle the signaling flow. The signaling protocol is not standardized and various protocols, including standardized and proprietary ones, can be used, but inter-working with the Session Initiation Protocol (SIP) over a signaling gateway must be possible. Therefore, the WebRTC media negotiation must include a representation of the same semantics as contained in the Session Description Protocol (SDP) offers/answers used in SIP-based VoIP communication [15,16]. The clients in a WebRTC call are named WebRTC endpoints and can either be WebRTC browsers or WebRTC non-browsers. Usually, the media path is established directly between two endpoints in terms of a Peer-to-Peer (P2P) connection. Under certain conditions, for example when symmetric Network Address Translation (NAT) is used, the traffic might be relayed through a Traversal Using Relays around NAT (TURN) server [17]. In all cases, the media data must be sent over SRTP for every channel that is established [18,19]. This means that encryption must be used for the media path and that a cipher suite including a key exchange mechanism is necessary. For WebRTC-based communication, a large variety of end devices (endpoints) can be used. Due to the heterogeneous nature of these end devices
in regard to hardware and software (e.g. operating systems, availability of audio jacks), a universal solution for capturing WebRTC audio signals does not currently exist. Hence, the use of a device-independent recording mechanism is strictly required. As described in Subsect. 2.3, digital recording by capturing the network traffic is a suitable method for a device-agnostic acquisition of audio signals. However, capturing the traffic to acquire the degraded audio signals is still not trivial due to the encryption of the WebRTC-originated media streams.
4 QuARTCS Concept and Tooling

4.1 Design Principles
We developed QuARTCS as a tool which allows the acquisition of degraded audio signals from SRTP streams captured during an active WebRTC audio call at multiple measurement points along the network transmission path, including the receiving endpoint. Consequently, the quality influences from one end to any point in the transmission path can be reflected by audio assessments (End-to-Any (E2A) assessment). The data acquisition includes the decryption of the SRTP packets of the captured stream, the payload extraction and the audio segmentation as preparation for an objective quality assessment, e.g. POLQA.

4.2 Functioning Details
Figure 3 illustrates the functioning principle of QuARTCS. First, a reference sample is injected into the WebRTC application running within the WebRTC client on Endpoint A. At one of the endpoints (A or B), the encryption key, cipher suite and SDP messages have to be obtained (referred to as Endpoint/reference information in Fig. 3). The reference information is logged at Endpoint A. Before a call is established, the traffic capturing has to be started. The capturing can be conducted at any network node in the network path between Endpoint A and B or directly at Endpoint B. This can be accomplished with traffic capturing tools such as Wireshark or tcpdump running alongside the WebRTC application on the endpoint [12,20] (the devices need enough processing power to handle the call as well as the capturing simultaneously to prevent negative effects like packet loss). Within the network path, the traffic can be captured by using a switch with mirroring port functionality, i.e., the actual traffic can be recorded with a third device connected to that mirroring port (cf. Meszaros and Maruschke [21]). During the call, the Reference sample can be looped by the sending endpoint to provide several test samples during one call. Consequently, it is encoded and transmitted over the network by Endpoint A, whilst at the same time the traffic gets captured at the chosen capturing point(s). After the call finishes, the logged reference information and the captured traffic, as well as the Reference sample (if applicable) injected into Endpoint A, are delivered to QuARTCS.
Fig. 3. General functioning principle of QuARTCS.
The Traffic filtering function of QuARTCS then filters the SRTP stream in the direction from Endpoint A to Endpoint B according to information acquired from the SDP message by the Information parsing function. A possible filter condition is the Synchronization Source (SSRC) identifier of the stream [22]. Additionally, the Traffic filtering has to incorporate a jitter buffer, resembling the jitter buffer functionality of the receiving endpoint. In the next step, the filtered SRTP stream is passed to the SRTP decryption function, which uses the key and cipher suite provided by the Endpoint/reference information via the Information parsing function and generates an unencrypted RTP stream. After the decryption, the payload – corresponding to the encoded audio – can be extracted from the RTP stream. Afterwards, the encoded audio is decoded using the Audio decoding function (the decoding function has to incorporate an appropriate decoder for the specific audio codec that was used for the communication session). If one Reference sample is looped throughout the communication session by the sending Endpoint A, the result of the Audio decoding function will be a concatenation of the degraded samples. As a result, this concatenation has to be split into multiple Degraded sample files, each with the same length as the Reference sample, which is accomplished by the Audio manipulation function, as sketched below.
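A minimal sketch of the Audio manipulation step is given below; it assumes the decoded concatenated signal and the reference length are available as NumPy data at the same sampling rate, and it ignores the time alignment that a real implementation would need.

```python
import numpy as np

def split_degraded(concatenated, reference_len, n_repetitions):
    """Cut the decoded, concatenated signal into reference-length degraded samples.

    concatenated: 1-D array with the decoded audio of all looped repetitions.
    reference_len: number of samples of the injected reference sample.
    n_repetitions: how many times the reference was looped during the call.
    """
    segments = []
    for i in range(n_repetitions):
        start = i * reference_len
        segment = concatenated[start:start + reference_len]
        # Zero-pad the last segment if the call was cut slightly short.
        if len(segment) < reference_len:
            segment = np.pad(segment, (0, reference_len - len(segment)))
        segments.append(segment)
    return segments
```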
Finally, the Degraded samples are passed to the quality assessment model. In our example, the full-reference quality assessment model POLQA is utilized to estimate MOS-LQOsw values by comparing the Degraded samples with the Reference sample. Alternatively, a single-ended assessment method such as P.563 can be used instead of POLQA if no reference is available.

4.3 Exemplary Speech Assessment
To verify the functioning of QuARTCS, we conducted a preliminary test with the setup depicted in Fig. 4. Endpoint A (the caller's PC) and all intermediary network devices were interconnected via an Ethernet connection supporting a maximum bit rate of 1 Gbit/s. Endpoint B (the callee's smartphone in different positions) was connected to Access Point 1 via a 2.4 GHz IEEE 802.11n wireless connection. The network traffic was captured simultaneously at Switch 2 via a mirroring port as well as directly at Endpoint B with tcpdump. Thereafter, a WebRTC call was established. During the call, a FB reference speech sample from ITU-T recommendation P.501 [23] was injected digitally into Endpoint A and repeated nine times by using API functions of the web browser. The repetition of the reference sample results in nine degraded samples at each capturing point, which can consequently be compared with the reference sample. After finishing the call, the traffic files from the two capturing points, as well as the Endpoint/reference information (cf. Subsect. 4.2) acquired from Endpoint A, were provided to the PC with QuARTCS and POLQA. The nine degraded samples acquired from each of the two capturing points were evaluated with POLQA version 2.4 in SWB mode by comparing them with the injected speech sample as reference.
Fig. 4. Exemplary test design for verifying the functioning principle of QuARTCS.
5 Results and Discussion
Each of the nine samples, acquired with Switch 2 as capturing point, achieved a MOS-LQOsw of 4.75 – the maximum in POLQA version 2.4. Considering
the samples obtained from Endpoint B as the capturing point, two out of nine achieved a lower rating than the maximum possible: sample 4 was rated with a MOS-LQOsw of 3.88, while sample 8 scored 4.56. By analyzing the captured traffic itself, it can be observed that no packet loss occurred in the network segment between Endpoint A and Switch 2. However, in the network segment between Switch 2 and Endpoint B, several packets were lost during the transmission of sample 4. While sample 8 was transmitted, even slightly more packets were lost. The fact that POLQA nevertheless rated this sample higher than sample 4 can be justified on the grounds that most of the packets were lost during a period of silence that was part of the injected sample. The preliminary tests showed that QuARTCS allows an E2A assessment at multiple measurement points in the network transmission path simultaneously, including the receiving endpoint. This concept enables the identification of the network segments which cause the most significant degradations of the audio signal. As the tooling is accomplished by decrypting, extracting and analyzing the payload of the SRTP traffic, QuARTCS allows a quality assessment which is independent of the endpoint characteristics and the WebRTC client implementation. The function blocks of QuARTCS are strictly modular and can easily be adapted to various audio codecs, provided that a standalone decoder is available. The digital acquisition of the degraded audio samples prevents additional degradation due to the measurement method itself. Additionally, QuARTCS is able to pre-process the degraded audio samples (e.g. providing time alignment) to fulfill the requirements of a specific audio assessment method and is not limited to the usage of a certain assessment method. Established methods such as PESQ, POLQA and ITU-T P.563 can be utilized [4,6,8]. Nevertheless, a challenge lies in the determination of the key required for the decryption of the SRTP packets, depending on the key exchange algorithm within the WebRTC application. For instance, if Session Description Protocol Security Descriptions for Media Streams (SDES) is used for key exchange, the key can be obtained from the SDP messages [24]. However, if Datagram Transport Layer Security (DTLS) is utilized, the acquisition of the key might not be possible without a modification of the WebRTC application [25]. Additionally, the calculation of the one-way delay is not yet possible due to the encryption.
6
Conclusions
In this contribution, different assessment methods for voice call quality were compared, and the limitations of a quality assessment in WebRTC-based audio calls were described. Subsequently, we presented QuARTCS as a novel concept and tooling to enable the assessment of WebRTC calls. We described the general working principles of QuARTCS and demonstrated the basic functioning with a preliminary test. Finally, we illustrated the advantages of our approach but also its limitations.
Future studies will address these limitations, namely the calculation of the one-way delay despite the encryption, as well as the determination of the encryption key if DTLS is used for key exchange.
References 1. Valin, J., Vos, K., Terriberry, T.: Definition of the Opus Audio Codec. RFC 6716 (Proposed Standard). RFC. RFC Editor, Fremont, CA, USA, September 2012. https://doi.org/10.17487/RFC6716 2. Maruschke, M., Jokisch, O., Meszaros, M., Trojahn, F., Hoffmann, M.: Quality assessment of two fullband audio codecs supporting real-time communication. In: Ronzhin, A., Potapova, R., N´emeth, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 571–579. Springer, Cham (2016). https://doi.org/10.1007/978-3-31943958-7 69 3. ITU-T: Methods for Objective and Subjective Assessment of Quality-Methods for Subjective Determination of Transmission Quality. REC P.800, August 1996. http://www.itu.int/rec/T-REC-P.800-199608-I/en 4. ITU-T: Methods for Objective and Subjective Assessment of Quality Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for Endto-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. REC P.862, February 2001. http://www.itu.int/rec/T-REC-P.862200102-I/en 5. ITU-T: Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs. REC P.862.2, November 2007. https://www.itu.int/rec/T-REC-P.862.2-200711-I/en 6. ITU-T: Perceptual Objective Listening Quality Assessment (POLQA): An Objective Method for End-to-End Speech Quality Assessment of Wide-Band and Superwide-Band Telephone Networks and Speech Codecs. REC P.863. http:// www.itu.int/rec/T-REC-P.863/en 7. ITU-T Study Group 12: A Subjective ACR LOT Testing Fullband Speech Coding and Prediction by P.863. Contribution SG12-C.22, 19 January 2017. https://www. itu.int/md/T17-SG12-C-0022/en 8. ITU-T: Single-Ended Method for Objective Speech Quality Assessment in NarrowBand Telephony Applications. REC P.563, May 2004. https://www.itu.int/rec/TREC-P.563/en 9. ITU-T: Application Guide for Recommendation ITU-T P.863. REC P.863.1, September 2014. https://www.itu.int/rec/T-REC-P.863.1/en 10. ITU-T: Application Guide for Objective Quality Measurement Based on Recommendations P.862, P.862.1 and P.862.2. REC P.862.3, November 2007. https:// www.itu.int/rec/T-REC-P.862.3/en 11. Digital audio transmission for use in studio, stage or field applications. US4922536 A. Hoque, T. I., 1 May 1990. http://www.google.com/patents/US4922536 12. Wireshark-Community: Wireshark · Go Deep, 30 November 2017. https://www. wireshark.org/. Accessed 13 Dec 2017 13. Maruschke, M., Jokisch, O., Meszaros, M., Iaroshenko, V.: Review of the Opus codec in a WebRTC scenario for audio and speech communication. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 348–355. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7 43
14. Jennings, C., Narayanan, A., Burnett, D., Bergkvist, A.: WebRTC 1.0: Realtime Communication Between Browsers. W3C Editor’s Draft, 30 November 2017. http://w3c.github.io/webrtc-pc/ 15. Alvestrand, H.T.: Overview: Real Time Protocols for Browser-based Applications. Internet-Draft, Fremont CA, USA, 12 November 2017. https://tools.ietf.org/html/ draft-ietf-rtcweb-overview-19 16. Rosenberg, J., Schulzrinne, H.: An Offer/Answer Model with Session Description Protocol (SDP). RFC 3264 (Proposed Standard). RFC. Updated by RFC 6157. RFC Editor, Fremont, CA, USA, June 2002. https://doi.org/10.17487/RFC3264 17. Takeda, Y.: Symmetric NAT Traversal using STUN. Internet-Draft, Fremont CA, USA, June 2003. https://tools.ietf.org/id/draft-takeda-symmetric-nat-traversal00.txt 18. Baugher, M., McGrew, D., Naslund, M., Carrara, E., Norrman, K.: The Secure Real-time Transport Protocol (SRTP). RFC 3711 (Proposed Standard). RFC. Updated by RFCs 5506, 6904. RFC Editor, Fremont, CA, USA, March 2004. https://doi.org/10.17487/RFC3711 19. Perkins, C., Westerlund, M., Ott, J.: Web Real-Time Communication (Web-RTC): Media Transport and Use of RTP. Internet-Draft, Fremont, CA, USA, 18 September 2016. https://tools.ietf.org/html/draft-ietf-rtcweb-rtp-usage-26 20. The Tcpdump Team: Tcpdump/Libpcap Public Repository, 3 September 2017. http://www.tcpdump.org. Accessed 13 Dec 2017 21. Meszaros, M., Maruschke, M.: Verhaltensanalyse von Einplatinencomputern Beim Transcoding von Echtzeit-Audiodaten. In: Elektronische Sprachsignalverarbeitung 2016. Tagungsband Der 27. Konferenz, vol. 81, pp. 237–245 (2016) 22. Lennox, J., Ott, J., Schierl, T.: Source-Specific Media Attributes in the Session Description Protocol (SDP). RFC 5576 (Proposed Standard). RFC. RFC Editor, Fremont, CA, USA, June 2009. https://doi.org/10.17487/RFC5576 23. ITU-T: Test Signals for Use in Telephonometry. REC P.501, March 2017. https:// www.itu.int/rec/T-REC-P.501-201703-I/en 24. Andreasen, F., Baugher, M., Wing, D.: Session Description Protocol (SDP) Security Descriptions for Media Streams. RFC 4568 (Proposed Standard). RFC. RFC Editor, Fremont, CA, USA, July 2006. https://doi.org/10.17487/RFC4568 25. Rescorla, E., Modadugu, N.: Datagram Transport Layer Security Version 1.2. RFC 6347 (Proposed Standard). RFC. Updated by RFCs 7507, 7905. RFC Editor, Fremont, CA, USA, January 2012. https://doi.org/10.17487/RFC6347
Automatic Phonetic Segmentation and Pronunciation Detection with Various Approaches of Acoustic Modeling Petr Mizera and Petr Pollak(B) Faculty of Electrical Engineering, Czech Technical University in Prague, K13131, Technicka 2, 166 27 Praha 6, Czech Republic {mizera,pollak}@fel.cvut.cz www.fel.cvut.cz www.noel.feld.cvut.cz/speechlab
Abstract. The paper describes HMM-based phonetic segmentation realized with the KALDI toolkit, with a focus on the accuracy of various approaches to acoustic modeling: GMM-HMM vs. DNN-HMM, monophone vs. triphone, speaker-independent vs. speaker-dependent. The analysis was performed on the TIMIT database and it proved the contribution of advanced acoustic modeling to the choice of a proper pronunciation variant. For this purpose, a lexicon covering the pronunciation variability among TIMIT speakers was created on the basis of the phonetic transcriptions available in the TIMIT corpus. When the proper sequence of phones is recognized by the DNN-HMM system, more precise boundary placement can then be obtained using basic monophone acoustic models. Keywords: Automatic phonetic segmentation · Pronunciation variability · GMM-HMM · DNN-HMM · KALDI · TIMIT
1
Introduction
Automatic phonetic segmentation is a procedure which determines the boundary locations of particular phones in a given utterance; its usage is necessary in situations when phone boundaries must be found for very large corpora. It is typically used for the creation of subword units for the purpose of concatenative speech synthesis [8,13], for the determination of phone boundaries in large speech corpora for the training of neural-network-based speech recognition systems, or in other applications motivated by a study of pronunciation variability based on the analysis of phonetic segmentation results. Detailed analysis of particular phone realizations can also contribute to the clinical diagnostics of serious diseases which influence speech production, or to an analysis of pronunciation variability in spontaneous or informal speech [9].
The basic solution applied to the determination of phone boundaries is based on the forced alignment of trained HMM models to a given utterance with an available acoustic realization and known content, optimally at the phonetic level. This procedure is routinely used as a significant step during the training of acoustic models of speech recognizers. It can be realized by various toolkits which implement HMM-based speech recognition, e.g. HTK [17], Sphinx [1], RWTH [14], or KALDI [12]. KALDI in particular is nowadays one of the most popular toolkits used world-wide by the speech research community. In this paper, we present an analysis of phonetic segmentation accuracy using the KALDI toolkit. We use the acoustic models available in the standard KALDI TIMIT recipe; however, we work with a more common setup in which the phonetic content is not known. Many previously published approaches based on the TIMIT corpus worked with the available phone boundaries, and many of them used the known phonetic content of each utterance as the input of forced alignment. Finally, we analyzed the accuracy of boundary determination as well as the precision of the choice of the proper pronunciation variant when the transcription is available at the word level and higher pronunciation variability is assumed in the realized utterances.
2
Method
As was mentioned above, the KALDI toolkit is frequently used for speech recognition by the research community and is consequently under continuous development. Currently, it covers many contemporary advanced techniques used within the particular modules of ASR, including advanced techniques of acoustic modeling, mainly DNN-based systems. However, the usage of KALDI for speech segmentation is not so frequent [7]. The availability of advanced acoustic modeling techniques in KALDI was the main motivation for this study, which analyzes how they benefit the precision of phonetic segmentation. 2.1
Phonetic Segmentation
Concerning the boundary determination, we used the rather standard approach of forced alignment. Its implementation in the KALDI toolkit allowed us to study various approaches to acoustic modeling used in the typical solutions (“recipes”) available within the KALDI distribution. Generally, we selected AM models which were suitable for generating targets for DNN-HMM training. We experimented with the frequently used GMM-HMM models [12], i.e. the basic and simplest AM based on monophones (marked in the following text by the acronym mono), a speaker-independent triphone AM using basic short-time cepstral features (acronym tri1), a speaker-independent triphone model with LDA features (acronym tri2), and a speaker-dependent triphone AM obtained by fMLLR and speaker-adaptive training (acronym tri3). Finally, the most advanced AM used in this work was a DNN-HMM model (acronym dnn) with a neural network topology consisting of an input layer with 440 units followed by 6 hidden layers with 2048 neurons per layer. The process of building the DNN-HMM
system started with the initialization of the hidden layers by Restricted Boltzmann Machines and was concluded by frame cross-entropy training [4]. More advanced AM models based on time-delay neural networks with the lattice-free version of maximum mutual information, or on long short-term memory networks [11], were not investigated. They typically help to improve the WER in speech recognition, but they are not so well suited for the determination of phone boundaries using forced alignment1. Speech features were computed in accordance with the setup used in the KALDI recipes. As basic cepstral features (used in AMs mono and tri1), we used 13 mel-frequency cepstral coefficients including the zeroth cepstral coefficient, computed for short-time frames with a length of 25 ms shifted over the signal with a step of 10 ms. Cepstral mean normalization was applied to this 13-element vector of static short-time features, and the features were completed by delta (dynamic) and delta-delta (acceleration) features to the final length of 39. LDA features (used in AM tri2) were computed from the context obtained by splicing 5 short-time feature vectors to both sides, followed by LDA and MLLT realizing decorrelation and the reduction of the dimension to 40. For AM tri3, this was followed by feature-space maximum likelihood linear regression (fMLLR) per speaker (also called speaker-adaptive training, SAT). Finally, these 40-dimensional fMLLR features with mean and variance normalization, extended with both-side context, were used as the input of the dnn AM. 2.2
Impact of Pronunciation Lexicon
The accuracy of the forced-alignment technique used for phonetic segmentation relies on the quality of its inputs. Of course, this means the quality of the acoustic data; however, it also depends strongly on the accuracy of the input phonetic content. The phonetic content of utterances, usually transcribed at the orthographic level, can be obtained by grapheme-to-phoneme conversion or from a pronunciation lexicon, which can also cover pronunciation variability [2] by including more pronunciation variants. This approach must definitely be used when phone boundaries should be determined for spontaneous and informal speech, for a higher diversity of language dialects, as well as in other situations when the level of pronunciation variability is rather high [18]. Such a lexicon can be obtained manually (for some very specific situations) or automatically (by extending regular pronunciations with particular phone substitutions or reductions on the basis of defined rules [9,10]). In the presented work, we analyzed the accuracy of phone boundary determination in the case when the lexicon contained more pronunciation variants. For this purpose, we created a lexicon containing all pronunciations which appeared within the phonetic transcription of the TIMIT corpus (further called timit-variants). It was obtained from the available transcriptions at the word and phone level, i.e. as a new word pronunciation we took the sequence of all phones which lay within the word boundaries. Finally, the significant majority
1 Discussed in the KALDI community at https://groups.google.com/forum/#!topic/kaldi-help/cSAm5iXGhZo.
of words from TIMIT had more than one pronunciation, so we could also analyze the ability of the used AMs to recognize the correct pronunciation variant for particular word realizations. In total, we obtained 19184 pronunciations for 6256 words; moreover, in some cases the number of pronunciation variants was very high (22 words have more than 20 pronunciations), as shown in more detail in Table 1. Using the TIMIT corpus, this lexicon should simulate a realistic situation of phonetic segmentation of informal speech, when each word can have more pronunciations due to the pronunciation variability of the informal speaking style.
Table 1. Lexicon timit-variants – statistics.
No. of pronunciation variants:  1    2     3–5   6–10  11–20  21–50  65
No. of words:                   631  3372  1516  637   78     21     1
When a pronunciation lexicon contains such a high number of pronunciation variants (20 and more), the correct detection of the proper pronunciation variant is a very important task, and phonetic segmentation in this setup can also serve to detect the proper pronunciation variants within an analyzed utterance. It can then play an important role in research focused on pronunciation variability, and it was also analyzed in this work.
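To make the construction of the timit-variants lexicon concrete, the following Python sketch collects, for every word, all phone sequences lying within its boundaries from TIMIT word-level (.wrd) and phone-level (.phn) transcriptions. It is only an approximation of the procedure described above; the normalization details and file locations are assumptions, not taken from the paper.

```python
# Illustrative sketch of collecting a timit-variants style lexicon from TIMIT
# word-level (.wrd) and phone-level (.phn) transcription files, whose lines
# have the form "start end label" (sample indices and a label).
from collections import defaultdict

def read_segments(path):
    """Read 'start end label' lines of a TIMIT .wrd or .phn file."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split()
            segments.append((int(start), int(end), label))
    return segments

def collect_variants(utterances, lexicon=None):
    """utterances: iterable of (wrd_path, phn_path) pairs."""
    lexicon = lexicon if lexicon is not None else defaultdict(set)
    for wrd_path, phn_path in utterances:
        words = read_segments(wrd_path)
        phones = read_segments(phn_path)
        for w_start, w_end, word in words:
            # take the sequence of all phones lying within the word boundaries
            pron = tuple(p for p_start, p_end, p in phones
                         if p_start >= w_start and p_end <= w_end)
            if pron:
                lexicon[word].add(pron)
    return lexicon
```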
3
Experiments
The experimental part of this research was focused on the analysis of phonetic segmentation accuracy from the following three aspects: the optimum choice of proper acoustic model, the impact of extended pronunciation lexicon, and finally, the accuracy of pronunciation variant detection when more variants are available in the lexicon. 3.1
Used Tools and Speech Databases
All experiments were realized on the basis of the TIMIT corpus [3], which is often used as a standard for the evaluation of phoneme classification, phoneme recognition, or phonetic segmentation for English. As mentioned above, the designed acoustic model systems were built using the KALDI toolkit.
Table 2. TIMIT data sets used in the presented evaluations.
Data set            Speakers  Sentences  Hours  Num. words  Num. boundaries
TRAIN               462       3696       3.14   30132       –
CORE test set       24        192        0.16   1570        7215
COMPLETE test set   168       1344       0.81   11025       50754
We started with the standard s5 recipe for TIMIT available in the KALDI distribution and optimized it with regard to improving the accuracy of the automatic phonetic segmentation task. The published recipe has been designed mainly for the phoneme recognition task and works with the reference train and CORE test sets. For the phonetic segmentation task, we generated the TIMIT COMPLETE test set with 168 speakers and 1344 test sentences. Only the phonetically compact sentences (marked as SX sentences) and the phonetically diverse ones (marked as SI sentences) were used for our experiments. The TIMIT phoneme set was reduced from 61 to 48 final phonemes, which were used for acoustic modeling. A further reduction to 39 phones was finally used for boundary scoring, as is standard for English in the KALDI recipes as well as in the ASR systems of many other authors [6]. The HMM topology consisted of 3-emitting-state models for non-silence phonemes and 5-emitting-state models for silence, and the direct phoneme transcription, which also included silence marks, was used for training the AMs. Therefore silence appeared in the training graphs and silence boundaries were scored; optional silence was not used in our experiments. Finally, we used 50754 boundaries from the COMPLETE test set and 7215 boundaries from the CORE test set for our evaluations. The summary of the used data sets is presented in Table 2. 3.2
Evaluation Criteria
The evaluation of phonetic segmentation accuracy was done using criteria describing both the accuracy at the level of phone recognition correctness and the accuracy of phone boundary placement (similarly to other authors, e.g. [5,7]). First, the phone recognition correctness is evaluated in the standard way using the Phone Error Rate computed on the basis of the Levenshtein distance as
$$PER = \frac{S + D + I}{N} \cdot 100 \quad (1)$$
where N is the number of phones in the reference and S, D, and I are the numbers of substitutions, deletions, and insertions in the aligned data. It is also suitable to evaluate the Phone Correctness computed as
$$PCorr = \frac{N - S - D}{N} \cdot 100 \quad (2)$$
because the evaluation of the accuracy of particular boundary placement makes sense just for correctly recognized phones. For further evaluations, all deleted phones are removed from the reference transcript, inserted phones from the aligned transcript, and substituted phones are removed from both of them. The cleared transcripts are then used for the evaluation of boundary placement accuracy. When we have two pairs of reference and transcribed boundaries for each phone realization, i.e. $beg_{ph,ref}[i]$ and $end_{ph,ref}[i]$ vs. $beg_{ph}[i]$ and $end_{ph}[i]$, the following two criteria, Phone Beginning Error (PBE) and Phone End Error (PEE), can be defined as
$$PBE_{ph}[i] = \left| \, beg_{ph}[i] - beg_{ph,ref}[i] \, \right| , \quad (3)$$
$$PEE_{ph}[i] = \left| \, end_{ph}[i] - end_{ph,ref}[i] \, \right| . \quad (4)$$
The accuracy of phone boundary placement can be approximated using the rate of phone boundary errors below a chosen threshold, which can be defined as
$$PBE_{ph,thr} = \frac{\sum_{i=1}^{N_{ph}} \left( PBE_{ph}[i] < thr \right)}{N_{ph}} \quad (5)$$
where ph is the phone/class identification, $N_{ph}$ is the number of phone/class realizations, and thr is the value of the chosen error threshold. The same procedure is applied for the computation of $PEE_{ph,thr}$. The threshold values used for the realized evaluations within this work were 5, 10, 20, or 30 ms, respectively. All of these criteria can be computed with basic statistics for all particular phones; however, more frequent is their evaluation over defined phone classes, which are generally language independent. We used the phone classes for English according to [5], i.e. VOW - vowels, GLI - semivowels and glides, VFR - voiced fricatives, UFR - unvoiced fricatives, NAS - nasals, STP - stops, UST - unvoiced stops, and SIL - silence. Finally, we define PronER (Pronunciation Error Rate) to evaluate the pronunciation detection accuracy
$$PronER = \frac{S}{N} \cdot 100 \quad (6)$$
where N is the total number of words in the reference set and S is the number of incorrectly recognized (substituted) pronunciation variants.
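The criteria above can be implemented directly; the following Python sketch computes PER (Eq. 1), the thresholded boundary error rate (Eq. 5) and PronER (Eq. 6). The Levenshtein alignment is a plain dynamic-programming version and may differ in details (e.g. tie-breaking) from the KALDI-based scoring actually used in the paper.

```python
# Sketch of the evaluation criteria defined above.

def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) between two phone lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)                      # deletions only
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)                      # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = ref[i - 1] != hyp[j - 1]
            c, s, d, ins = dp[i - 1][j - 1]
            sub = (c + diff, s + diff, d, ins)
            c, s, d, ins = dp[i - 1][j]
            dele = (c + 1, s, d + 1, ins)
            c, s, d, ins = dp[i][j - 1]
            inse = (c + 1, s, d, ins + 1)
            dp[i][j] = min(sub, dele, inse)
    _, S, D, I = dp[n][m]
    return S, D, I

def per(ref, hyp):
    """Eq. (1): Phone Error Rate in percent."""
    S, D, I = edit_counts(ref, hyp)
    return (S + D + I) / len(ref) * 100

def pbe_thr(pbe_values, thr):
    """Eq. (5): share of boundaries whose beginning error is below thr."""
    return sum(e < thr for e in pbe_values) / len(pbe_values)

def pron_er(ref_prons, hyp_prons):
    """Eq. (6): rate of incorrectly selected pronunciation variants."""
    wrong = sum(r != h for r, h in zip(ref_prons, hyp_prons))
    return wrong / len(ref_prons) * 100
```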
3.3
Results
3.3.1 Direct Phonetic Segmentation As the TIMIT database contains transcriptions at the phone level, it first enabled us to evaluate the accuracy of phonetic segmentation with maximally precise inputs of the HMM-based forced alignment. In fact, this is the optimum input of forced alignment with 100% correct phonetic content, when no phone needs to be recognized and the PER is equal to 0%. The obtained results are given in Table 3. Similarly to several other works (e.g. [7] or [16]), the best results were obtained for the simplest monophone AM, for both the core and complete test sets. The slightly lower accuracy of the triphone- and DNN-based AMs might be caused by the fact that their input features are taken from a larger context, which leads to higher uncertainty in the determination of a boundary position. Furthermore, the speaker-dependent AMs are probably estimated with lower accuracy due to the limited amount of data per speaker in the TIMIT corpus. Concerning the monophone AM, we looked for its optimized setup. As in other published works [7], it was confirmed that a smaller number of Gaussian mixtures per state gave better results. The best ones were achieved for 2 mixtures per state, see Table 4. The number in the acronyms mono144, mono288, etc. in Tables 3 and 4 represents the number of Gaussian components in the whole HMM, e.g. 288 means 288 components for 2 mixtures per state, 3 emitting states per monophone, and 48 phones in the given HMM (2 × 3 × 48).
Fig. 1. Phone Beginning Error PBE for particular phone classes: blue - monophone system, red - DNN-based system. (Color figure online)
Table 3. Results of direct phonetic segmentation, PER = 0, PCorr = 100.
        CORE SET                      COMPLETE SET
        5 ms   10 ms  20 ms  30 ms    5 ms   10 ms  20 ms  30 ms
mono    29.16  52.79  83.08  93.00    29.00  52.71  82.79  92.63
tri1    27.80  51.21  81.69  92.82    27.84  50.89  81.40  92.12
tri2    27.40  49.55  79.72  91.45    27.10  48.96  79.27  90.91
tri3    27.42  49.34  79.18  91.24    27.18  48.74  78.41  90.36
dnn     27.73  48.87  78.84  90.77    27.11  48.49  78.32  90.09
Table 4. Optimization of the monophone AM for direct phonetic segmentation (PER = 0, PCorr = 100).
           CORE SET                      COMPLETE SET
           5 ms   10 ms  20 ms  30 ms    5 ms   10 ms  20 ms  30 ms
mono144    31.05  54.57  82.51  92.17    31.37  54.67  81.90  91.73
mono288    31.68  55.80  84.70  93.79    32.02  56.39  84.55  93.11
mono432    30.45  54.73  84.74  93.74    31.03  55.32  84.46  93.06
mono720    29.76  53.50  83.53  93.35    29.95  53.70  83.48  92.99
mono1008   29.16  52.79  83.08  93.00    29.00  52.71  82.79  92.63
mono1440   28.18  51.50  81.80  92.82    28.13  51.31  81.80  92.30
Finally, the distribution of the PBE values for particular phone classes is presented in Fig. 1. The bars describe the distribution of PBE between the 0.25 and 0.75 percentiles. Significantly worse results are observed for the DNN system; however, the significant increase of the error is observed mainly for silence, while the deterioration within the phone classes is not so critical.
3.3.2 Phonetic Segmentation with Pronunciation Variability The second analysis describes the phonetic segmentation when the exact phone sequence is not available and the phonetic content is obtained from a pronunciation lexicon. This is the most frequent approach to obtaining the phonetic content of an utterance; however, the core issue is how well the variability of pronunciation is covered in the lexicon and how the proper choice of the word pronunciation variant influences the accuracy of phonetic segmentation.
Table 5. Phonetic segmentation with the canonic lexicon.
                          PER    PCorr  5 ms   10 ms  20 ms  30 ms
CORE        mono          32.58  71.43  23.94  43.54  72.39  85.82
            dnn           31.88  71.45  23.67  40.14  65.28  80.60
            mono288-dnn   31.88  71.45  25.78  45.37  72.01  84.54
COMPLETE    mono          31.15  72.28  23.92  43.23  72.34  85.78
            dnn           30.52  72.28  23.43  40.32  65.83  80.59
            mono288-dnn   30.52  72.28  26.39  46.38  72.71  84.93
Table 6. Phonetic segmentation with the TIMIT-variant lexicon.
                          PER    PCorr  5 ms   10 ms  20 ms  30 ms
CORE        mono          12.24  89.69  28.77  51.61  82.26  92.58
            dnn            9.58  92.03  27.64  48.55  78.03  90.05
            mono288-dnn    9.58  92.03  31.28  54.94  83.81  93.09
COMPLETE    mono          12.06  89.62  28.82  52.11  82.25  92.30
            dnn           10.00  92.06  27.16  48.28  77.73  89.55
            mono288-dnn   10.00  92.06  31.91  55.98  84.17  92.93
We realized the experiments with 3 pronunciation lexica: the first lexicon contained just the canonic pronunciations, the second one contained all pronunciation variants realized by the speakers in the TIMIT corpus, and the third one was based on merging the previous two lexica. The obtained results are shown in Tables 5, 6 and 7, and a significant decrease of PER was observed when the lexicon contained pronunciation variants. Further, the usage of a more advanced AM (the DNN-based one) contributed to a further decrease of the achieved PER below 10%. Consequently, this means an increase of PCorr, i.e. more than 92% of all phones were correctly identified; however, the accuracy of boundary determination slightly decreased when the DNN-based system was used. On the other hand, when the recognized phone sequence is realigned with the optimized monophone system with 288 Gaussian components (acronym mono288-dnn), both the best PER and the best boundary placement accuracy were achieved [15].
Table 7. Phonetic segmentation with the canonic lexicon extended by TIMIT variants.
                          PER    PCorr  5 ms   10 ms  20 ms  30 ms
CORE        mono          12.43  89.48  28.79  51.69  82.25  92.64
            dnn            9.76  91.88  27.65  48.51  77.99  90.00
            mono288-dnn    9.76  91.88  31.33  55.00  83.84  93.12
COMPLETE    mono          12.40  89.28  28.83  52.08  82.25  92.31
            dnn            9.28  92.17  27.12  48.22  77.63  89.44
            mono288-dnn    9.28  92.17  31.92  55.97  84.14  92.93
3.3.3 Pronunciation Recognition In the end, we analyzed the correctness of the pronunciation variant selection mentioned above. In fact, it was already partly quantified by the decrease of PER described in the previous section; however, for many words we had a rather high number of pronunciation variants, and the ability to select the correct pronunciation variant could be a very important feature of such a system. From the results described in Table 8, we can observe a significant decrease of PronER (Pronunciation Error Rate) when more advanced acoustic modeling and a lexicon covering pronunciation variants are used. The best results were obtained with the DNN-based system: while 76.34% was obtained for the basic monophone system on the CORE test set, 31.89% was achieved for the DNN-based system. The contribution of GMM-HMM systems with triphone-based models was proven too. The same trend was also observed for the COMPLETE set.
Table 8. Pronunciation variant recognition.
                     canonic          timit            canonic+variants
                     PER    PronER    PER    PronER    PER    PronER
CORE        mono     32.58  76.34     12.24  39.48     12.43  40.18
            tri1     32.80  76.28     11.49  37.82     11.74  38.46
            tri2     32.55  76.28     11.16  35.97     11.31  36.54
            tri3     32.46  76.28     10.24  33.48     10.42  34.06
            dnn      31.88  76.34      9.58  31.44      9.76  31.89
COMPLETE    mono     31.15  74.22     12.06  40.39     12.40  41.44
            tri1     31.79  74.21     11.89  37.06     11.45  37.87
            tri2     31.45  74.22     11.17  35.77     11.00  36.60
            tri3     31.30  74.21     10.75  33.82     10.23  34.56
            dnn      30.52  74.22     10.00  31.46      9.28  32.19
4
Conclusions
The implementation of HMM-based phonetic segmentation realized with the KALDI toolkit was described in this paper, together with an analysis of the contribution of various acoustic modeling approaches to the final accuracy of phone boundary determination. The evaluations were performed on the TIMIT database and they proved the contribution of advanced acoustic modeling to the choice of the proper pronunciation variant. We achieved more than 92% correctness of phone recognition within forced alignment with the DNN-HMM system, together with an improvement of phone boundary placement realized in a second step by the optimized monophone GMM-based system; 83.84% of phone beginning boundaries were determined with an error smaller than 20 ms, and for an error smaller than 30 ms it was 93.12%. These results were obtained without any further boundary correction, as it is not currently required by our application, and they are comparable to results obtained without any boundary refinement published by other authors. For the purpose of pronunciation variability modeling, the lexicon covering the pronunciation variants of particular words among TIMIT speakers was created on the basis of the phonetic transcriptions available in this corpus. Acknowledgments. The research described in this paper was supported by internal CTU grant SGS17/183/OHK3/3T/13 “Special Applications of Signal Processing”.
References 1. CMUSphinx: Open source speech recognition toolkit. http://cmusphinx.github.io 2. Brunet, R.G., Murthy, H.A.: Pronunciation variation across different dialects for English: a syllable-centric approach. In: 2012 National Conference on Communications (NCC) (2012) 3. Garofolo, J.S., et al.: TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Linguistic Data Consortium, Philadelphia (1993) 4. Ghoshal, A., Povey, D.: Sequence-discriminative training of deep neural networks. In: Proceedings of the INTERSPEECH, Lyon, France (2013) 5. Kahn, A., Steiner, I.: Qualitative evaluation and error analysis of phonetic segmentation. In: 28. Konferenz Elektronische Sprachsignalverarbeitung, Saarbr¨ ucken, Germany, pp. 138–144 (2017) 6. Lee, K.F., Hon, H.W.: Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Audio Speech Lang. Process. 37(11), 1641–1648 (1989) 7. Matouˇsek, J., Kl´ıma, M.: Automatic phonetic segmentation using the KALDI toolkit. In: Ekˇstein, K., Matouˇsek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 138–146. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-642062 16 8. Matouˇsek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10. 1007/978-3-540-39398-6 41
9. Mizera, P., Pollak, P., Kolman, A., Ernestus, M.: Impact of irregular pronunciation on phonetic segmentation of Nijmegen corpus of casual Czech. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2014. LNCS (LNAI), vol. 8655, pp. 499–506. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10816-2 60 10. Nouza, J., Silovsk´ y, J.: Adapting lexical and language models for transcription of highly spontaneous spoken Czech. In: Sojka, P., Hor´ ak, A., Kopeˇcek, I., Pala, K. (eds.) TSD 2010. LNCS (LNAI), vol. 6231, pp. 377–384. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15760-8 48 11. Peddinti, V., Wang, Y., Povey, D., Khudanpur, S.: Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 25(3), 373–377 (2018) 12. Povey, D., et al.: The Kaldi speech recognition toolkit. In: Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, ASRU 2011 (2011) 13. Rendel, A., Sorin, A., Hoory, R., Breen, A.: Toward automatic phonetic segmentation for TTS. In: Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, pp. 4533–4536 (2012) 14. Rybach, D., et al.: The RWTH Aachen university open source speech recognition system. In: Proceedings of Interspeech 2009 (2009) 15. Stolcke, A., Ryant, N., Mitra, V., Yuan, J., Wang, W., Liberman, M.: Highly accurate phonetic segmentation using boundary correction models and system fusion. In: Proceedings of ICASSP, Florence, Italy (2014) 16. Toledano, D.T., G´ omez, L.A.H., Grande, L.V.: Automatic phoneme segmentation. IEEE Trans. Speech Audio Process. 11(6), 617–625 (2003) 17. Young, S., et al.: The HTK Book, Version 3.4.1. Cambridge (2009) 18. Yuan, J., Ryant, N., Liberman, M., Stolcke, A., Mitra, V., Wang, W.: Automatic phonetic segmentation using boundary models. In: Proceedings of INTERSPEECH, Lyon, France, pp. 2306–2310 (2013)
Improving Neural Models of Language with Input-Output Tensor Contexts Eduardo Mizraji1(&), Andrés Pomi1, and Juan Lin1,2 1
Group of Cognitive Systems Modeling, Biophysics Section, Facultad de Ciencias, Universidad de la República, Iguá 4225, 11400 Montevideo, Uruguay [email protected], [email protected], [email protected] 2 Department of Physics, Washington College, Chestertown, MD 21620, USA
Abstract. Tensor contexts enlarge the performances and computational powers of many neural models of language by generating a double filtering of incoming data. Applied to the linguistic domain, its implementation enables a very efficient disambiguation of polysemous and homonymous words. For the neurocomputational modeling of language, the simultaneous tensor contextualization of inputs and outputs inserts into the models strategic passwords that route words towards key natural targets, thus allowing for the creation of meaningful phrases. In this work, we present the formal properties of these models and describe possible ways to use contexts to represent plausible neural organizations of sequences of words. We include an illustration of how these contexts generate topographic or thematic organization of data. Finally, we show that double contextualization opens promising ways to explore the neural coding of episodes, one of the most challenging problems of neural computation. Keywords: Matrix memories · Tensor contexts · Semantic spaces · Episodic memory · Word strings
Gradually, it saw itself (like us) imprisoned in this sonorous web of Before, After, Yesterday, While, Now, Right, Left, Me, You, Those, Others. From “The Golem” by J.L. Borges
1 Introduction The procedures developed by the human brain to organize sequences of semantic elements that create meaningful phrases are yet an unsolved problem. Such a sequence can be metaphorically congruent to the search for the exit of an intricate labyrinth, with myriad galleries connecting thousands of semantic modules. In this labyrinth, the output of a module is specifically guided toward its next module, a process that generates a completely non-random sequence of words. This controlled guidance can be due to the existence of specific “keys” that select and open the next appropriate semantic target. © Springer Nature Switzerland AG 2018 A. Karpov et al. (Eds.): SPECOM 2018, LNAI 11096, pp. 430–440, 2018. https://doi.org/10.1007/978-3-319-99579-3_45
Taking into account the extremely large number of possibilities offered by the semantic network, the possibility of building rapid meaningful phrases in natural language strongly suggests that these output keys explore all their potential targets in parallel. An interesting approach would be to consider the creation of a meaningful phrase as analogous to the production of a sequence of motor acts oriented toward a goal [1– 5]. This analogy would assume that before the construction of a phrase there exists an objective that induces a layout over which the words are organized. In this case the goal is a communicational task, and a complete discourse can be structured by a set of subtargets that organize their parts. In this work we shall try to model the emergence of different kinds of language organization, by representing semantic modules with matrix associative memories. The many remarkable properties of these matrix memories are described in [6–10]. As “mesoscopic models” they connect algorithms operating on complex symbolic data to the neuro-dynamic level [11]. In this formalism, to find a path in the labyrinth of semantic modules would mean that outputs of matrix associative memories become inputs of particular memories that produce the words in the general layout of the phrase that is being created. Our contribution aims to fill this framework by showing that the modulation of inputs and outputs of matrix memories by tensor contexts provides a procedure to explain how coherent sequences of words can be created. In addition, this formalism implies the possibility of building thematic clusters in semantic spaces.
2 Basic Models In what follows we describe some properties of matrix associative memories and how tensor contexts enlarge their computational abilities. 2.1
Matrix Associative Memories
A matrix memory associates an m-dimensional column input vector $f_i$ with an n-dimensional output vector $g_i$. Kohonen [10] shows that a memory can be characterized by the set
$$Mem = \{(g_1, f_1), (g_2, f_2), \ldots, (g_Q, f_Q)\}. \quad (1)$$
This “learning set” represents the data to be stored in a matrix memory M. To find the appropriate structure of this matrix, define two partitioned matrices $G = [\,g_1 \; g_2 \cdots g_Q\,]$, $F = [\,f_1 \; f_2 \cdots f_Q\,]$, and represent the associations between the Q pairs of vector patterns by the matrix equation $G = MF$. Let $In = \{1, 2, \ldots, Q\}$ be the set of indexes of the stored pairs. Under this condition, the best solution in the sense of least squares is given in terms of the pseudoinverse $F^{+}$:
$$M = G F^{+}. \quad (2)$$
In the extremely simple case of an orthonormal set of inputs $\{f_i\}$, $i = 1$ to $Q$, Eq. (2) admits the closed expression
$$M = \sum_{i=1}^{Q} g_i f_i^{T}. \quad (3)$$
For this matrix memory the recall operates as follows:
$$M f_k = \sum_{i=1}^{Q} g_i \langle f_i, f_k \rangle, \quad (4)$$
with the scalar product being $\langle f_i, f_k \rangle = \delta_{ik}$ ($\delta_{ik}$ is the Kronecker delta); hence if the index $k \in In$, the recall is perfect, $M f_k = g_k$. 2.2
Input Tensor Contexts
Imagine we need to model a neural network capable of disambiguating homonymic or polysemic words. Networks with hidden layers trained with backpropagation are the classical devices for dealing with this kind of problem [12]. However, in such an approach we generally lose the possibility of a transparent mathematical theory allowing us to predict what happens during training as well as the final network structure. This opacity was the main motivation to develop a “transparent connectionist” alternative [13]. This alternative uses a kind of vector symbolic architecture based on tensor contextualization [11, 14, 15]. Let $f_i$ be one homonymic word, associated with two vectors $g_{i1}$ and $g_{i2}$ for two completely non-correlated concepts. For instance, the input can represent the word “bank” and one output would be “money” and the other would be “sand”. To retain the matrix format of the associative memory, we integrate the input with two vector contexts $p_{i1}, p_{i2} \in \mathbb{R}^{h}$ using the Kronecker product $\otimes$, a tensor procedure adapted to the operations of matrix algebra [16]. In our example, we could consider that the first context concerns finances and the second geography. The segment of a memory in our example can be expressed as:
$$M_i = g_{i1} (p_{i1} \otimes f_i)^{T} + g_{i2} (p_{i2} \otimes f_i)^{T}. \quad (5)$$
Consequently, when the memory receives an input and the corresponding context, the selection of the output happens via two scalar products:
$$M_i (p_{i2} \otimes f_i) = g_{i1} \langle p_{i1}, p_{i2} \rangle \langle f_i, f_i \rangle + g_{i2} \langle p_{i2}, p_{i2} \rangle \langle f_i, f_i \rangle. \quad (6)$$
In a situation where both the inputs and the contexts are orthonormal, we have a resolution of the ambiguity,
$$M_i (p_{i2} \otimes f_i) = g_{i2}. \quad (7)$$
This format can be generalized [14, 17, 18] to a global memory module composed of a variety of specialized sub-modules, each having the required complexity for the contextualization of its inputs:
$$M = \sum_{i} M_i. \quad (8)$$
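The disambiguation mechanism of Eqs. (5)–(7) can be reproduced numerically in a few lines; the sketch below uses NumPy with arbitrary orthonormal codes for the words and contexts of the “bank” example. These encodings are illustrative choices, not anything prescribed by the model.

```python
# Minimal numerical sketch of the context-dependent memory of Eqs. (5)-(7).
import numpy as np

def unit(dim, k):
    v = np.zeros(dim)
    v[k] = 1.0
    return v

f_bank = unit(4, 0)                        # homonymic input word "bank"
g_money, g_sand = unit(3, 0), unit(3, 1)   # two unrelated outputs
p_fin, p_geo = unit(2, 0), unit(2, 1)      # finance / geography contexts

# Eq. (5): M_i = g_i1 (p_i1 (x) f_i)^T + g_i2 (p_i2 (x) f_i)^T
M = (np.outer(g_money, np.kron(p_fin, f_bank)) +
     np.outer(g_sand,  np.kron(p_geo, f_bank)))

# Eq. (7): with orthonormal contexts the ambiguity is resolved
out = M @ np.kron(p_geo, f_bank)
print(np.allclose(out, g_sand))            # True
```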
2.3
Input-Output Contexts
We can extend the previous approach by modulating both inputs and outputs with vector contexts. This approach leads to memory matrices with the following general structure:
$$H = \sum_{i,j,k} (p'_{ik} \otimes g_{ij})(p_{ik} \otimes f_{ij})^{T}. \quad (9)$$
From the properties of Kronecker products, the H matrix admits some interesting alternative representations. We illustrate two of them:
$$H = \sum_{i,j,k} \left( p'_{ik} p_{ik}^{T} \right) \otimes \left( g_{ij} f_{ij}^{T} \right), \quad (10)$$
$$H = \sum_{i,j,k} \left( p'_{ik} \otimes I_{\dim(g_{ij})} \right) \left( g_{ij} f_{ij}^{T} \right) \left( p_{ik} \otimes I_{\dim(f_{ij})} \right)^{T}. \quad (11)$$
Note that inputs $p_{uv} \otimes f_{ab}$ with stored patterns display outputs given by
$$H (p_{uv} \otimes f_{ab}) = p'_{uv} \otimes g_{ab}. \quad (12)$$
These outputs are prepared to enter as inputs to a similar memory H’ with this particular pair [context - pattern] stored in its database. Memories with this structure accept many representational and computational potentialities to process the operations displayed by natural languages [19, 20]. In the next Sections we shall describe some of these operations.
3 Deterministic Semantic Strings In his “Principles of Psychology” (Vol. II, Chap. XXVI) James [21] writes that voluntary acts are based on consolidated memory traces created by previous involuntary acts. Similarly, the voluntary creation of phrases has as prerequisite the existence of word associations in previously fixed memories–developed after experiential contact with word usage. As we mentioned before language production could be seen as the generation of meaningful phrases, and may be similar to the assembly of a sequence of motor actions
aimed at reaching a goal [3, 4, 22]. The purpose of spoken or written phrases is to transmit information by means of expressions that can be understood. Neural modeling challenges us to reach this goal by triggering an appropriate chain of meaningful words. Let us suppose that a phrase could be represented by a string:
$$F(a, n) = \langle a_{a1}, a_{a2}, \ldots, a_{an} \rangle, \quad a_{ai} \in Sem\{a_1, a_2, \ldots, a_x\}, \quad (13)$$
with Sem being the very large set of words in a normal lexicon. The phrase can repeat words, and consequently it is possible to have $a_{ai} = a_{aj}$. Now, how do we ensure that $a_{a1}$ precedes $a_{a2}$? Moreover, how does the meaning of the phrase guide the correct order of successive words while information is transmitted? A possible answer to the first question would be to assume that the transition probabilities between words are responsible for the correct sequence, with a given word followed by its most probable successor. Within this framework, language production is mainly represented by a stochastic process with transition probabilities dependent on segments of previously used words [23–26]. The second question seems to imply the existence of an anticipatory layout for the phrase. Here, we explore the following proposal. Imagine a small string of three words $\langle a_{a1}, a_{a2}, a_{a3} \rangle$ representing a miniature phrase. Let us immerse these elements in contexts, generating a new string
$$\langle\, p_{targ} \otimes a_{a1} \otimes p_{1},\; p_{1} \otimes a_{a2} \otimes p_{2},\; p_{2} \otimes a_{a3} \otimes p_{end} \,\rangle. \quad (14)$$
The neural vector $p_{targ}$ is both the context that triggers the sequence and, concurrently, the target code. Contexts $p_1$ and $p_2$ are keys indicating the correct next element of the string, and context $p_{end}$ marks the end of the phrase. In this way, a good sequence of words is selected by the contextual string
$$\langle p_{targ}, p_{1}, p_{2}, p_{end} \rangle. \quad (15)$$
A recursive tensor input-output memory with the structure
$$S = (p_{end} \otimes a_{a3})(p_{2} \otimes a_{a2})^{T} + (p_{2} \otimes a_{a2})(p_{1} \otimes a_{a1})^{T} + (p_{1} \otimes a_{a1})\, p_{targ}^{T} \quad (16)$$
can accomplish the procedure just described. In the general case, the final output can be a “pure” string of words, $\langle a_{a1}, a_{a2}, \ldots, a_{an} \rangle$. The contexts, used in an internal, hidden computation, are channeled by a filter Way Out Matrix (WOM) having the structure
$$WOM = \left( \sum_{k} p_{k}^{T} \right) \otimes I_{\dim(a)}. \quad (17)$$
Fig. 1. This diagram illustrates how a context target enters a recursive semantic network S triggering a sequence of contextualized outputs. These outputs are filtered by a WOM matrix that extracts the contexts and produces a pure word string.
The sum includes all the relevant contexts, and $I_{\dim(a)}$ is an identity matrix with the same dimension as the word vectors. Note that
$$WOM\, (p_{c} \otimes a_{h}) = a_{h}. \quad (18)$$
In Fig. 1 we illustrate this recursive model for a string of arbitrary length. The neurobiology of lexical strings production is far from being understood. We can consider the voluntary construction of utterances by our model in light of William James’ thought. Our model requires the previous existence of permanent memories of words and contextual markers, and a transitory working memory to install the appropriate string format. Finally, we mention that the target ‘feeds and builds’ contexts to generate meaningful strings in the same way that the target of a mechanical movement of our arm guides the intermediate steps needed to reach it.
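A small numerical sketch of the recursive generator defined by Eqs. (16)–(18) is given below. The word and context codes are arbitrary unit vectors, and, to keep all inputs of S in the same product space in a runnable demo, the trigger is represented as p_targ ⊗ a_0 with a dummy start word a_0 — an implementation choice made for this example, not part of the model as stated.

```python
# Sketch of the recursive string generator of Eqs. (16)-(18) in NumPy.
import numpy as np

def unit(dim, k):
    v = np.zeros(dim); v[k] = 1.0
    return v

dim_w, dim_c = 4, 4
w0, w1, w2, w3 = (unit(dim_w, i) for i in range(4))          # w0: dummy start word
p_targ, p1, p2, p_end = (unit(dim_c, i) for i in range(4))   # contexts

# Eq. (16)-like recursive memory S; the trigger is embedded as p_targ (x) w0
# so that all input patterns have the same dimension.
S = (np.outer(np.kron(p_end, w3), np.kron(p2, w2)) +
     np.outer(np.kron(p2, w2),   np.kron(p1, w1)) +
     np.outer(np.kron(p1, w1),   np.kron(p_targ, w0)))

# Eq. (17): way-out matrix that strips any of the relevant contexts
WOM = np.kron((p1 + p2 + p_end).reshape(1, -1), np.eye(dim_w))

state, phrase = np.kron(p_targ, w0), []
for _ in range(10):                          # safety bound on the recursion
    state = S @ state                        # next contextualized output
    if not state.any():                      # nothing stored for this input
        break
    phrase.append(WOM @ state)               # Eq. (18): extract the pure word
print([int(np.argmax(w)) for w in phrase])   # -> [1, 2, 3]: w1, w2, w3 in order
```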
4 Clustering by Contexts The memory H given in Eq. (10), with sets of different input-output associations sharing the same pair of input-output contexts, can be factorized into clusters of associations induced by the contexts,
$$H = \sum_{i} \left( p'_{i} p_{i}^{T} \right) \otimes \left[ \sum_{j} g_{ij} f_{ij}^{T} \right]. \quad (19)$$
This partition suggests how scattered data may be organized in large neural networks. Contexts may create a topical coherence in a recall. Let us mention that an interesting formal parallelism between matrix memories and the Latent Semantic Analysis (LSA) has been described in [19]. In this direction, the structure of matrices (10) and (19) suggests the possibility of looking for the thematic clustering of textdocument matrices using, instead of a classical LSA based on SVD, a procedure that labels topics via the search of Kronecker factors.
436
E. Mizraji et al.
If we use as contexts unit vectors es (vectors with a 1 in position s and 0’s otherwise), the matrix H can be expressed as: X H¼ e0i eTi Mij ; ð20Þ i
with Mij ¼
X
gij f Tij
ð21Þ
being a classical Anderson-Kohonen associative memory matrix. By an adequate selection of dimensions for the context unit vectors, it is possible to generate a topographic pattern with different associative memories M placed as tiles into the “host” matrix H (Pomi, Mizraji and Lin, paper submitted). We illustrate this point with a simple example. Given the two unit column vectors e1 ¼ ½1 0T ; e2 ¼ ½0 1T and four associative memory matrices, MðmÞ 2 Rpq ; m ¼ 1; . . .; 4 H takes the form H ¼ e1 eT1 Mð1Þ þ e1 eT2 Mð2Þ þ e2 eT1 Mð3Þ þ e2 eT2 Mð4Þ :
ð22Þ
After computing the Kronecker products we find
Mð1Þ H¼ Mð3Þ
Mð2Þ ; H 2 R2p2q : Mð4Þ
ð23Þ
Thus, the contexts create a computational layer composed by various memory modules located in specific topographies, each one able to receive and redirect information selectively channeled by the contexts. Kohonen [27] developed one of the most important and deep procedures to model the generation of topographic neural patterns. The approach we are describing here assumes cognitive supervised learning. One could imagine associative memories to be the result of active interactions between a trainable brain and an external instructor–an active human teacher or environmental experiences. Hence, emergent clusters of associative memories may explain how, after extensive vocabulary learning, complex semantic webs can be established. We want to mention that the results of Huth et al. [28] experimentally illustrate the existence of a remarkable topographic organization in the semantic web of the human brain.
Improving Neural Models of Language with Input-Output Tensor Contexts
437
5 Episodes Since the foundational characterization of episodic memories by Tulving (updated in [29]), the search for their neural bases became an important research objective [30–34]. Adapting ideas of these investigators, we shall assume that episodic memories result from the interaction of different classes of memories, fundamentally, a semantic memory and a context memory that stores episode markers. We illustrate the interaction between these memory modules in Fig. 2.
Fig. 2. This scheme adapts to our model one of the conceptions about episode storage and retrieval. LH: Left hemisphere, RH: Right hemisphere, SM: Semantic Memory, CM: Contexts Memory.
We are going to assume that the encoding happens mainly in a region capable of sustaining a semantic memory (e.g.: the left prefrontal cortex) and the recall involves a region that stores contextual markers (e.g.: the right prefrontal cortex). The model we want to comment is formally similar to the model that generates semantic strings. However, there is a crucial difference: in episodes we do not necessarily have a target. A contingent series of events is stored in the memory due to a variety of causes, among others, emotional impact, autobiographical importance, bizarre consequences, etc. In these episodic sequences, contexts provide a kind of positional information–an expression of the embryologist Lewis Wolpert–that places words in the precise positions needed to recreate the episode. Let us define an episode by a time sequence of contexts that intermingle with words selected from the semantic memory. The sequence of contexts can be generated by a cyclic memory structured as: C ¼ pout pTn þ pn pTn1 þ þ p1 pTin :
ð24Þ
Context vector pin marks the beginning of the sequence, and context pout marks the end. Within a recursive network, the reinjection of successive outputs of memory C creates the time pattern hpout ; pn ; . . .; pi ; pin i:
ð25Þ
438
E. Mizraji et al.
Intermingling these contexts with words ai extracted from the semantic memory, builds the episodic sequence hðpout an pn Þ; ðpn an1 pn1 Þ; . . .; ðp3 a2 p2 Þ ; ðp2 a1 pin Þi:
ð26Þ
We are going to model this situation by assuming that intermingling occurs because the semantic memory is structured with associative memories that can be approximated by matrices like E¼
X
T p0ik aij pik aij ;
ð27Þ
i;j;k
with the particularity that context markers are very sparse vectors (e.g.: unit vectors). The total set of stored episodes can be based on a semantic basis of N words, N being very large. A given memory cannot store all this variety due to dimensional limitations. But memories like (25) can surpass the dimensional limitations imposed by neuroanatomy and enlarge the variety of episodes via a multi-modular semantic organization. The final step of the episodic recall can be a pure verbal string emerging from a WOM filter. We end this Section by mentioning that there is a close relationship between remembered episodes, and episodes created by the imagination. A fictional story does not travel to the autobiographical past, but creates episodes that we can recall even if such episodes are placed in the far past or future. This shows an interesting point concerning the possible coincidence between the neural systems responsible for the recall of personal biographical episodes and the imaginary generation of fictional facts (see [35, 36] for extensive references about this point), including the conception of innovative literary, philosophical, scientific, or technological scenarios.
6 Perspectives In this work we have assumed that a semantic unit, integrated with many contexts, could participate in a large variety of different linguistic tasks. The described models are written in terms of matrix algebra and Kronecker tensor products, which makes them operationally transparent and easily amenable to computer implementation, even though the dimensions involved in these linguistic tasks can be extremely large. In any case, the highly flexible production of organized, non-random sequences of words in a natural language is a marvelous and yet obscure process. The topical organization of a biological semantic web, with patches including elaborate pieces of language could plausibly be a basis for the hierarchical elaboration of complex thoughts. These thoughts are translated into linguistic codes and communicated. In a way, “deep learning” technological procedures involving a system of hierarchical computing levels, are already implemented by the human brain. We need to understand these codes, which in many cases, can be accompanied by linguistic productions. A simplified example of this kind of hierarchical processing is given in [20]. Finally, the recreation, or invention of episodes represents one of the most significant signatures of
the human mind and is placed, by researchers like Tulving [29], at the highest levels of cognition. With tensor input-output contexts we have been able to formulate an elementary approach to the modeling of these open and crucial problems. Acknowledgments. AP and EM acknowledge partial financial support by PEDECIBA and CSIC-UdelaR.
References 1. Luria, A.R.: The Working Brain. Basic Books, New York City (1973) 2. Kimura, D.: Neuromotor mechanisms in the evolution of human communication. In: Steklis, H.D., Raleigh, M.J. (eds.) Neurobiology of Social Communication in Primates, pp. 197–219. Academic Press, New York (1979) 3. Calvin, W.H.: A stone’s throw and its launch window: timing precision and its implications for language and hominid brains. J. Theor. Biol. 104, 121–135 (1983) 4. Calvin, W.H.: The unitary hypothesis: a common neural circuitry for novel manipulations, language, plan-ahead, and throwing? In: Gibson, K.R., Ingold, T. (eds.) Tools, Language, and Cognition in Human Evolution, pp. 230–250. Cambridge University Press, Cambridge (1993) 5. Ojemann, G.A.: Brain organization for language from the perspective of electrical stimulation mapping. Behav. Brain Sci. 6, 189–206 (1983) 6. Anderson, J.A.: A simple neural network generating an interactive memory. Math. Biosci. 14, 197–220 (1972) 7. Anderson, J.A.: An introduction to neural networks. MIT Press, Cambridge (1995) 8. Cooper, L.N.: A possible organization of animal memory and learning. In: Lundquist, B., Lundquist, S. (eds.) Proceedings of the Nobel Symposium on Collective Properties of Physical Systems, pp. 252–264. Academic Press, New York (1973) 9. Kohonen, T.: Correlation matrix memories. IEEE Trans. Comput. C-21, 353–359 (1972) 10. Kohonen, T.: Associative Memory: A System Theoretical Approach. Springer, Heidelberg (1977). https://doi.org/10.1007/978-3-642-96384-1. Chap. 3 11. Beim Graben, P., Potthast, R.: Inverse problems in dynamic cognitive modeling. Chaos Interdiscip. J. Nonlinear Sci. 19, 015103 (2009) 12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by backpropagating errors. Nature 323, 533–536 (1986) 13. Carmantini, G.S., Beim Graben, P., Desroches, M., Rodrigues, S.: A modular architecture for transparent computation in Recurrent Neural Networks. Neural Netw. 85, 85–107 (2017) 14. Mizraji, E.: Context-dependent associations in linear distributed memories. Bull. Math. Biol. 51, 195–205 (1989) 15. Smolensky, P.: Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif. Intell. 46, 159–216 (1990) 16. Graham, A.: Kronecker Products and Matrix Calculus With Applications. Ellis Horwood, Chichester (1981) 17. Pomi, A., Mizraji, E.: Semantic graphs and associative memories. Phys. Rev. E 70, 066136 (2004) 18. Pomi, A.: A possible neural representation of mathematical group structures. Bull. Math. Biol. 78, 1847–1865 (2016) 19. Mizraji, E., Pomi, A., Valle-Lisboa, J.C.: Dynamic searching in the brain. Cogn. Neurodyn. 3, 401–414 (2009)
20. Mizraji, E., Lin, J.: Modeling spatial-temporal operations with context-dependent associative memories. Cognit. Neurodyn. 9, 523–534 (2015) 21. James, W.: Principles of Psychology. The Great Books of the Western World, vol. 53. The University of Chicago (1890) 22. Nishitani, N., Schürmann, M., Amunts, K., Har, R.: Broca’s region: from action to language. Physiology 20, 60–69 (2005) 23. Jurafsky, D., Bell, A., Gregory, M., Raymond, W.D.: Probabilistic relations between words: evidence from reduction in lexical production. Typol. Stud. Lang. 45, 229–254 (2001) 24. Jurafsky, D.: Probabilistic modeling in psycholinguistics: linguistic comprehension and production. In: Bod, R., Hay, J., Jannedy, S. (eds.) Probabilistic Linguistics, p. 21. MIT Press, Cambridge (2003). Chap. 3 25. Nowak, M.A., Komarova, N.L., Niyogi, P.: Computational and evolutionary aspects of language. Nature 417, 611–617 (2002) 26. Chater, N., Manning, C.D.: Probabilistic models of language processing and acquisition. Trends Cognit. Sci. 10, 335–344 (2006) 27. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1997). https://doi.org/10.1007/ 978-3-642-97966-8 28. Huth, A.G., Nishimoto, S., Vu, A.T., Gallant, J.L.: A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 76, 1210–1224 (2012) 29. Tulving, E.: Episodic memory. Annu. Rev. Psychol. 53, 1–25 (2002) 30. Baddeley, A.: Working memory: looking back and looking forward. Nat. Rev. Neurosci. 4, 829–839 (2003) 31. Jonides, J.R., et al.: The mind and brain of short-term memory. Ann. Rev. Psychol. 59, 193– 224 (2008) 32. Repovs, G., Baddeley, A.: The multi-component model of working memory: explorations in experimental cognitive psychology. Neuroscience 139, 5–21 (2006) 33. Eichenbaum, H.: Prefrontal–hippocampal interactions in episodic memory. Nature Rev. Neurosci. 18, 547–558 (2017) 34. Schapiro, A.C., Turk-Browne, N.B., Botvinick, M.M., Norman, K.A.: Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philos. Trans. R. Soc. Lond B 372, 20160049 (2017) 35. Schacter, D.L., et al.: The future of memory: remembering, imagining, and the brain. Neuron 76, 677–694 (2012) 36. Schacter, D.L., Benoit, R.G., Szpunar, K.K.: Episodic future thinking: mechanisms and functions. Curr. Opin. Behav. Sci. 17, 41–50 (2017)
Sociolinguistic Variability of Predicate Groups in Colloquial Russian Speech

Anfisa Naumova
Saint Petersburg State University, Universitetskaya nab. 11, St. Petersburg 199034, Russia
[email protected]
Abstract. The paper is devoted to the study of linear and structural orders in the syntactic constructions of colloquial Russian speech. The quantitative and structural characteristics of predicate groups in utterances of oral speech are examined with the aim of revealing their typical structures and analysing them further in the sociolinguistic aspect. The paper describes the typical structures of predicate groups, presents their quantitative analysis and considers their correlation with speakers' social characteristics. The study was based on the material of the speech corpus 'One Day of Speech', the largest resource for studying spoken language, which is being developed at St. Petersburg State University. 11 macroepisodes of everyday communication were analyzed for 10 respondents, 5 men and 5 women, who are representatives of 6 different professional groups. Manual syntactic and automatic morphological annotation of predicate groups was carried out and their analysis was conducted. The data obtained were verified using statistical methods, yielding mathematically reliable conclusions such as: (1) the size of predicate groups does not depend on the sex of the speaker; (2) the average size of predicate groups in the speech of young people is greater than in that of the middle-aged; (3) the size of predicate groups changes primarily due to the left distance; (4) the size of the most highly ranked POS-tagged syntactic structures is only 1–2 elements; (5) the number of verbal predicate groups in female speech is 8% greater than that in male speech.

Keywords: Spoken Russian language · Syntax · Predicate groups · Everyday speech · Speech corpus
1 Problem Statement

The everyday speech of a person is influenced by a variety of factors that may refer not only to linguistics, but also to physiology, sociolinguistics, psycholinguistics, pragmatics, cognitive science, semiotics, and anthropology. The sociolinguistic aspect covers such sociological indicators as age, gender, profession, level of speech competence [1] for native speakers and level of language proficiency for foreigners, relations between speakers, and others. All this determines the need for an interdisciplinary approach in the study of everyday spoken speech [2–4] and, in particular, for the identification of the most significant sociological indicators. This problem has become one of the tasks of the given
research. What factors have the most significant effect on the way we speak? Does our speech depend only on ourselves, or is it also determined by factors that we cannot change? How exactly can our social status affect our speech? This article attempts to answer some of these questions.
2 Speech Material

As research material, the corpus of oral texts “One Day of Speech” (the ORD corpus) [5–8] was used. This project is being developed at the philological faculty of St. Petersburg State University in order to analyze everyday speech at various levels in an interdisciplinary aspect. The principle of audio recording follows from the name of the corpus: the informant records all of his/her speech communication during the day on a recorder. The material obtained in this way is much more natural than recordings made in a laboratory. The participants of the experiment also fill out social and psychological questionnaires, which allows a certain set of metadata to be entered into the corpus, including the informants' psycholinguistic and sociolinguistic characteristics. A lot of linguistic research has already been done on the material of this corpus (e.g., [9–12]). Today the ORD corpus has more than 1,250 h of audio recordings received from 128 informants and more than 1,000 of their communicants, representatives of various social groups. The research material consists of 2,800 macroepisodes of speech communication [13]. The ORD corpus gives the researcher an opportunity to study spontaneous everyday speech not only from a purely linguistic point of view, but also from the position of psycho- and sociolinguistics. For this study, a sample subcorpus was specially selected from the ORD corpus, in which gender and age social groups were balanced and various professional groups were represented. For this purpose, 11 macroepisodes referring to typical everyday settings were selected. Thus, the material of the study consists only of the urban speech of Saint Petersburg's citizens, and each gender, social and professional group includes 1–2 informants. All speech material was manually annotated at the syntactic level, and 830 predicate groups were identified. The annotation was made manually by one expert. The guidelines were formed from the rules for coding utterances proposed by P.V. Rebrova [14]. These rules were modified to meet the objectives of this study. Thus, a classification of categories was obtained, with the following notation: D – discursive words; F – phraseological units; Inf – infinitive; N – negation; Q – question words; S – subject (noun/pronoun); V – verb, conjugated form; Y – agreement; Z – other particles; A – attribute (adjective); B – adverbial modifier; M – addressing; H – negative particle “not”; O1 – direct object; O2 – indirect object; O3 – object with a preposition.

• poetomu *S/doma krasnogo netu // (woman, 28 years old, educator)
  therefore *laugh/there isn't red (marker) at home //
  CONJ2 B O1 PRED
The distribution of predicate groups by gender, age and professions is shown in Table 1. The proposed professional categories may intersect with each other, as one person may be engaged simultaneously in more than one group (e.g., a lecturer in philosophy would be assigned to both humanities and education).

Table 1. Distribution of predicate groups by gender, age and occupations.

Gender groups:
– Speech of male informants (332 predicate groups)
– Speech of female informants (498 predicate groups)

Age groups:
– Speech of informants up to 30 years inclusive (263 predicate groups)
– Speech of informants from 31 to 45 years (339 predicate groups)
– Speech of informants over 45 years (228 predicate groups)

Professional groups:
– Speech of IT specialists (153 predicate groups)
– Speech of employees in the education sphere (244 predicate groups)
– Speech of office workers (198 predicate groups)
– Speech of representatives of creative professions (88 predicate groups)
– Speech of representatives of the humanities (98 predicate groups)
– Speech of representatives of law enforcement (49 predicate groups)
Many scientists admit that in oral speech there are no unambiguous criteria for delimiting a sentence. For this reason, researchers propose other categories for the designation of “oral sentences”: statements [15], clauses [16], elementary discursive units [17] and others. In this paper, we decided to investigate oral speech from the point of view of predicate groups. The choice of the predicate group as the unit of study is explained by the fact that in this study we are oriented toward the syntactic aspect of speech. The concept of a predicate group is closest to the clause; however, not only the verb but also other parts of speech used in the role of the predicate can act here as the “sentence core”. So, in this article the predicate group is understood as a predicate and its environment, which is divided into syntactic elements. These can include, in particular, units formally independent of the predicate (for example, particles or interjections) which fall into the linear chain of words of the predicate group. A syntactic element is a graphical word, as well as such cases of fusion of graphic words that cannot be separated by the insertion of another word, phraseological fusions [18] and composite conjunctions. Various speech disfluencies, breaks, fillers, and non-verbal hesitations were not taken into account here as having only an indirect relation to the syntactic structure of utterances.
3 Research Methodology The data were analyzed by standard statistical methods. Each predicate group was manually annotated, after which its size, the left and right distances of each group were measured. With the help of statistical methods, the most frequent syntactic constructions for the sample as a whole and for each social group were identified. Analysis of the syntactic structure of predicate groups was carried out by means of the morphological parser TreeTagger. All predicate groups were automatically annotated, after which a manual correction of the results was carried out, and the frequency lists were compiled. Thus, 12 ranked lists of POS-tagged syntactic structures of predicate groups in Russian oral speech were created for: (1) all speakers; (2) men; (3) women; (4) informants up to 30 years; (5) informants from 31 to 45 years; (6) informants older than 46 years; (7) representatives of the humanities; (8) IT professionals; (9) employees in the education sphere; (10) office workers; (11) representatives of law enforcement and (12) representatives of creative professions. On the basis of the data obtained, a number of conclusions were made regarding possible trends in the speech of particular social groups. These findings were verified using statistical methods (such as the Student’s test and the Fisher’s test), which made it possible to identify unreliable conclusions from a mathematical point of view and to identify those that can be considered as statistically significant.
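The kind of significance check described above can be illustrated with a minimal sketch. The per-group arrays of predicate-group sizes below are placeholders, not the actual corpus counts, and the two-sample t-test is only one of the checks mentioned.

```python
# Hypothetical check: do two social groups differ in mean predicate-group size?
from scipy import stats

sizes_young = [5, 4, 6, 3, 5, 7, 4, 5, 6, 4]    # placeholder sizes, younger group
sizes_middle = [3, 4, 3, 5, 4, 3, 4, 3, 4, 5]   # placeholder sizes, middle group

t_stat, p_value = stats.ttest_ind(sizes_young, sizes_middle, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at p <= 0.01: {p_value <= 0.01}")
```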
4 Quantitative Analysis Using quantitative and statistical analysis, it was found that the average size of the predicate group is 4.28 elements. The average value of the left distance is 2.49 elements, the right – 0.8 elements. Thus, the average predicate group of Russian spontaneous oral speech can be quantitatively represented as 2:x:1 (where 2 is the number of elements before the predicate, x is a predicate and 1 is the number of elements after the predicate). The data obtained make it possible to calculate typical quantitative predicate groups for each social group presented in Table 2. The results obtained were compared within each social category: gender, age and professional sphere of informants. The study showed that the average predicate group size for men is 4.29 elements, the left distance is 2.44 elements, the right distance is 0.85 elements. Similar data for female speech are: 4.28 elements, 2.52 elements and 0.76 elements, respectively. Thus, a significant dependence of the quantitative characteristics of predicate groups on the sex of the speaker is not traced. When analyzing differences in the size of predicate groups for speakers from different age groups, it was found that in the middle age group the average predicate group size was the smallest (3.9 elements), and in the younger group – the largest (4.67 elements). At the same time, the difference in the size of predicate groups in all age groups is created exclusively due to the left distance with almost the same right (0.8, 0.81 and 0.76 elements for the younger, older and middle groups, respectively).
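A small sketch of how the left:x:right scheme could be derived from annotated material is given below; the (left, right) pairs are illustrative placeholders rather than real annotations, and the computation simply mirrors the definitions used above.

```python
# Each predicate group is reduced to (left distance, right distance) in elements.
groups = [(3, 1), (2, 0), (2, 1), (4, 1), (1, 0)]   # placeholder annotations

left = sum(l for l, _ in groups) / len(groups)
right = sum(r for _, r in groups) / len(groups)
size = left + right + 1                              # +1 for the predicate itself

print(f"avg size = {size:.2f}, scheme = {round(left)}:x:{round(right)}")
```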
Table 2. Schemes of typical predicate groups for each social group.

Social group | Scheme of a typical predicate group
Men | 2:x:1
Women | 3:x:1
Younger age group (up to 30 years) | 3:x:1
Middle age group (31–45 years) | 2:x:1
Senior age group (from 46 years) | 3:x:1
Humanitarians | 3:x:1
Employees in the education sphere | 3:x:1
IT-specialists | 2:x:1
Office workers | 2:x:1
Representatives of law enforcement | 3:x:0
Representatives of creative professions | 2:x:1
To determine the reliability of this conclusion, Student's test was used. When comparing the samples for the younger and middle age groups, the Student's t-criterion turned out to be in the significance zone at the critical level p ≤ 0.01, which means that the conclusions made on the basis of a comparison of these two age groups are reliable. The greatest differences were revealed in the speech of representatives of various professional groups. The maximum average size of the predicate group was, as expected, found among employees in the education sphere (4.8 elements) and humanitarians (4.7 elements), in comparison with informants from other professions (from 3.69 to 4.03 elements). Although the amount of material does not allow us to speak with certainty about any regularities that the revealed differences may indicate, it is possible at this stage to assume that professional affiliation has the greatest influence on the quantitative characteristics of the speaker's speech units. The rank distribution of the frequency of predicate groups also allowed us to see certain patterns. The most frequent predicate group in male speech consists of 5 elements, the second most frequent of 4, and the third of 3, while female speech shows the reverse picture. Whether this is a trend or an accident should be decided on a greater amount of material. Comparison of the frequency of predicate groups in the speech of informants of different age groups showed an interesting result. The most frequent predicate group in the younger age group consists of 5 elements, while the most frequent predicate groups in both the middle and the older age groups consist of 3 elements. The rank distribution of the size of predicate groups by frequency is generally similar for informants from 31 years old, which makes it possible to identify young people as tending to use broader predicate groups. Examples show that their size is most often achieved due to discursive words:

• to yest' kak by dazhe yesli ya podnimayu ruku (woman, 20 years old, humanitarian)
  that is, as it were, even if I raise my hand
• u nikh tseny tam voobshche // (man, 24 years old, office worker) they have prices there absolutely (low) // However, Student’s test showed that this output is statistically significant only for the pair youngest age group VS older age group. The greatest differences in the size of the predicate groups were found in the speech of different professional groups of speakers. The obtained results, however, should be checked on a larger sample of the material, since in calculating their reliability by the Student’s test, empirical values turn out to be in the zone of significance only when comparing those professional groups that are represented by the largest amount of material. Despite the fact that professional groups showed the most significant differences, it is difficult to conclude about their specific features, as the material for some groups is not enough to obtain reliable data. However, based on the data obtained, it is possible to propose a hypothesis that the professional factor affects the quantitative characteristics of speech units more strongly.
5 Structural Analysis

Not only quantitative but also qualitative characteristics of predicate groups are of interest when studying everyday syntax. First of all, it seems worthwhile to analyze the structure of predicate groups, in particular their POS-tagged syntactic structure. As a result of automatic annotation and further manual correction, a set of 830 POS-tagged structures was obtained. Predicate groups fall into two main types: (1) verbal predicate groups and (2) non-verbal predicate groups. In this study, we consider a predicate group as verbal if it has a verb as its core. The non-verbal predicate groups are those whose core is another part of speech (category of state, compound nominal predicate, etc.). Among the analyzed predicate groups, 658 (79%) were verbal and 172 (21%) were non-verbal, that is, their ratio for all speakers is 4:1. The deviations from this ratio in different social groups are of interest. The share of verbal predicate groups in female speech is slightly higher than in male speech (82% versus 74%), and the same indicators in different age groups were almost equal (81% in the older group, 79% in the younger and middle groups). The greatest difference in the share of verbal and non-verbal predicate groups is found among representatives of different professional groups (from 61% verbal predicate groups for representatives of law enforcement to 89% for office workers). The reliability of the findings was checked by statistical methods; for this, Student's test and Fisher's test were used. Checking the conclusion about the difference in the distribution of the types of predicate groups among age groups has shown that it cannot be considered reliable. Checking the conclusion about the difference in the distribution of the types of predicate groups between gender groups showed that it can be considered reliable: the empirical value of t by Student's test for these groups is 3.1 at a critical value of 2.56 (for p ≤ 0.01), and the empirical value of u by Fisher's test is 2.78 at a critical value of 2.31 (for p ≤ 0.01); therefore, both empirical values are in the significance zone, since they are higher than the critical values.
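A minimal sketch of this kind of proportion check is given below. It assumes that the "Fisher's test" used here is the angular (arcsine) transformation criterion φ*, which is consistent with the quoted critical value of 2.31 for p ≤ 0.01; the counts fed in are taken from the shares reported above and are only approximate.

```python
import math

def fisher_phi_star(p1, n1, p2, n2):
    """Angular transformation criterion for comparing two proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2) * math.sqrt(n1 * n2 / (n1 + n2))

# Share of verbal predicate groups: female speech vs. male speech.
phi_emp = fisher_phi_star(0.82, 498, 0.74, 332)
print(f"phi* = {phi_emp:.2f}, significant at p <= 0.01: {phi_emp > 2.31}")
```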
As for age and professional groups, Student's and Fisher's criteria allow us to compare only two samples, so these groups should be compared in pairs. The empirical value of t by Student's test is 0.6 and 0.3 for the younger–middle and middle–senior pairs respectively, at a critical value of 2.58 (for p ≤ 0.01), and the empirical value of u by Fisher's test is 0.425 and 0.292 for the same pairs respectively, at a critical value of 2.31 (for p ≤ 0.01); therefore all empirical values are outside the zone of significance. When checking the conclusion about the difference in the distribution of different types of predicate groups among professional social groups, it turned out that it can be considered reliable, but only for the pair representatives of law enforcement VS office workers. In addition, a ranked list of 610 found POS-tagged syntactic structures was compiled (its part is shown in Table 3).

Table 3. Frequency of POS-tagged syntactic structures of predicate groups.

# | Structure | Quantity | Percentage | Rank | Predicate group scheme | Example | Translation
1 | V | 45 | 5.42 | 1 | x | skhodite // | go //
2 | S-PRO V | 17 | 2.05 | 2 | 1:x | ya ponyala // | I understood //
3 | PART V | 16 | 1.93 | 3 | 1:x | ne znayu // | (I) don't know //
4 | CONJ S-PRO V | 15 | 1.81 | 4 | 2:x | chto ty noyesh'? | what do you want?
5 | PRAEDIC | 13 | 1.57 | 5 | x | prikol'no // | cool //
6 | S V | 12 | 1.45 | 6 | 1:x | eksport nakroyetsya // | export will break down //
7 | CONJ V | 10 | 1.2 | 7 | 1:x | yesli budet | if there will be
8 | ADVPRO V | 10 | 1.2 | 7 | 1:x | seychas () posmotryu / | (I)'ll look now /
9 | PART PART V | 9 | 1.08 | 8 | 2:x | nu ne znayu / | well (I) don't know /
Thus, the most frequent syntactic structures of predicate groups were identified. By type, there are 8 verbal and 1 non-verbal structures among them. By content, the top-ranked structure consists of one verb. The 3rd, 7th and 9th ranks belong to structures consisting of a verb with an auxiliary part of speech. The 2nd and 4th ranks belong to structures in which an object is added to the verb. It also turned out that the size of the top-ranked syntactic structures is only 1–2 elements, and their right distance is invariably zero. Within the structures of predicate groups, it is also interesting to make a comparative analysis of different social groups. This was done in the following way: a rating of the 10 most frequent syntactic structures was compiled for all speakers and for each social category, and then these ratings were compared.
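As a rough illustration of how such per-group ratings could be compiled, the sketch below counts POS-tagged structures per social group with a simple counter; the record list and group names are placeholders, not corpus data.

```python
from collections import Counter

# (POS-tagged structure, social group of the speaker) -- placeholder records
records = [("V", "office workers"), ("S-PRO V", "office workers"),
           ("V", "IT specialists"), ("PART V", "humanitarians"),
           ("V", "office workers")]

def top_structures(records, group, k=10):
    counts = Counter(struct for struct, g in records if g == group)
    return counts.most_common(k)

print(top_structures(records, "office workers"))
```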
The comparison of the 10 most frequent POS-tagged syntactic structures of predicate groups in different age groups also allows us to make some observations. These findings were also checked for reliability using Fisher's test. It turned out that only one conclusion can be considered statistically significant: the structure “PRAEDIC” is rather typical for the middle-aged group (rank II), but rarely occurs in the speech of youth (rank XVII). The significance of the other observations was not confirmed statistically. Comparison of professional groups also reveals some differences. According to Fisher's test, some conclusions were unreliable and some were in the zone of uncertainty. However, there were five statistically significant observations:
– The structure “V” is the most frequent in the professional groups of IT specialists and office workers, but is not typical for representatives of law enforcement.
– The structure “S-PRO V” has rank II in the speech of office workers, but is not typical for representatives of law enforcement.
– The structure “PART V” has high ranks in the speech of humanitarians and office workers (I and III respectively), while it is not typical for employees in the education sphere and representatives of creative professions.
– The structure “CONJ S-PRO V” is the most frequent in the professional group of employees in the education sphere, and at the same time it is not typical for humanitarians and representatives of creative professions.
– The structure “PRAEDIC” is the most frequent in the speech of representatives of creative professions, but it rarely occurs in the speech of employees in the education sphere, IT specialists and office workers.
Such a number of values in the zone of significance for professional social groups allows us to speak about the greatest degree of influence of professional affiliation on the structural organization of predicate groups in everyday speech.
6 Conclusions

The analysis of the data made it possible to come to a number of conclusions that were verified using statistical methods. According to them, the following observations can be considered reliable.
1. The quantitative characteristics of predicate groups in Russian oral speech, apparently, do not depend on the sex of the speaker.
2. The average size of predicate groups in the speech of young people is greater than in that of the middle-aged.
3. The size of predicate groups is changed primarily due to the left distance, while the right distance for all informants, regardless of sex, age and occupation, ranges from 0 to 1 element. This result confirms the hypothesis that left-branching verbal groups prevail in spoken Russian [19].
4. The size of the most highly ranked POS-tagged syntactic structures is only 1–2 elements, and their right distance equals 0.
5. The number of verbal predicate groups in female speech is 8% greater than that in male speech.
Besides, there are a number of other reliable observations, mostly about professional groups, concerning more specific matters. It should be mentioned, though, that these conclusions are based upon a rather small volume of sociolinguistic speech material. At this stage, not all the observations described above seem to be sufficient to identify the diagnostic features of the social groups under study, but they show well the potential of the methods used, by which such diagnostic features can be identified.

Acknowledgements. The research is supported by the Russian Foundation for Basic Research, project # 17-29-09175 “Diagnostic Features of Sociolinguistic Variation in Everyday Spoken Russian (based on the Material of Sound Corpus)”.
References 1. Bogdanova, N.: Uroven’ rechevoy kompetentsii kak real’naya sotsial’naya kharak-teristika govoryashchego, opredelyayushchaya yego rech’ [The level of speech competence as a real social characteristic of the speaker, which determines his speech]. In: Asinovskiy, A., Bogdanova, N. (eds.) XXXVIII Mezhdunarodnaya Filologicheskaya Konferentsiya [XXXVIII International Philological Conference] 2009, vol. 22, pp. 29–40. SaintPetersburg (2010). (in Russian) 2. Kanu, A.: Reflections in communications. An Interdisciplinary Approach. University Press of America, Lanham (2009) 3. Kreiman, J., Sidtis, S.: Foundations of voice studies. An Interdisciplinary Approach to Voice Production and Perception. Wiley, New York (2011) 4. Potapova, R., Potapov, V., Lebedeva, N., Agibalova, T.: Mezhdistsiplinar-nost’ v issledovanii rechevoy poliinformativnosti [Interdisciplinarity in the study of speech polyinformativity]. Yazyki slavyanskoy kul’tu-ry [World of Slavic Culture] 3, 82–95 (2016). (in Russian) 5. Asinovsky, A., Bogdanova, N., Rusakova, M., Ryko, A., Stepanova, S., Sherstinova, T.: The ORD speech corpus of Russian everyday communication “One Speaker’s Day”: creation principles and annotation. In: Matouˇsek, V., Mautner, P. (eds.) TSD 2009. LNCS, vol. 5729, pp. 250–257. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-64204208-9_36 6. Bogdanova-Beglarian, N., et al.: Sociolinguistic extension of the ORD corpus of Russian everyday speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 659–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-31943958-7_80 7. Bogdanova-Beglarian, N., Sherstinova, T., Blinova, O., Martynenko, G.: An exploratory study on sociolinguistic variation of Russian everyday speech. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 100–107. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_11 8. Sherstinova, T.: Macro episodes of Russian everyday oral communication: towards pragmatic annotation of the ORD speech corpus. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 268–276. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7_33
9. Zobnina, E.: Perspektivy ispol’zovaniya zvukovogo korpusa “Odin rechevoy den’” v prepodavanii russkogo yazyka kak inostrannogo [Prospects for the use of the sound building “One Speech Day” in teaching Russian as a foreign language]. Mir russkogo slova [The world of the Russian word] 4, 99–103 (2009). (in Russian) 10. Bayeva, E.: O sposobax sociolingvisticheskoj balansirovki ustnogo korpusa [na primere “Odnogo rechevogo dn’a”) [On Means of Sociolinguistic Balancing of a Spoken Corpus (Based on the ORD corpus)]. Vestnik Permskogo universiteta. Rossijskaja i zarubezhnaja filologia [Perm University Herald. Russian and Foreign Philology] 4(28), 48–57 (2014). (in Russian) 11. Ermolova, O.: “Odin Rechevoy Den’” govoryashchego s tochki zreniya pragmatiki [“One speech day” of the speaker from the point of view of pragmatics]. Vestnik Permskogo universiteta. Rossijskaja i zarubezhnaja filologija [Perm University Herald. Russian and Foreign Philology] 3(27), 21–30 (2014). (in Russian) 12. Bogdanova-Beglaryan, N., et al.: Russkiy yazyk povsednevnogo obshcheniya: osobennosti funktsionirovaniya v raznykh sotsial’nykh gruppakh. Kollektivnaja monografija [Russian language of everyday communication: features of functioning in different social groups]. Layka [Laika], Saint-Petersburg (2016). (in Russian) 13. Bogdanova-Beglarian, N., Sherstinova, T., Blinova, O., Martynenko, G.: Linguistic features and sociolinguistic variability in everyday spoken Russian. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 503–511. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_50 14. Rebrova, P.V.: Strukturnyye i lineynyye poryadki v spontannoy rechi (na materiale korpusa « Odin rechevoy den’ »): dis…. mag. lingv [Structural and linear orders in spontaneous speech (on the material of the case “One Speech Day”): the dissertation of the master of linguistics] (typescript). Saint-Petersburg (2014). (in Russian) 15. Bakhtin, M.: Estetika slovesnogo tvorchestva [Aesthetics of verbal creativity]. Iskusstvo [Art], Moscow (1986). (in Russian) 16. Testelets, Y.: Vvedeniye v obshchiy sintaksis [Introduction to the general syntax]. RGGU, Moscow (2001). (in Russian) 17. Kibrik, A., Podlesskaya, V.: Rasskazy o snovidenijakh. Korpusnoe issledovanie ustnogo russkogo diskursa [Stories about dreams. Corpus study of Russian oral discourse]. Yazyki slavyanskikh kul’tur [Languages of Slavic cultures], Moscow (2009). (in Russian) 18. Vinogradov, V.: Izbrannyye trudy. Leksikologiya i leksikografiya [Selected works. Lexicology and lexicography]. Nauka [Art], Moscow (1977). (in Russian) 19. Bogdanova-Beglarian, N., Martynenko, G., Sherstinova, T.: The “One Day of Speech” corpus: phonetic and syntactic studies of everyday spoken Russian. In: Ronzhin, A., et al. (eds.) SPECOM 2015, LNAI, vol. 9319, pp. 429–437. Springer, Switzerland (2015). https:// doi.org/10.1007/978-3-319-23132-7_53
Building Real-Time Speech Recognition Without CMVN

Thai Son Nguyen, Matthias Sperber, Sebastian Stüker, and Alex Waibel
Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]
Abstract. Estimating cepstral mean and variance normalization (CMVN) in run-on and real-time settings poses several challenges. Using a moving average for variance and mean estimation requires a comparatively long history of data from a speaker which is not appropriate for short utterances or conversations. Using a pre-estimated global CMVN for speakers instead reduces the recognition performance due to potential mismatch between training and testing data. This paper investigates how to build a real-time run-on speech recognition system using acoustic features without applying CMVN. We propose a feature extraction architecture which can transform unnormalized log mel features to normalized bottleneck features without using historical data. We empirically show that mean and variance normalization is not critical for training neural networks on speech data. Using the proposed feature extraction, we achieved 4.1% word error rate reduction compared to global CMVN on the Skype conversations test set. We also reveal many cases when features without zero-mean can be learnt well by neural networks which stands in contrast to prior work.
Keywords: Real-time speech recognition · Feature normalization · Neural network

1 Introduction
Cepstral mean and variance normalization (CMVN) [22] and other normalization techniques (e.g., cepstral mean normalization (CMN) [7]) are widely adopted in many neural network speech recognition systems due to several advantages. First, these techniques, as shown in [22], make the recognizer more robust by canceling out environmental changes. Second, they help reduce the environment mismatch (e.g. background noises or microphones) between training and testing conditions. Last, the acoustic features after normalization have zero mean, which is found critical for neural network training [13]. In offline situations, CMVN is usually applied at the utterance level or, more ideally, at the speaker level when many utterances of the same speaker are available. However, these approaches are not appropriate for real-time situations,
because they require a certain amount of history to be available for the current speaker, and cannot handle unexpected speaker changes. Instead, mean and variance can be continuously computed over a moving window of some hundred frames (e.g., 3 s [1,17]). However, moving windows require the availability of historical data of at least a window-size, so that a delay must be introduced to handle the beginning of a new utterance. A third approach, computing mean and variance globally (e.g., [19,26]) for all training and test data, avoids the delay but reduces the recognition performance due to potential data mismatch. CMVN can also be recursively updated in real-time as in [17], but this approach does not handle multiple speakers. Peddinti et al. [14] proposed to use mel-frequency cepstral coefficients (MFCC) without normalization for real-time speech recognition, as currently implemented in the Kaldi toolkit [15]. In their approach, i-vectors [3] which supply the information about the mean offset of the speaker’s data are provided to every input so that the network itself can do feature normalization. However, i-vectors still require a certain amount of data of about 6 s per speaker. In this paper, we investigate and employ feature extraction approaches which exhibit comparable performance to CMVN but do not require speaker historical data and are therefore better suited for real-time situations. Our contributions are summarized as follows: – We contrast different CMVN methods and point out their respective advantages and limitations in a real-time feature extraction setting. We conclude that global CMVN is most desirable regarding real-time properties, although utterance- or speaker-based CMVN yield best recognition accuracy. – We propose to use a two-step transformation method that is empirically shown to transform unnormalized log mel-filterbank (FBANK) features into suitable acoustic model inputs, without requiring historical data of the current speaker. Using this transformation, we show that the acoustic models trained on the new feature domain significantly outperform global CMVN. – We identify a potential mismatch between training and testing data when acoustic models are trained on unnormalized data and propose to use data augmentation as a solution. We empirically show that retraining the feature extraction and the systems on a volume perturbation dataset can avoid the mismatch of audio volume and increase the recognition performance by up to 3.1%. – We also observe and discuss cases when features without zero-mean can be learnt well by neural networks, which stands in contrast to prior work. Without waiting for acoustic features being normalized correctly, a run-on speech recognition using our proposed feature extraction can process utterances of arbitrary length (or shortness). It can also handle situations where multiple speakers are sharing a single microphone.
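The sketch below contrasts two of the normalization schemes discussed above: global CMVN with precomputed corpus statistics versus a causal moving-window CMVN that only looks at past frames. The 300-frame window corresponds to roughly 3 s at a 10 ms frame shift; the feature matrix is a random placeholder.

```python
import numpy as np

def global_cmvn(feats, mean, std):
    # mean/std estimated once over the whole training corpus
    return (feats - mean) / (std + 1e-8)

def moving_window_cmvn(feats, window=300):
    out = np.empty_like(feats)
    for t in range(len(feats)):
        ctx = feats[max(0, t - window + 1):t + 1]          # past frames only
        out[t] = (feats[t] - ctx.mean(axis=0)) / (ctx.std(axis=0) + 1e-8)
    return out

feats = np.random.randn(500, 40) * 2.0 + 5.0               # dummy 40-dim FBANK-like frames
normed = moving_window_cmvn(feats)
```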
2 Improving Real-Time Feature Extraction
We propose performing two steps to learn robust feature extraction for real-time speech recognition systems. First, the traditional mel-filter features are transformed into the LDA domain and then fed into a bottleneck network to obtain final features which are value-normalized and easier to exploit by neural network models. Second, in order to increase the system's robustness, data augmentation is used for retraining both the feature extraction and the network model.
Fig. 1. Real-time feature extraction.
2.1 Using LDA Transformed Features
Without normalization, the most popular acoustic features such as MFCC or FBANK are problematic inputs for neural networks to learn. MFCC features usually span a wide range in every dimension, e.g., [−93, 363] on typical data, while FBANK features only have positive values, e.g. in the range [0, 11.66]. We attempt to find a transformed domain such that the transformation can be performed in real-time. Linear Discriminant Analysis (LDA) [4] is usually used for dimensionality reduction, but here we propose to use it only for feature transformation. Using LDA, we compute a d × d linear transformation matrix which
projects d-dimensional FBANK into a new domain with the same dimensionality. In this LDA domain, the features maintain the class-discriminatory information and can be mapped with their class-separability magnitudes according to the associated eigenvectors and eigenvalues. When used for dimensionality reduction, LDA is applied by keeping only the k (much smaller than d) features with the largest magnitudes. We, however, use all d-dimensional features in our models because we observed better system performance.
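A sketch of this full-rank LDA estimation (not the authors' implementation) is given below: the within- and between-class scatter matrices are built from labeled frames, and all d generalized eigenvectors are kept, so the transform changes the feature space without reducing its dimensionality. The dummy frames and labels are placeholders.

```python
import numpy as np
from scipy.linalg import eigh

def fit_lda_transform(X, y):
    d = X.shape[1]
    mean_total = X.mean(axis=0)
    Sw = np.zeros((d, d))                      # within-class scatter
    Sb = np.zeros((d, d))                      # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_total).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v; keep all d eigenvectors,
    # ordered by decreasing class separability.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1]]     # d x d transformation matrix

X = np.random.randn(2000, 40)                  # dummy FBANK frames
y = np.random.randint(0, 10, size=2000)        # dummy phone-state labels
X_lda = X @ fit_lda_transform(X, y)
```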
2.2 Using Normalized Bottleneck Features
As will be experimentally shown, optimizing single network models on unnormalized data can be hard. To deal with this situation, our idea is to train a first network model for extracting length-normalized features. Later we can use a second network to perform the real classification task. Figure 1 illustrates our proposed feature extraction architecture. The input of the network can be unnormalized FBANK or LDA-transformed features. We employ several rectifier [25] layers on top of the input layer, followed by a narrow (bottleneck) layer of 42 sigmoidal units. The two last layers, which will be discarded after training, include one rectifier layer and the final softmax. Since the training of this feature extraction optimizes phoneme classification, the extracted features at the bottleneck layer are supposed to be significant for class discrimination. When using a sigmoidal activation function, we obtain bottleneck features that are normalized to a small range, which can be handled more easily by the second network. We experimented with two sigmoidal functions: the logistic function, which has a range of [0, 1], and the hyperbolic tangent, which produces features in the range [−1, 1]. Different from [8,24], the proposed feature extraction is able to handle both normalized and unnormalized inputs. It does not suffer from vanishing gradients and does not need pre-training, which significantly reduces the training time. Applying this feature extraction in real-time can be considered as adding more hidden units to the classification network, which linearly increases the computation time (i.e. 25% in our experiments).
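A rough PyTorch sketch of such an extractor is shown below. Only the 42-unit sigmoidal bottleneck, the softmax over 8,000 context-dependent targets and the discarded top layers follow the description above; the number and width of the rectifier layers are assumptions for illustration.

```python
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    def __init__(self, input_dim=462, hidden=1600, bottleneck=42, n_targets=8000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bottleneck), nn.Sigmoid(),   # features normalized to [0, 1]
        )
        # Discarded after training; only used to drive phoneme classification.
        self.head = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, n_targets),                  # softmax applied inside the loss
        )

    def forward(self, x):                                  # training path
        return self.head(self.encoder(x))

    def extract(self, x):                                  # features for the second network
        return self.encoder(x)
```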
2.3 Increasing Robustness by Data Augmentation
As will be explored in this paper, the neural network systems trained on unnormalized features potentially need to deal with environment mismatch between training and testing. In speech recognition, mismatches such as different speech variations, background noises or microphones, can lead to a significant drop of recognition performance. In this paper, we analyze the robustness of our proposed feature extraction against the mismatch of audio volume conditions and improve it with the help of data augmentation. Data augmentation applied to speech recognition has been explored in many studies. In [10,16], corrupting clean training speech with noise improved the speech recognizer against noisy speech. Using vocal tract length perturbation [11] has shown gains on TIMIT. In [12,14], training with speed and volume perturbations datasets increased the system performance on several LVCSR tasks.
In this paper, we only consider data augmentation by performing volume perturbation.
3 Experimental Setup

3.1 Training and Test Data
In our experiments we used a large training set of 460 h. This dataset is the result of combining the TED-LIUM [18], Quaero [21] and Broadcast News [9] corpora. Our three evaluation sets include the TED-LIUM test set, tst2013 from the IWSLT evaluation campaign [2] and the English set from the MSLT corpus [5], which contains conversations over Skype. The volume perturbations were done as suggested by [14], where each recording was scaled with a random variable using sox. We set the random variable within the range [0.2, 2] for all recordings in the training data set. The perturbed recordings were then added to the original training set to form the augmented dataset. To investigate the robustness against volume mismatch, we used the ranges [0.2, 0.6] and [1.6, 2.0] for all recordings of the tst2013 set to create a perturbed test set.
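A minimal sketch of this perturbation step, assuming sox is installed and using placeholder file paths, is shown below; each recording gets one scaled copy with a gain drawn from [0.2, 2].

```python
import random
import subprocess

def perturb_volume(in_wav, out_wav, lo=0.2, hi=2.0):
    factor = random.uniform(lo, hi)
    # "sox -v FACTOR in.wav out.wav" scales the input amplitude by FACTOR
    subprocess.run(["sox", "-v", str(factor), in_wav, out_wav], check=True)
    return factor

perturb_volume("train/utt0001.wav", "train_vp/utt0001.wav")
```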
3.2 System Description
All the network models used roughly the same number of input features (i.e., 440 FBANK and 462 LDA or bottleneck features) and were trained using the cross-entropy loss function to predict 8,000 context-dependent phonemes. Rectifier networks were constructed of 6 hidden layers with 1,600 units per layer. For sigmoidal networks, we used 5 hidden layers of 2,000 units and performed pre-training with denoising auto-encoders [23]. For our convolutional neural network (CNN), we used the best architecture from [20], which includes two convolutional layers of 256 hidden units with filter size 9 and a max pooling size of 3, followed by 4 fully connected layers with 1,024 units. However, we did not use delta and delta-delta features, for consistent comparisons between models. The tests were performed with the Janus Recognition Toolkit (JRTK) [6] with a 4-gram language model and a vocabulary of more than 150,000 words.
4 Results

4.1 Using Normalized and Unnormalized Features
In Table 1, we compare the systems using different CMVN methods against various systems trained on unnormalized FBANK features. On our training data, CMVN system performance depends on the amount of available speaker historical data. Normalization at the speaker level yielded the best performance, followed by utterance-level normalization and normalization with windows 300 frames in length. The results on the perturbed test set show the interesting fact that these normalizations produce features robust to changes in audio volume.
Global CMVN is less optimal than the other normalizations (7.1% rel. increase in WER compared to speaker level). However, a real-time system may have to adopt this method in order to achieve acceptable latency. For the normalized features, the gap between sigmoidal and rectifier [25] networks appears small. However, when using the features without normalization, which have only positive values in a large range [0, 11.66], optimizing sigmoidal networks for good convergence becomes difficult. We had to reduce the initial learning rate by a factor of ten compared to normalized features. The training then converged at a poor local minimum and caused worse classification performance. The situation changed with the rectifier network. We were able to keep the same learning rate and the training converged with the same pattern. However, it suffers from a 7.3% rel. increase in WER compared to global CMVN. Switching to a CNN gave further improvements; however, its result is still not as good as that of the CMVN systems. These results demonstrate the difficulties when training single network models on unnormalized FBANK features. The increase in WER of the systems using unnormalized features and globally normalized features on the perturbed test set indicates that they may be sensitive to volume mismatch between training and test data.

Table 1. Word error rates of various systems using 40 log mel-filter bank features with and without CMVN.
CMVN | Network Type | tst2013 | tst2013-vp
Speaker | sigmoid | 15.5 | 15.5
Utterance | sigmoid | 15.8 | 15.8
Window | sigmoid | 16.2 | 16.4
Global | sigmoid | 16.6 | 17.3
Global | rectifier | 16.5 | 17.1
none | sigmoid | 22.3 | 23.2
none | rectifier | 17.7 | 18.0
none | rectifier (CNN) | 17.1 | 17.6

4.2 Using LDA Transformed Features
Table 2 compares the efficiency of different LDA transformations applied to unnormalized features. The conventional approach, which reduces the dimensionality of the 440 features of 11 consecutive frames down to 42 and then again stacks 11 frames, does not show clear improvements. When transforming 40 FBANK features without reduction and stacking 11 adjacent frames of LDA features as the network input, the systems improved. Further improvement was achieved when transforming 440 features of 11 consecutive frames via LDA and using
them as the network input. Interestingly, the transformed features, which are in the range [−14.95, 14.50] and do not have zero mean, are better than FBANK with global CMVN. When applying global mean and variance normalization again on these LDA features, the performance even got worse, showing that the normalization is unnecessary for this training data. The large degradation of performance on the perturbed test set (5.4% rel. in WER) indicates the need for a method of making LDA features robust against possible environment mismatch.

Table 2. The systems with LDA features.

LDA Feature | CMVN | DNN | tst2013 | tst2013-vp
Reduction | none | rectifier | 17.5 | 17.8
Full-40 | none | rectifier | 16.8 | 17.4
Full-440 | none | rectifier | 16.2 | 17.0
Full-440 | Global | rectifier | 16.5 | 17.2
Full-440 | none | sigmoid | 16.8 | 17.7

4.3 Using Normalized Bottleneck Features
The proposed bottleneck feature extraction shows its advantages when applied to both unnormalized FBANK and LDA features and produces improved features. The same networks trained on the bottleneck features showed relative reductions of 7.4% and 4.9%, as shown in Table 3. The extracted bottleneck features are in a normalized range of [0, 1] or [−1, 1], so a sigmoid network can be trained well, showing again that we do not need to apply mean normalization. When evaluating against the mismatched test set, we found that the extracted features are more stable to speech variations, indicating that the normalized bottleneck network may be automatically forced to learn robust features.

Table 3. The system with normalized bottleneck (BN).

Feature | BN Type | DNN | tst2013 | tst2013-vp
FBANK | sigmoid | rectifier | 16.4 | 16.6
FBANK | sigmoid | sigmoid | 16.5 | 16.8
LDA | sigmoid | rectifier | 15.5 | 15.8
LDA | tanh | rectifier | 15.5 | 15.8
4.4 Using Data Augmentation
When retraining the feature extraction and the systems on the augmented dataset, we obtained improvements on both test sets, as presented in Table 4. Now there are only small gaps between the two test sets, indicating the robustness of the models and the effectiveness of the proposed data augmentation. Retraining improves the recognition performance in general (i.e. 3.1% rel. for the bottleneck system using FBANK). We could only achieve small gains for the systems using LDA features. This can be the case because we only retrained the feature extractions and the systems without re-estimating the LDA transformation.

Table 4. The systems trained with data augmentation.
Feature | tst2013 | tst2013-vp
FBANK | 17.3 (2.3%) | 17.4 (3.3%)
LDA | 16.1 (0.6%) | 16.3 (4.7%)
BN-FBANK | 15.9 (3.1%) | 15.9 (4.2%)
BN-LDA | 15.4 (0.7%) | 15.6 (1.3%)

4.5 Comparison on Different Test Sets
Table 5 compares the results of our systems on two different test sets. The TED-LIUM set contains 11 TED talks, while the MSLT set is a collection of 3,000 utterances of recorded Skype conversations. There is no speaker information for the MSLT set, and more than half of the utterances are shorter than 3 s. In these different online domains, our proposed feature extraction can reduce the WER by 12.1% relative compared to global CMVN. Compared to another system of the same complexity which uses the bottleneck architecture from [8], we also achieve a significant improvement.

Table 5. Results on the TED-LIUM and MSLT test sets.
Feature | CMVN | TED-LIUM | MSLT2016
FBANK | Global | 9.8 | 33.9
BN [8] | Global | 9.2 | 30.9
FBANK | none | 10.3 | 35.0
BN-LDA | none | 8.7 | 29.8
5 Conclusions
We have presented a novel and effective feature extraction for real-time and run-on speech recognition. Our proposed two-step transformation is able to transform unnormalized log mel-filterbank features into useful value-normalized features. These features can be used directly for neural networks or Gaussian mixture models without further normalization. Applying this feature extraction approach removes the need for explicit normalization such as CMVN. Other real-time speech applications (such as speaker recognition) can also benefit from our method.
References 1. Alam, M.J., Ouellet, P., Kenny, P., O’Shaughnessy, D.: Comparative evaluation of feature normalization techniques for speaker verification. In: Travieso-Gonz´ alez, C.M., Alonso-Hern´ andez, J.B. (eds.) NOLISP 2011. LNCS (LNAI), vol. 7015, pp. 246–253. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-250200 32 2. Cettolo, M., Niehues, J., St¨ uker, S., Bentivogli, L., Frederico, M.: Report on the 10th IWSLT evaluation campaign. In: The International Workshop on Spoken Language Translation (IWSLT) 2013 (2013) 3. Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011) 4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience (2000) 5. Federmann, C., Lewis, W.D.: Microsoft speech language translation (MSLT) corpus: the IWSLT 2016 release for English, French and German. In: The International Workshop on Spoken Language Translation (IWSLT) 2016 (2016) 6. Finke, M., Geutner, P., Hild, H., Kemp, T., Ries, K.R., Westphal, M.: The karlsruhe VERBMOBIL speech recognition engine. In: Proceedings of ICASSP (1997) 7. Furui, S.: Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29(2), 254–272 (1981) 8. Gehring, J., Miao, Y., Metze, F., Waibel, A.: Extracting deep bottleneck features using stacked auto-encoders. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3377–3381. IEEE (2013) 9. Graff, D.: The 1996 broadcast news speech and language-model corpus. In: Proceedings of the DARPA Workshop on Spoken Language Technology (1997) 10. Hannun, A., et al.: Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014) 11. Jaitly, N., Hinton, G.E.: Vocal tract length perturbation (VTLP) improves speech recognition. In: Proceedings of ICML Workshop on Deep Learning for Audio, Speech and Language (2013) 12. Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for speech recognition. In: INTERSPEECH, pp. 3586–3589 (2015) 13. LeCun, Y.A., Bottou, L., Orr, G.B., M¨ uller, K.-R.: Efficient backprop. In: Montavon, G., Orr, G.B., M¨ uller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 9–48. Springer, Heidelberg (2012). https://doi.org/ 10.1007/978-3-642-35289-8 3
14. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: INTERSPEECH, pp. 3214– 3218 (2015) 15. Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011 16. Prisyach, T., Mendelev, V., Ubskiy, D.: Data augmentation for training of noise robust acoustic models. In: Ignatov, D.I., et al. (eds.) AIST 2016. CCIS, vol. 661, pp. 17–25. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52920-2 2 17. Pujol, P., Macho, D., Nadeu, C.: On real-time mean-and-variance normalization of speech recognition features. In: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings, vol. 1, p. I. IEEE (2006) 18. Rousseau, A., Del´eglise, P., Est`eve, Y.: Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In: Proceedings of LREC (2014) 19. Sainath, T.N., Kingsbury, B., Mohamed, A.R., Ramabhadran, B.: Learning filter banks within a deep neural network framework. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 297–302. IEEE (2013) 20. Sainath, T.N., Mohamed, A.R., Kingsbury, B., Ramabhadran, B.: Deep convolutional neural networks for LVCSR. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614–8618. IEEE (2013) 21. St¨ uker, S., Kilgour, K., Kraft, F.: Quaero 2010 speech-to-text evaluation systems. In: Nagel, W., Kr¨ oner, D., Resch, M. (eds.) High Performance Computing in Science and Engineering ’11. Springer, Heidelberg (2012). https://doi.org/10.1007/ 978-3-642-23869-7 44 22. Viikki, O., Laurila, K.: Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Commun. 25(1), 133–147 (1998) 23. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: The 25th International Conference on Machine Learning, pp. 1096–1103. ACM (2008) 24. Yu, D., Seltzer, M.L.: Improved bottleneck features using pretrained deep neural networks. In: Interspeech, vol. 237, p. 240 (2011) 25. Zeiler, M.D., et al.: On rectified linear units for speech processing. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3517–3521. IEEE (2013) 26. Zeyer, A., Schl¨ uter, R., Ney, H.: Towards online-recognition with deep bidirectional LSTM acoustic models. In: Interspeech 2016, pp. 3424–3428 (2016)
Choice of Signal Short-Term Energy Parameter for Assessing Speech Intelligibility in the Process of Speech Rehabilitation

Dariya Novokhrestova, Evgeny Kostyuchenko, and Roman Meshcheryakov
Tomsk State University of Control Systems and Radioelectronics, Tomsk, Russia
[email protected], [email protected]
http://www.tusur.ru
Abstract. The article describes an approach to assessing the intelligibility of speech in the process of speech rehabilitation by finding a measure of the similarity between the standard and the distorted pronunciation of phonemes. The approach is based on the calculation of the correlation coefficient between the transformed signal envelopes. The envelope of the signal is constructed on the basis of the short-term energy of the signal. The selection of the short-term energy parameter (window size) is also described. The parameter selection is based on comparing the differences between the correlation coefficients for pairs with normal pronunciation and pairs with distorted pronunciation, calculated for different window sizes. Window sizes are selected for each problem phoneme.

Keywords: Correlation · Cancer of the oral cavity and oropharynx · Speech quality criteria

1 Introduction
The urgency of developing rehabilitation techniques after surgical treatment of oncological diseases of the organs of the speech-forming apparatus is confirmed by the growing number of cases detected every year. In 2016, about 25,000 new cases were identified, and the total number of patients with cancer of this location is currently more than 100,000 [1,2]. Currently, rehabilitation is carried out according to GOST R 50840-95 [3], whose shortcoming is the subjective assessment of speech intelligibility. Within the development of new methods of speech rehabilitation, one of the important tasks is the development of an automated system for assessing patients' speech intelligibility that would avoid subjective evaluation. Such an evaluation may be obtained by comparing the patient's reference pronunciation of syllables with the evaluated one, and the dynamics of rehabilitation based on comparing
the estimates of different sessions is of interest. The reference is the speech of the same patient before surgery. In the previous stages of the study, an approach for the formation of such an estimate based on the correlation coefficient was described [4]. In [5], an approach was proposed for automatic time normalization (in view of the impossibility of comparing two signals of different lengths) on the basis of a dynamic transformation of the time scale (dynamic time warping - DTW), and an attempt to apply smoothing to the energy of the signal was made. This article describes an approach to the evaluation of speech intelligibility through the calculation of the correlation coefficient using the algorithm of the dynamic time warping applied to the envelopes of signals and the choice of the envelope parameter. As the envelope, the short-term energy values found are used. The correlation value is analysed depending on the window size when calculating the short-term energy.
2 Description of the Algorithm

2.1 Description of the Input Data
At this stage of the study, the intelligibility assessment was carried out only for the problematic phonemes defined earlier [6]. For this, 3 sets of records were made by a healthy speaker: 2 sets of records with normal pronunciation of the phonemes in syllables (normal phonemes) and 1 set of recordings with pronunciation of the problematic phonemes without using the tongue (modified phonemes). Each set contains 90 records of different syllables: 6 problem phonemes [k], [k'], [s], [s'], [t], [t'], with 15 syllables for each (5 syllables with the phoneme at the beginning of the syllable, 5 syllables with the phoneme in the middle of the syllable, 5 syllables with the phoneme at the end of the syllable); a complete list of phonemes is also presented in [6].
2.2 Short-Term Energy
The applicability of short-term energy in speech analysis is described in [7]. Short-term energy characterizes the signal energy within a window of size N and is defined as

E_T = \sum_{n=T}^{T+N-1} (s(n) \cdot w(n))^2    (1)
where s(n) is the amplitude of the signal at sample n and w(n) is the window function. This algorithm uses a rectangular window, that is, w(n) = 1. The overlap between consecutive windows is N − 1.
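For illustration, a minimal Python sketch of Eq. (1) is given below; it assumes the signal is already loaded as an array of amplitude values, and the function name is ours, not the authors'.

```python
import numpy as np

def short_term_energy(signal, window_size):
    """Short-term energy E_T of Eq. (1): rectangular window w(n) = 1,
    shifted by one sample so that consecutive windows overlap by N - 1."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) - window_size + 1
    if n_frames <= 0:
        raise ValueError("signal is shorter than the analysis window")
    return np.array([np.sum(signal[t:t + window_size] ** 2) for t in range(n_frames)])
```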
2.3 Description of the Sequence of Actions
The algorithm can be described as a sequence of steps.
1. Segmentation of the phoneme in the syllable. In each recording, the problematic phoneme was singled out; the segmentation was carried out manually by listening and by analysing the oscillogram and the spectrogram of the signal. The phonemes were cut into separate sound files, and further work was carried out with these files.
2. Transformation of the phoneme recordings into sequences of signal amplitude values.
3. Finding the values of the signal short-term energy with window size N for the three realizations of the phoneme in the syllable (phonemes from the same syllable taken from the different sets of recordings).
4. Application of the DTW algorithm to each pair of the obtained arrays of short-term energy values. The DTW algorithm itself is described in [8].
5. Finding the correlation coefficients between the transformed short-term energy sequences: R_{nj} (the correlation coefficient for the normal pronunciations of the phoneme from the j-th syllable), and R1_{ij} and R2_{ij} (the correlation coefficients for the normal pronunciation - distorted pronunciation pairs for the phoneme from the j-th syllable). Steps 2-5 are repeated for each phoneme with the same location in the syllable (for example, the phoneme [s] at the beginning of the syllable), so that 5 values of R_{nj}, R1_{ij} and R2_{ij} are found for each location of the phoneme in the syllables (a sketch of steps 3-7 is given after this list).
6. Finding the average values R_n (2), R1_i (3) and R2_i (4), that is, the average value for the normal-normal pairs and the average values for the distorted-normal pairs:

R_n = \frac{1}{5} \sum_{j=1}^{5} R_{nj}     (2)

R1_i = \frac{1}{5} \sum_{j=1}^{5} R1_{ij}     (3)

R2_i = \frac{1}{5} \sum_{j=1}^{5} R2_{ij}     (4)

7. Finding d, the ratio of the arithmetic mean of the average correlation coefficients for the pairs with changed pronunciation to the average correlation coefficient for normal pronunciation:

d = \frac{R1_i + R2_i}{2 R_n}     (5)

Steps 2-7 are repeated for window sizes N from 10 to 300 for the phonemes [k], [k’], [s] and [s’] and from 10 to 250 for the phonemes [t] and [t’], in increments of 10. The window size is usually taken as 10-20 ms, depending on the data analysed. At a sample rate of 16000 Hz this corresponds to 160-320 samples, but since the duration of the phonemes [t] and [t’] is 0.04-0.06 s for normal pronunciation, a window longer than 15 ms strongly distorts the signal data for them.
8. Construction of the approximating quadratic function f(N) by the method of least squares [9] to analyse the dependence of the values of d on the window size N.
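A rough sketch of steps 3-7 for a single phoneme location is shown below. It re-uses the short_term_energy helper sketched in Sect. 2.2 and employs the fastdtw package only as a convenient stand-in for the DTW algorithm of [8]; the input lists of recordings and all names are assumptions of ours.

```python
import numpy as np
from fastdtw import fastdtw  # stand-in for the DTW algorithm of [8]

def aligned_correlation(env_a, env_b):
    """DTW-align two short-term energy envelopes (step 4) and return the
    Pearson correlation coefficient of the warped sequences (step 5)."""
    _, path = fastdtw(env_a, env_b, dist=lambda a, b: abs(a - b))
    a = np.array([env_a[i] for i, _ in path])
    b = np.array([env_b[j] for _, j in path])
    return np.corrcoef(a, b)[0, 1]

def ratio_d(normal_set_1, normal_set_2, distorted_set, window_size):
    """Steps 3-7 for one phoneme location: each argument is a list of five
    phoneme recordings (amplitude arrays), one per syllable."""
    r_n, r_1, r_2 = [], [], []
    for s1, s2, sd in zip(normal_set_1, normal_set_2, distorted_set):
        e1 = short_term_energy(s1, window_size)
        e2 = short_term_energy(s2, window_size)
        ed = short_term_energy(sd, window_size)
        r_n.append(aligned_correlation(e1, e2))  # normal-normal pair, R_nj
        r_1.append(aligned_correlation(e1, ed))  # normal-distorted pairs, R1_ij
        r_2.append(aligned_correlation(e2, ed))  # R2_ij
    # Eqs. (2)-(5): averages of the pair correlations and the ratio d
    return (np.mean(r_1) + np.mean(r_2)) / (2 * np.mean(r_n))
```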
3 Results and Discussion
The average values of the correlation coefficients for each of the pairs (R_n, R1_i, R2_i), as well as the ratio d of the arithmetic mean of the correlation coefficients for the distorted-normal pairs to the mean correlation coefficient between normal phonemes, were plotted against the window size of the short-term energy for each location of the problematic phonemes in the syllable. Also, for this ratio, an approximating quadratic function was determined by the least squares method [9] (a sketch of this fit is given below). Let us consider each of the phonemes in turn. Since similar results are obtained for the hard and soft realizations of the phonemes, these realizations are considered and analysed together. Those window sizes N for which the value of d is less than the value of d for N = 1 will be considered suitable window sizes for constructing the signal envelope. In the graphs below, the average correlation coefficients are plotted along the main axis (left scale), and the values of d and the approximating function f(N) are plotted along the auxiliary axis (right scale).
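A possible realisation of the least-squares fit of step 8 with NumPy is sketched below; the grid of window sizes and the corresponding d values are assumed to have been computed beforehand (e.g. with the ratio_d sketch above).

```python
import numpy as np

def fit_quadratic(window_sizes, d_values):
    """Least-squares quadratic approximation f(N) of the d(N) dependence (step 8)."""
    coeffs = np.polyfit(window_sizes, d_values, deg=2)
    f = np.poly1d(coeffs)                              # approximating function f(N)
    n_best = window_sizes[np.argmin(f(window_sizes))]  # grid point minimising f
    return f, n_best
```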
3.1 Phonemes [k] and [k’]
Figure 1 shows the resulting values for different locations of the phonemes [k] and [k’] in the syllables. For the phoneme [k] at the beginning of the syllable, a significant decrease in the values of d with increasing window size N is observed. However, with a window size greater than 150, one of the average correlation coefficients for a distorted pronunciation - normal pronunciation pair becomes negative, which indicates an inverse relationship that is difficult to interpret within the scope of the problem of assessing the quality of pronunciation of syllables. For the phoneme [k’] at the beginning of the syllable, according to the approximating function, the minimum values of d are attained at the minimum and maximum values of the window size N. However, looking at the values of d themselves, they are less than d for N = 1 only at N from 260 upwards. The minimum value of the approximating function at the beginning of the syllable is at N = 300 both for the phoneme [k] and for the phoneme [k’]. When the phonemes [k] and [k’] are located in the middle of the syllable, similar patterns are observed: with an increase in the size of the short-term energy window, the value of d decreases. All values of d are less than 1, but for N from 10 to 60 for the phoneme [k] and for N from 10 to 20 for the phoneme [k’], the values of d are greater than the initial value of d for N = 1. For the phoneme [k’] with a window size N greater than 150, one of the correlation coefficients for the distorted-normal pairs becomes negative. The minimum value of the approximating function in the middle of the syllable is at N = 300 both for the phoneme [k] and for the phoneme [k’]. When the phonemes [k] and [k’] are located at the end of the syllable, similar patterns are also observed. From the form of the approximating function it can be said that, as the window size increases, the value of d decreases. The values of d are less than the original value for N from 190 for the phoneme [k] and
for N from 200 for the phoneme [k’]. The minimum value of the approximating function at the end of the syllable is at N = 300 for both [k] and [k’]. Based on the obtained results, it can be concluded that for the phonemes [k] and [k’], when constructing an envelope based on short-term energy, a window size N of about 20 ms, i.e. about 300 samples, can be selected.
Fig. 1. Results for phonemes [k] and [k’] - (a), (b), (c) and (d), (e), (f) respectively (a, d - phoneme at the beginning of the syllable; b, e - phoneme in the middle of the syllable; c, f - phoneme at the end of the syllable). x - window size, y - result quality estimation (0-1).
3.2 Phonemes [s] and [s’]
Figure 2 shows the resulting values for different locations of the phonemes [s] and [s’] in the syllables. Even though these phonemes were also combined for the analysis, similar values were obtained only for the location of the phoneme at the end of the syllable. If the phoneme [s] is located at the beginning of the syllable, judging by the form of the approximating function and the obtained values of d, the minimum values are attained at N from 120 to 150, and the minimum of the function is observed at N = 135. Also, with N greater than 270, there is a sharp decrease in the correlation coefficient for the pair of normal phoneme pronunciations. If the phoneme [s’] is positioned at the beginning of the syllable, the value of d
increases with increasing window size. For this arrangement of the phoneme, constructing the envelope on the basis of short-term energy will not lead to an increase in the difference between the normal and distorted pronunciation of the phoneme. The minimum of the approximating function f(N) on the investigated segment is observed at N = 1. Constructing the envelope based on short-term energy for the phoneme [s] in the middle of the syllable also does not lead to an improvement in the results, since the value of d also increases with increasing window size. The minimum of the approximating function is observed at N = 7. For the phoneme [s’] in the middle of the syllable, on the contrary, the value of d decreases as the window size increases, and all the obtained values of d are less than d for N = 1. The minimum of the approximating function on the investigated segment is observed at N = 300. For the phonemes [s] and [s’] at the end of the syllable, no obvious decrease in the values of d is observed as the short-term energy window is increased. According to the approximating function, the minimum values of d for the phoneme [s] and for the phoneme [s’] are at N = 165 and N = 134, respectively.
Fig. 2. Results for phonemes [s] and [s’] - (a), (b), (c) and (d), (e), (f) respectively (a, d - phoneme at the beginning of the syllable; b, e - phoneme in the middle of the syllable; c, f - phoneme at the end of the syllable).
3.3 Phonemes [t] and [t’]
Figure 3 shows the resulting values for different locations of the phonemes [t] and [t’] in the syllables. When the phonemes are located at the beginning of the syllable, in spite of the different shapes of the approximating functions, the values of d are less than the value of d for N = 1 for all window sizes. For the phoneme [t], the minimum of the approximating function, as well as the minimum values of d, are observed at N = 151. For the phoneme [t’], although the minimum of the approximating function is at N = 1, the minimum values of d are observed at N from 30 to 50. When the phoneme [t] is located in the middle of the syllable, the values of d decrease with increasing window size; however, after the minimum of the function at N = 65 is reached, the values begin to increase. Still, all the obtained values are less than the value of d for N = 1. The minimum values of d are observed at N equal to 10, 20, 50 and 60. For the phoneme [t’] in the middle of the syllable, all d values are greater than the original value, despite the form of the approximating function; the minimum of the function is observed at N = 300. When the phonemes [t] and [t’] are located at the end of the syllable, the minimum values are observed at N = 1, and the value of d increases as the window size increases.
Fig. 3. Results for phonemes [t] and [t’] - (a), (b), (c) and (d), (e), (f) respectively (a, d - phoneme at the beginning of the syllable, b, e - phoneme in the middle of the syllable, c, f - phoneme at the end of the syllable).
4 Conclusion
After analysing the obtained data, one can conclude that it is impossible to select a single parameter for all problem phonemes for constructing an envelope based on short-term energy. For the phonemes [k] and [k’], the maximum difference between the pair with normal pronunciation and the pairs with changed pronunciation (i.e. the minimum value of d) is achieved with a window size of 20 ms. For the phonemes [s], [s’], [t] and [t’], the window sizes for which the minimum values of d, reflecting the similarity of the normal and changed pronunciation, are achieved were also determined. For the phoneme [s] at the beginning of the syllable, the window size with the minimum value is 135; for the same phoneme in the middle and at the end of the syllable, the window sizes are 1 and 165, respectively. For the phoneme [s’] at the beginning, middle and end of the syllable, the window sizes are 1, 300 and 134, respectively. For the phonemes [t] and [t’] at the end of the syllable, the best result is obtained with the window size N = 1. The window size for [t] at the beginning of the syllable is N = 151 and in the middle N = 65; for [t’] at the beginning and in the middle of the syllable, N = 40 and N = 300, respectively. An approach to the evaluation of speech intelligibility based on the automatic calculation of the correlation coefficient between the transformed short-term energy envelopes of recordings with normal and distorted phoneme pronunciation has been described. This approach can be applied in the process of speech rehabilitation after surgical treatment of oncological diseases of the speech-forming apparatus. In the next step of the work, a combination of DTW and deep learning will be analysed for segmentation and recognition of syllables in the speech rehabilitation task [10].
Acknowledgements. Supported by a grant from the Russian Science Foundation (project 16-15-00038).
References
1. Kaprin, A.D., Starinskiy, V.V., Petrova, G.V.: Status of cancer care of the population of Russia in 2016. P.A. Hertsen Moscow Oncology Research Center - branch of FSBI NMRRC of the Ministry of Health of Russia, Moscow (2018)
2. Kaprin, A.D., Starinskiy, V.V., Petrova, G.V.: Malignancies in Russia in 2014 (morbidity and mortality). P.A. Hertsen Moscow Oncology Research Center - branch of FSBI NMRRC of the Ministry of Health of Russia, Moscow (2017)
3. Standard GOST R 50840-95: Voice over paths of communication. Methods for assessing the quality, legibility and recognition. Publishing Standards, Moscow (1995)
4. Kostyuchenko, E., Meshcheryakov, R., Ignatieva, D., Pyatkov, A., Choynzonov, E., Balatskaya, L.: Correlation normalization of syllables and comparative evaluation of pronunciation quality in speech rehabilitation. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 262-271. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_25
5. Novokhrestova, D.: Time normalization of syllables with the dynamic time warping algorithm in assessing of syllables pronunciation quality when speaking. Proc. TUSUR 4(20), 142-145 (2017). https://doi.org/10.21293/1818-0442-2017-20-4-142-145
6. Kostyuchenko, E., Ignatieva, D., Meshcheryakov, R., Pyatkov, A., Choynzonov, E., Balatskaya, L.: Model of system quality assessment pronouncing phonemes. In: Dynamics of Systems, Mechanisms and Machines (Dynamics), Omsk (2016). https://doi.org/10.1109/Dynamics.2016.7819016
7. Bachu, R., Kopparthi, S., Adapa, B., Barkana, B.: Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy. In: Elleithy, K. (ed.) Advanced Techniques in Computing Sciences and Software Engineering, pp. 279-282. Springer, Dordrecht (2010). https://doi.org/10.1007/978-90-481-3660-5_47
8. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 43-49 (1978). https://doi.org/10.1109/tassp.1978.1163055
9. Legendre, A.-M.: New Methods for the Determination of the Orbits of Comets. F. Didot, Paris (1805)
10. Kipyatkova, I.S., Karpov, A.A.: Variants of deep artificial neural networks for speech recognition systems. Proc. SPIIRAS 49(6), 80-103 (2016). https://doi.org/10.15622/sp.49.5
The Benefit of Document Embedding in Unsupervised Document Classification
Jaromír Novotný and Pavel Ircing
Faculty of Applied Sciences, Department of Cybernetics, The University of West Bohemia, Plzeň, Czech Republic
{fallout7,ircing}@kky.zcu.cz
http://www.kky.zcu.cz/en
Abstract. The aim of this article is to show that the document embedding using the doc2vec algorithm can substantially improve the performance of the standard method for unsupervised document classification – the K-means clustering. We have performed a rather extensive set of experiments on one English and two Czech datasets and the results suggest that representing the documents using vectors generated by the doc2vec algorithm brings a consistent improvement across languages and datasets. The English dataset – 20NewsGroups – was processed in a way that allows direct comparison with the results of both supervised and unsupervised algorithms published previously. Such a comparison is provided in the paper, together with the results of supervised classification achieved by the state-of-the-art SVM classifier.
Keywords: Document embedding · Doc2vec · Classification · K-means · SVM
1 Introduction
It is generally accepted that even such a simple unsupervised algorithm as the classic K-means achieves surprisingly good classification results, if it is presented with appropriate feature vectors. Our previous research [8] confirmed that the well-established tf-idf vectors work rather well. The aim of the work presented in this paper was to test whether the recently introduced document embeddings produced by the doc2vec method [2,4,15] can further improve the performance.
2 Datasets
As our basic dataset, we have again picked the 20NewsGroups English corpus, which is widely used as a benchmark for document classification [1,3,5,7,8,11,12]. (This data set can be found at http://qwone.com/~jason/20Newsgroups/ and was originally collected by Ken Lang.)
It contains 20 000 text documents which are evenly divided into 20 categories, each containing discussions about a specific topic. The second data set, CNO, and all its sub-sets are in the Czech language. It contains approximately 68 000 articles divided into 31 categories (it was created from a database of news articles downloaded from http://www.ceskenoviny.cz/ at the University of West Bohemia and constitutes only a small fraction of the entire database – the description of the full database can be found in [14]). This corpus was created so that it is comparable to the English data set at least in size and partially also in topics. The third group of data sets – TC and Large TC – consists of transcriptions of phone calls from the Language Consulting Center (LCC) of the Czech Language Institute of the Academy of Sciences of the Czech Republic, which provides a unique language consultancy service in matters of the Czech language. The counselors of the LCC answer questions regarding Czech language problems on a telephone line open to public calls. The data gathered from these language queries are unique in several aspects. The Language Consulting Center deals with completely new language material, so it is the only source of advice for new language problems. It also records peripheral matters that will never be explained in dictionaries and grammar books, as these are focused on the core of the language system. In order to compare our results with the ones published previously, we have re-created two subdivisions of the 20NewsGroups corpus. The first one is created according to [12] and was also used in our previous work [8], where it is described in more detail. The other subdivision is created in order to compare the results with the experiments described in [1,3]. The 20NG1 data sub-set consists of the 5 new categories (according to [1]) created by joining original ones as follows: Motorcycle – Motorcycle and Autos; Hardware – Windows and MAC; Sports – Baseball and Hockey; Graphics – Computer graphics; Religion – Christianity, Atheism and misc. Furthermore, they divided this sub-set into three training and testing data sets, using [50, 200, 350] documents as test data and the rest as training data. The 20NG2 input is the whole unchanged 20NewsGroups corpus divided into training (13 000 documents) and testing (approximately 7 000 documents) data (the same division as in [3]). The results achieved on the CNO and TC sets and sub-sets cannot be directly compared with the results of other research teams as the data are not (yet) made publicly available. However, these data are important for our own research and we decided to publish the results here to show some important properties of the doc2vec embedding (see the discussion below). From the first Czech data set – CNO – we have created the following subsets:
– Set CNO consists of all 31 original categories. This results in approximately 68 000 documents in total.
– Set RCNO1 consists of the 11 original categories which contain at least 1000 documents.
– Set RCNO2 consists of the 10 original categories containing between 500 and 1500 documents.
– Set RCNO3 is created from 12 categories, each containing 1000 documents randomly chosen from the original categories. This set is created so as to be similar to the 20NewsGroups corpus.
The TC and Large TC data sets were created from a corpus obtained by the LCC. These data sets consist of 607 manually transcribed parts of historical mono phone calls (each call can contain more than one part, each part with different questions about a different topic) and 3128 parts of recent stereo phone calls transcribed automatically by an ASR system created by colleagues at the University of West Bohemia, all divided into 20 categories by their topic. These 20 categories were manually assigned by counselors from the LCC (for example "semantics" or "lexicology") and correspond to the higher level of the linguistic topic tree. The division of phone calls into categories is not uniform; some categories contain only a few parts. The setting is based on previous findings. TC consists of the mentioned 20 categories containing 3713 transcribed text parts of the phone calls. Some of the categories are formed from a small number of texts (for example only 10); we responded to that by creating the Large TC data consisting of 10 original categories (3343 transcribed text parts) where each contains at least 100 text parts.
3 Preprocessing
The first processing step applies only to the 20NewsGroups data, where we removed all the headers except for the Subject. Then all uppercase characters were lowercased and all digits were replaced by one universal symbol. As the next processing step, we wanted to conflate different morphological forms of a given word into one representation; we opted for lemmatization. The MorphoDiTa [13] tool was picked for the task – it works for both English and Czech and is available as a Python package (ufal.morphodita at https://pypi.python.org/pypi/ufal.morphodita). A data-driven variant of the traditional stop-word removal is a further preprocessing operation performed in this paper: only the top T lemmas with the highest mutual information (MI) are kept. After applying all these processing steps, we can create the following vector representations.
3.1 Representation by TF-IDF Weights
A common representation in text processing tasks is TF-IDF weights – i.e. the combination of Term Frequency (TF) and Inverse Document Frequency (IDF) weights. The well-known formula for computing the TF-IDF weight w_{l,d} for a lemma l ∈ L and a document d ∈ D is

w_{l,d} = tf_{l,d} \cdot idf_l     (1)

where tf_{l,d} denotes the number of times the lemma l occurs in document d and idf_l is computed using the formula

idf_l = \frac{N}{N(l)}     (2)

where N is the total number of documents and N(l) denotes the number of documents containing the lemma l. In essentially all further experiments we use the Python package sklearn [9] (more precisely, its TfidfVectorizer module) for computing the TF-IDF weights.
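A minimal sketch of this step is shown below. Note that sklearn's TfidfVectorizer applies a smoothed logarithmic IDF by default rather than the plain ratio of Eq. (2), so the exact weighting scheme, the helper name and the 5000-term limit (taken from the experiment labels in Sect. 6) are assumptions here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_vectors(lemmatized_docs, vocab_size=5000):
    """lemmatized_docs: one whitespace-joined string of kept lemmas per document."""
    vectorizer = TfidfVectorizer(max_features=vocab_size)
    return vectorizer.fit_transform(lemmatized_docs)  # sparse document-by-lemma matrix
```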
3.2 Representation by Doc2vec Weights
According to [4], the doc2vec representation is a simple extension of word2vec: it embeds word sequences into vectors, where the input can be n-grams, sentences, paragraphs or whole documents. This type of representation is considered state-of-the-art for sentiment analysis, which is essentially also a classification task, so there was a good chance that it would help in our task as well. In this paper we use the doc2vec implementation in the Gensim package [10] for Python. The input data are pairs consisting of the representation obtained in Sect. 3 and the label of the given document. The output is then a matrix of doc2vec weights, where every row corresponds to a specific document.
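A rough sketch of the doc2vec step with Gensim is given below; the vector size follows the 5000-dimensional setting mentioned in Sect. 6, while the remaining hyper-parameters and names are illustrative guesses rather than the authors' settings.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def doc2vec_vectors(tokens_per_doc, size=5000, epochs=40):
    """tokens_per_doc: one list of lemmas per document; returns one vector per document."""
    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokens_per_doc)]
    model = Doc2Vec(vector_size=size, min_count=2, epochs=epochs)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    return [model.dv[i] for i in range(len(tagged))]  # model.docvecs in Gensim < 4.0
```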
3.3 Use of LSA Reduction on Representations 3.1 and 3.2
We have also tried to further reduce the dimension of the vector representations described in 3.1 and 3.2 by the Latent Semantic Analysis (LSA) and consequently analyze the effect on the classification accuracy. The LSA method is implemented in the Python package sklearn – the module TruncatedSVD.
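A sketch of the reduction step, assuming the 200-component setting reported for the reduced vectors in Sect. 6:

```python
from sklearn.decomposition import TruncatedSVD

def lsa_reduce(feature_matrix, n_components=200):
    """Reduce TF-IDF or doc2vec document vectors with truncated SVD (LSA)."""
    return TruncatedSVD(n_components=n_components).fit_transform(feature_matrix)
```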
4 Classification Methods
For our purposes, we picked one simple supervised and one simple unsupervised method. Our goal is to use unsupervised classification and at least get similar results to supervised ones.
4.1 K-Means
A simple unsupervised classification algorithm – the classic K-means clustering method [6] – is used here. It is generally accepted that even such a simple method is quite powerful for unsupervised data clustering if it is given an appropriate feature vector. As we have shown in [8], even simple feature vectors consisting of the tf-idf weights appear to capture the content of the document rather well (and the reduced feature vectors obtained from LSA do it even better). However, we expected to obtain even better results from the doc2vec weights, as they have been shown to be very good at extracting semantic information from documents. The sklearn implementation of the K-means algorithm is used. All preprocessed representations created according to Sect. 3 are used, and this model is applied to all the data sets described in Sect. 2. Results can be found in Sect. 6.
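A minimal sketch of the clustering step is shown below; the evaluation in Sect. 5 requires the number of clusters to equal the number of original categories, and the remaining settings are assumptions.

```python
from sklearn.cluster import KMeans

def cluster_documents(feature_matrix, n_categories, seed=0):
    """Cluster document vectors into as many clusters as there are gold categories."""
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=seed)
    return km.fit_predict(feature_matrix)  # one cluster id per document
```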
4.2 SVM
The supervised classification method used here is the classic linear SVM algorithm. This simple but powerful supervised classification algorithm could be quite sufficient. It was run only with the TF-IDF weights representation. We have used the linear SVM implementation from our favourite sklearn package (to be exact, the LinearSVC module). Results can be found in Sect. 6.
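A sketch of this supervised baseline, combining sklearn's LinearSVC with the 10-fold cross-validation mentioned in Sect. 6; the helper name and the default hyper-parameters are assumptions.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def svm_baseline(tfidf_matrix, gold_labels, folds=10):
    """Linear SVM on TF-IDF features, evaluated with k-fold cross-validation."""
    scores = cross_val_score(LinearSVC(), tfidf_matrix, gold_labels, cv=folds)
    return scores.mean()
```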
5 Evaluation
Quite a few measures for the evaluation of classification algorithms are widely used in published papers. In our experiments, we have decided to use accuracy, precision, recall and F1; this choice was guided mostly by the fact that we wanted to compare the performance of our algorithms with previously published results. The Accuracy (Acc) measure is used only for the 20NG2 data set. It represents the percentage of correctly classified documents, i.e. simply the proportion of test documents that are assigned the correct topic. Tables 1 and 3 list the results using the micro-averaged Precision and Recall measures computed according to [12]; the following equations are explained in our previous work [8] and in the article [12]:

P(T) = \frac{\sum_c \alpha(c, T)}{\sum_c [\alpha(c, T) + \beta(c, T)]}     (3)

R(T) = \frac{\sum_c \alpha(c, T)}{\sum_c [\alpha(c, T) + \gamma(c, T)]}     (4)

The standard equation for computing the F1 measure is [1]:

F1 = \frac{2 \cdot P \cdot R}{P + R}     (5)
The results reported in Tables 1 and 3 list only the Precision measure. This is caused by the usage of uni-labeled data sets (where the number of original categories in the corpus also has to be the same as the number of output clusters from the algorithms): P(T) is then necessarily equal to R(T) and to F1, so it is sufficient to report only one of those values.
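One way to obtain this single number for a clustering output is sketched below; it assumes (which the paper does not state explicitly) that each output cluster is first mapped to its majority gold category before the correctly assigned documents are counted.

```python
import numpy as np
from collections import Counter

def cluster_precision(cluster_ids, gold_labels):
    """Micro-averaged precision of a clustering on uni-labeled data: each cluster is
    mapped to its majority gold category, then correct assignments are counted."""
    cluster_ids = np.asarray(cluster_ids)
    gold_labels = np.asarray(gold_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = gold_labels[cluster_ids == c]
        correct += Counter(members).most_common(1)[0][1]  # size of the majority label
    return correct / len(gold_labels)
```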
6 Results
The first set of results is listed in Table 1; these results were achieved on the 20NG, 10NG, Binary[0/1/2], 5Multi[0/1/2] and 10Multi[0/1/2] data sets. We report only the 10Multi Average, 5Multi Average and 2Multi Average results for the smaller data sub-sets and compare them with the values reported in the previously published paper [12]. Only the results of the unsupervised Sequential Information Bottleneck (sIB) method created by the authors of the mentioned paper were used. In our experiments, Linear SVM uses the 10-fold cross-validation technique and we run the K-means algorithm 10 times over each subset (the same approach as used in [12]). Averaged results from those runs are listed in Table 1. The meaning of the K-means experiment labels is as follows:
– TF-IDF uses tf-idf weights as input, every vector has size 5000.
– TF-IDF (LSA) uses tf-idf weights reduced by the LSA method, every vector has size 200.
– doc2vec uses doc2vec weights as input, every vector has size 5000.
– doc2vec (LSA) uses doc2vec weights reduced by the LSA method, every vector has size 200.
– TF-IDF + doc2vec is the combination of the TF-IDF (LSA) and doc2vec (LSA) weights, every vector has size 400.
In Table 2, the second set of results is listed. We again compare our results with the values reported in the previously published papers [1,3]. The authors of the paper [1] used the SVM based 1 (SVM b. 1) and SVM based 2 (SVM b. 2) methods. Both of these methods are classic SVM algorithms; the SVM b. 1 method uses as input training data generated with the use of WordNet, the documents of the input corpus and preprocessing such as stop-word removal, tokenization, TF-IDF representation, clusters created by Latent Semantic Indexing (LSI), etc. The SVM b. 2 method uses the same preprocessing but only the corpus of input documents. The results of both their methods and of our algorithms are macro F1-measures over the three data sub-sets divided into training and testing data according to Sect. 2. The method listed as HM, described in [3], is a semi-supervised classification and uses a hybrid model of a deep belief network and softmax regression.
Table 1. Comparison of our results with results achieved in [12]. Precision of methods [%].

20NewsGroups sub-set | sIB   | Linear SVM (TF-IDF) | K-means TF-IDF | K-means TF-IDF (LSA) | K-means doc2vec | K-means doc2vec (LSA) | K-means TF-IDF + doc2vec
20NG                 | 57.50 | 96.38 | 51.75 | 51.68 | 70.91 | 70.76 | 73.14
10NG                 | 79.50 | 95.61 | 41.43 | 42.42 | 62.80 | 67.81 | 62.67
Average "large"      | 68.50 | 95.99 | 46.59 | 47.05 | 66.86 | 69.29 | 67.91
10Multi Average      | 67.00 | 91.63 | 40.26 | 40.79 | 47.15 | 49.90 | 52.18
5Multi Average       | 91.67 | 96.85 | 63.65 | 63.25 | 72.45 | 77.76 | 80.95
2Multi Average       | 91.20 | 99.25 | 93.49 | 93.57 | 96.81 | 96.91 | 96.08
Average "small"      | 83.30 | 95.91 | 65.80 | 65.87 | 72.13 | 74.86 | 76.40
The unlabeled data are used to train the deep belief network model and the labelled data are used to train the softmax regression model and to fine-tune the whole coherent system. The results stated as HM are only one of the several results in [3]; they use a different division of the data set into training and testing data, using 7 500 documents as the test set, 11 000 as the unlabeled training set and 3 000 as the labelled training set. To obtain our results we used a division similar to the one used in [3]: training data (we concatenated their unlabeled and labelled data – approximately 13 000 documents, labelled for Linear SVM and without labels for K-means) and test data (approximately 7 000 documents).

Table 2. Comparison of our results with results achieved in [1,3].

Data set            | SVM b. 1 (a) | SVM b. 2 (b) | HM    | Lin. SVM (TF-IDF) | K-means TF-IDF | K-means TF-IDF (LSA) | K-means doc2vec | K-means doc2vec (LSA) | K-means TF-IDF + doc2vec
20NG1 (c), F1 [%]   | 73.00        | 64.00        | –     | 80.00             | 54.00          | 54.00                | 69.01           | 48.00                 | 52.00
20NG2 (d), Acc [%]  | –            | –            | 82.63 | 95.21             | 52.74          | 25.72                | 66.06           | 27.15                 | 29.47
(a) Training done by using 20NewsGroups and Web features. (b) Training done by using only 20NewsGroups. (c) Data set prepared according to [1] and described in Sect. 2. (d) Data set prepared according to [3] and described in Sect. 2.
Results on the Czech data sets are listed in Table 3. We state these only for the purpose of testing our approach on data in a language other than English. The results on a language rather distant from English show that our approach to the preparation of the data can also be applied in this case.
Table 3. Results on Czech data sets. Precision of methods [%].

Czech data set | Linear SVM (TF-IDF) | K-means TF-IDF | K-means TF-IDF (LSA) | K-means doc2vec | K-means doc2vec (LSA) | K-means TF-IDF + doc2vec
CNO            | 76.79 | 28.79 | 28.91 | 30.87 | 29.97 | 29.45
RCNO1          | 93.94 | 46.13 | 47.06 | 53.71 | 52.79 | 54.60
RCNO2          | 96.30 | 42.20 | 42.85 | 49.24 | 49.46 | 53.04
RCNO3          | 93.54 | 51.11 | 51.86 | 61.00 | 61.00 | 61.29
TC             | 77.92 | 31.29 | 32.12 | 31.51 | 28.65 | 32.53
Large TC       | 78.89 | 40.34 | 38.79 | 38.68 | 38.54 | 42.08
7 Conclusion
A reasonably effective pipeline for the unsupervised classification of text documents according to their topic is introduced in this paper. Preprocessing of the raw input text (applying lemmatization and data-driven stop-word removal) and the extracted feature vectors (use of the LSA method) are key factors in our approach. The simple supervised Linear SVM and unsupervised K-means classification algorithms were used and, as predicted, the supervised one is superior to the unsupervised one. Our main goal was for the unsupervised algorithm to reach results at least similar to the supervised one. The performance of this unsupervised method (stated in Table 2) was almost on par with the semi-supervised algorithm and even better than the supervised algorithms used in [1]. Also, as can be seen from Tables 1, 2 and 3, the representation using the doc2vec model increases the performance of our unsupervised method by around 10%. This is an important finding of our research, since benchmark training data – which are necessary for supervised learning – are often not available. Our approach to preprocessing the input texts is also suitable for the simple supervised Linear SVM algorithm, whose performance is comparable with more complex ones (Table 2).
Acknowledgments. This research was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506.
References
1. Chinniyan, K., Gangadharan, S., Sabanaikam, K.: Semantic similarity based web document classification using support vector machine. Int. Arab J. Inf. Technol. (IAJIT) 14(3), 285–292 (2017)
2. Hamdi, A., Voerman, J., Coustaty, M., Joseph, A., d'Andecy, V.P., Ogier, J.M.: Machine learning vs deterministic rule-based system for document stream segmentation. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 5, pp. 77–82. IEEE (2017)
3. Jiang, M., et al.: Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 29(1), 61–70 (2018)
4. Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368 (2016)
5. Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2418–2424 (2015)
6. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
7. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)
8. Novotný, J., Ircing, P.: Unsupervised document classification and topic detection. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 748–756. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_75
9. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://scikit-learn.org
10. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50 (2010). https://radimrehurek.com/gensim/
11. Siolas, G., d'Alché-Buc, F.: Support vector machines based on a semantic kernel for text categorization. In: IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN), vol. 5, pp. 205–209 (2000)
12. Slonim, N., Friedman, N., Tishby, N.: Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136 (2002)
13. Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 13–18 (2014)
14. Švec, J., et al.: General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Lang. Resour. Eval. 48(2), 227–248 (2014). https://doi.org/10.1007/s10579-013-9246-z
15. Trieu, L.Q., Tran, H.Q., Tran, M.T.: News classification from social media using twitter-based doc2vec model and automatic query expansion. In: Proceedings of the Eighth International Symposium on Information and Communication Technology, pp. 460–467. ACM (2017)
A Comparative Survey of Authorship Attribution on Short Arabic Texts
Siham Ouamour and Halim Sayoud
University of Science and Technology Houari Boumediene, Algiers, Algeria
[email protected], [email protected]
Abstract. In this paper, we deal with the problem of authorship attribution (AA) on short Arabic texts. So, we make a survey on a set of several features and classifiers that are employed for the task of AA. This investigation uses characters, character bigrams, character trigrams, character tetragrams, words, word bigrams and rare words. The AA is ensured by 4 different measures, 3 classifiers (Multi-Layer Perceptron (MLP), Support Vector Machines (SVM) and Linear Regression (LR)) and a new proposed fusion called VBF (i.e. Vote Based Fusion). The evaluation is done on short Arabic texts extracted from the AAAT dataset (AA of Ancient Arabic Texts). Although the task of AA is known to be difficult on short texts, the different results have revealed interesting information on the performances of the features and classification techniques on Arabic text data. For instance, character-based features appear to be better than word-based features for short texts. Furthermore, the proposed VBF fusion provided high performances with an accuracy of 90% of good AA, which is higher than the score of the original classifier using only one feature. Globally, the results of this investigation shed light on the efficiency and pertinence of several features and classifiers in AA of short Arabic texts.
Keywords: Natural language processing · Artificial intelligence · Authorship attribution · Arabic language · Short texts · Text-mining
1 Introduction
As per definition, the task of author recognition can be divided into several fields:
• authorship attribution (AA) or identification: consists in identifying the author(s) of a set of different texts;
• authorship verification: consists in checking whether a piece of text is written or not by an author who claimed to be the writer;
• authorship discrimination: consists in checking if two different texts are written by a same author or not [1];
• plagiarism detection: in this research field we look for the sentences or paragraphs that are taken from another author [2];
• text indexing and segmentation: which consists in segmenting the global text into homogeneous segments (each segment contains the contribution of only one author) by giving the name of the appropriate author for each text segment [3].
Although several works are reported for the English and Greek [4] languages, the authors have not found many serious research works on Arabic texts. That is why they propose an overall AA research work that handles several texts written by 10 ancient Arabic travelers who wrote several books describing their travels. A special Arabic corpus has been built by the authors of this paper in order to assess several features and classifiers. The paper is organized as follows: in Sect. 2, we quote some previous works related to AA; in Sect. 3, we describe our textual corpus; Sect. 4 defines the different classifiers and distances used during the experiments; results are presented in Sect. 5 and an overall conclusion is given in Sect. 6.
2 Related Works
Authorship attribution consists in identifying the author of a given text. Several works have tested different features during the last three decades. For instance, Holmes in 1994 [5], Stamatatos in 2000 [6] and Zheng in 2006 [7] proposed taxonomies of features to quantify the writing style. Mendenhall in 1887 [8] proposed sentence length counts and word length counts; a significant advantage of such features is that they can be applied to any language. Several researchers used lexical features to represent the author style, while other works used common words instead [9,10]. Hence, various sets of words have been used for English; we can quote the works of Abbasi and Chen in 2005 [11], the works of Argamon in 2003 [12], the works of Zhao and Zobel in 2005 [13], the works of Koppel and Schler in 2003 [14] and, similarly, the works of Argamon in 2007 [15]. A new interesting feature was proposed by [16] and [17], namely the word n-grams, which provided very good performances. Concerning character n-grams, the application of this approach to AA has shown interesting success. Character bigrams and trigrams have been used in the works of Kjell [18]. In the works of Forsyth and Holmes [19], it was found that bigrams and character n-grams of variable length performed better than lexical features. They have also been successfully used in the works of Peng [20], Keselj [21] and Stamatatos [22]. On the other hand, it is not only the feature which is important; the choice of a suitable classifier is important too. In 2010, Jockers and Witten [23] tested five different classifiers. Concerning the Arabic language, not many works have been reported. However, we can cite some recent works such as those reported by Sayoud in 2012 [1] and Shaker [24]. Sayoud conducted an investigation on authorship discrimination between two old Arabic religious books: the Quran (the holy words of God) and the Hadith (statements of the prophet Muhammad) [1]. Shaker investigated the AA problem in Arabic using function words [24]. In this investigation, we are interested in using several features and classifiers for an evaluation in Arabic stylometry. The AAAT dataset was built by the authors of this paper for the purpose of AA.
3 Description of the Text Dataset
Our textual corpus is composed of 10 groups of old Arabic texts extracted from 10 different Arabic books. The books are written by ten different authors and each group
contains different texts belonging to a unique author. This set of texts has been collected in 2011 from “Alwaraq library” (www.alwaraq.net); we called it AAAT. Furthermore, this corpus represents a reference dataset for AA in Arabic, which has been used by several researchers working in this field. The texts of the corpus are quite short: the average text length is about 550 words and some texts have less than 300 words.
4 Classification Methods
For the evaluation task, we have evaluated 4 distances (Manhattan, Cosine, Stamatatos and Canberra distances) and 3 classifiers (SVM, MLP and LR). Several features are also used, namely: characters, character n-grams, words, word n-grams and rare words, in order to find the most reliable characteristic for the Arabic language. Furthermore, a Vote Based Fusion (VBF) has been proposed to enhance the overall classification performances.
4.1 Manhattan Distance (Man)
The Manhattan distance between two vectors X and Y of length n is defined as follows:
Man(X, Y) = \sum_{i=1}^{n} |X_i - Y_i|     (1)
4.2 Cosine Distance
Cosine similarity is a measure of similarity between two vectors X and Y (of length n) that measures the cosine of the angle between them (denoted by θ). The cosine distance, cos(θ), is represented using a dot product and magnitude as:

\cos\theta = \frac{X \cdot Y}{\|X\|\,\|Y\|} = \frac{\sum_{i=1}^{n} X_i \cdot Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2} \cdot \sqrt{\sum_{i=1}^{n} Y_i^2}}     (2)
4.3 Stamatatos Distance (Sta)
This distance was introduced by Stamatatos [25] to measure text similarity. It was successfully employed in AA. It is given by the following formula:

Sta(X, Y) = \sum_{i=1}^{n} \left[ \frac{2 (X_i - Y_i)}{X_i + Y_i} \right]^2     (3)
4.4 Canberra Distance (Can)
The Canberra distance between vectors X and Y is given by the following equation:

Can(X, Y) = \sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i}     (4)
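A small sketch of the four measures of Eqs. (1)-(4) for two feature-frequency vectors is given below; it assumes NumPy arrays with no zero-sum components (e.g. smoothed frequencies), and the function names are ours.

```python
import numpy as np

def manhattan(x, y):           # Eq. (1)
    return np.sum(np.abs(x - y))

def cosine(x, y):              # Eq. (2), cosine of the angle between x and y
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def stamatatos(x, y):          # Eq. (3)
    return np.sum((2.0 * (x - y) / (x + y)) ** 2)

def canberra(x, y):            # Eq. (4)
    return np.sum(np.abs(x - y) / (x + y))
```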
4.5 Sequential Minimal Optimization-Based Support Vector Machines (SVM) In machine learning, SVM are supervised learning models with associated learning algorithms that analyze data and recognize patterns. They are used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, a SVM training algorithm builds a model that assigns new examples into one category or the other. A SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVM can efficiently perform nonlinear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. The SVM is a very accurate classifier that uses bad examples to form the boundaries of the different classes. Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming problem that arises during the training of the SVM. The SMO algorithm is used to speed up the training of the SVM. In our application, we solved the multi-class problems by using pairwise classification technique. 4.6 Multi-layer Perceptron (MLP) The MLP is a feed-forward neural network classifier that uses the errors of the output to train the neural network: it is the “training step”. The MLP is organized in layers: one input layer of distribution points, one or more hidden layers of artificial neurons (nodes) and one output layer of artificial neurons. Each node, in a layer, is connected to all other nodes in the next layer and each connection has a weight (which can be zero). The MLP is considered as universal approximator and is widely used in supervised machine learning classification. The MLP can use different back-propagation schemes to ensure the classifier training. 4.7 Linear Regression Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norms
(as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression. In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. Such models are called linear models. Usually, the predictor variable is denoted by the variable X and the criterion variable is denoted by the variable y. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.
4.8 Classification Process
The general classification process is divided into two methods: Training Model based Classification and Nearest Neighbor based Classification. In the first type, a training step is required to build the model or the centroid (in the case of similarity measures); afterwards, the testing step can be performed by using the resulting model. In the second type, no training is required, since a simple similarity distance is computed between the unknown document and each referential text: the smallest distance gives an indication of the most probable class. Furthermore, two types of measures are employed: a simple distance and a centroid-based distance. The first type is known to be inaccurate, while the second one (i.e. centroid) is more accurate and robust against noise. The first classification type includes the following classifiers: centroid-based similarity measures, Multi-Layer Perceptron, SMO-based Support Vector Machines and Linear Regression; the second classification type includes only the nearest neighbor similarity measures. After every identification test, a score of good AA is computed in order to get an estimation of the overall classification performances.
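A sketch of the centroid-based variant of the first classification type is shown below: the known texts of each author are averaged into a centroid profile and the unknown text is attributed to the author whose centroid is closest. It re-uses the manhattan function sketched after Eq. (4); the data layout and names are assumptions.

```python
import numpy as np

def centroid_attribution(author_vectors, unknown_vector, distance=manhattan):
    """author_vectors: dict mapping an author name to the list of feature vectors
    of his or her known texts; returns the attributed author for the unknown text."""
    centroids = {author: np.mean(np.vstack(vectors), axis=0)
                 for author, vectors in author_vectors.items()}
    return min(centroids, key=lambda author: distance(centroids[author], unknown_vector))
```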
5 Experiments of Authorship Attribution
In this section, we present the different experiments of AA, which are conducted on the historical Arabic texts. Several features are tested, such as: characters, character bigrams, character trigrams, character tetragrams, words, word bigrams, word trigrams, word tetragrams and rare words. On the other hand, different types of classifiers (MLP, SVM and LR) and distances are employed to ensure the AA classification. The AA Score (AAS) is calculated by using the RandAccuracy formula, as follows:

AAS = RandAccuracy = \frac{\text{number of texts that are well attributed}}{\text{total number of texts}}     (5)
5.1 Comparative Performances
For the purpose of comparison, several figures are presented and commented on to make a comparative study of the different features and classifiers. Figure 1 summarizes the overall best results given by each classifier. In this figure, we remark that the Manhattan centroid distance seems to be very accurate, with a score of 90%, followed by the classifiers MLP and SVM, with a score of 80%; after that, we find the Manhattan nearest neighbor distance and the LR classifier, which provide a score of 70%. Finally, the remaining distances – Canberra, Cosine and Stamatatos – give the worst performances, with a score of 60%.
Fig. 1. Best scores of authorship attribution (AAS) given by the different classifiers.
In Fig. 2, we have presented the average AA performances for every feature. Those performances are obtained by calculating the mean of all the feature scores.
Fig. 2. Overall authorship attribution score for the different features used.
A Comparative Survey of Authorship Attribution on Short Arabic Texts
485
From Fig. 2, we can deduce that the best feature in these experiments is character trigrams, followed by character tetragrams, character bigrams and rare words. The AA performance continues to decrease, respectively, for words, characters, word bigrams, word trigrams and finally word tetragrams, which represent the worst features in our experiments. Overall, we notice two important points: on one hand, the AAS increases with the character n-gram size (i.e. the size n) and decreases with the word n-gram size; on the other hand, character n-grams seem to be more accurate than word n-grams and rare words. Similarly, and in a dual form, Fig. 3 displays the average scores obtained by the different classifiers. These performance scores are obtained by calculating the mean of all the scores of a specific classifier. We notice that the machine learning classifiers are the most accurate, especially the SMO-SVM (average score exceeding 70%), which provides high AA performance. The MLP is strongly accurate, with a score of about 70% of good attribution, and the linear regression is quite interesting (score over 60%). On the other hand, we notice that the distances are less accurate overall, since their average attribution scores do not exceed 58.33%.
Fig. 3. Average AA score per classifier.
Once again, we can observe that character n-grams are better than word n-grams according to this same figure (Fig. 3), and we can also notice that the system fails when using word n-grams. The latter seem unsuitable for short texts: this result is logical because short texts do not contain enough words, or enough word n-grams, to make a fair statistical representation of the features. Figure 4 presents the best score given by each feature. We see that a score of 90% is given by character tetragrams, followed by a score of 80% for character bigrams, character trigrams and rare words, thereafter a score of 70% for words, 60% for characters, 50% for word bigrams, and a score of 20% for word trigrams and tetragrams.
Fig. 4. Best score obtained with the different features.
5.2 Vote Based Fusion
In order to enhance the attribution performance, we thought of using several classifiers combined together in order to get a lower discrimination error: this combination is called fusion. The fusion, in the broad sense, can be performed at different hierarchical levels or processing stages [26], as follows:
• Feature level, where the feature sets of different modalities are combined;
• Score (matching) level, the most common level where the fusion takes place: the scores of the classifiers are normalized and then combined in a consistent manner;
• Decision level, where the outputs of the classifiers establish the decision via techniques such as majority voting.
In this investigation, we have chosen to use the SMO-SVM classifier, which appears to be the best classifier in our experiments. The proposed fusion method is performed at the decision level and is called the "Vote-Based Fusion technique" or VBF. It consists in fusing the output decisions of the different systems (i.e. each system uses the SVM classifier with one specific feature), as described in Eq. 6. For the choice of the features, we have decided to keep only the most pertinent ones, namely those presenting a best score of at least 80%. So, according to Fig. 5, those pertinent features are: character bigrams, character trigrams and rare words.

VBF_{Fusion} = Round\left\{ (\alpha_1 \cdot Char2gram_{CLASS} + \alpha_2 \cdot Char3gram_{CLASS} + \alpha_3 \cdot RareWords_{CLASS}) \cdot \frac{1}{\alpha_1 + \alpha_2 + \alpha_3} \right\}     (6)

where CLASS represents the classifier output and α_i is a constant smaller than one.
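Eq. (6) can be read as the rounded weighted mean of the three single-feature SVM decisions, each decision being an integer author index; a sketch is given below. Equal weights below one are an assumption, as the paper does not give the α_i values.

```python
def vbf_fusion(char2gram_class, char3gram_class, rare_words_class,
               alphas=(0.5, 0.5, 0.5)):
    """Vote-Based Fusion of Eq. (6): rounded weighted mean of the three
    single-feature SVM outputs (integer class indices). Equal alphas assumed."""
    a1, a2, a3 = alphas
    fused = a1 * char2gram_class + a2 * char3gram_class + a3 * rare_words_class
    return round(fused / (a1 + a2 + a3))
```

Note that rounding an average of class indices depends on how the authors are numbered; a plain majority vote over the three decisions would be the more usual realisation of decision-level voting.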
Fig. 5. Vote fusion technique. The outputs Oj are fused to produce the author identity.
The same previous experiments of AA have been conducted by using the proposed fusion technique. Results show that the fusion provides an accuracy of 90%, which is higher than all the scores provided by the SVM. This result is interesting since it shows that it is possible to enhance the identification accuracy only by combining several features and/or classifiers together. Furthermore, it is important to mention that an accuracy of 90% with short texts is motivating, since previous works showed that the minimum amount of required text for a fair AA is at least 2500 tokens [27].
6 Conclusion
An investigation of AA has been conducted on an old Arabic set of text documents that were written by ten ancient Arabic travelers. In this investigation, eleven different classifiers and distances have been used for the attribution task, by using nine different features. Moreover, a fusion technique, called VBF, has been proposed to enhance the AA performances. The main conclusions of the different experiments can be summarized by the following points:
• Character bigrams, trigrams and tetragrams appear to be interesting: character tetragrams appear to be suitable for distances (Manhattan, Canberra, Cosine and Stamatatos), while for the machine learning classifiers, character bigrams are the most accurate.
• The Manhattan centroid distance has shown excellent performances, with an accuracy of 90% when using character tetragrams. The performances of this distance are more or less comparable to those of the SVM, which is considered very reliable.
• As expected theoretically, the SVM has shown excellent average performances in most experiments, which recommends the use of this type of classifier in AA.
• Character-based features are better than word-based ones for short documents.
• The proposed VBF fusion provided high performances with an accuracy of 90% of good AA, which highly recommends the use of fusion in AA.
• Although the word-based features did not give good results, rare words have presented good scores for almost all the classifiers. This result shows that some linguistic information about the author style is embedded in the rare words.
Finally, we think that the results of this investigation are interesting since they shed light on the efficiency of several features and classifiers for AA of short Arabic texts. As future work, we propose to evaluate our system on dialectal Arabic.
References 1. Sayoud, H.: Author discrimination between the Holy Quran and Prophet’s statements. Lit. Linguist. Comput. 27(4), 427–444 (2012) 2. Chowdhury, H.A., Bhattacharyya, D.K.: Plagiarism: taxonomy, tools and detection techniques. In: Paper of the 19th National Convention on Knowledge, Library and Information Networking (NACLIN 2016) held at Tezpur University, Assam, India (2016) 3. Sayoud, H.: Segmental analysis based authorship discrimination between the Holy Quran and Prophet’s statements. Can. Soc. Digit. Hum., Digital Studies Journal (2015) 4. Tambouratzis, G., Hairetakis, G., Markantonatou, S., Carayannis, G.: Applying the SOM model to text classification according to register and stylistic content. Int. J. Neural Syst. 13(1), 1–11 (2003) 5. Holmes, D.I.: Authorship attribution. Comput. Humanit. 28, 87–106 (1994) 6. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000) 7. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57(3), 378–393 (2006) 8. Mendenhall, T.C.: The characteristic curves of composition. Science 9, 237–249 (1887) 9. Argamon S., Levitan, S.: Measuring the usefulness of function words for authorship attribution. In: Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (2005) 10. Burrows, J.F.: Word patterns and story shapes: the statistical analysis of narrative style. Lit. Linguist. Comput. 2, 61–70 (1987) 11. Abbasi, A., Chen, H.: Applying authorship analysis to extremist-group web forum messages. Intell. Syst. 20(5), 67–75 (2005) 12. Argamon, S., Saric, M., Stein, S.: Style mining of electronic messages for multiple authorship discrimination: first results. In: Proceedings of 9th ACM SIGKDD, pp. 475–480 (2003) 13. Zhao, Y., and Zobel, J.: Effective and scalable authorship attribution using function words. 2nd Asia Information Retrieval Symposium (2005) 14. Koppel, M., Schler J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: IJCAI Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72 (2003) 15. Argamon, S., et al.: Stylistic text classification using functional lexical features. J. Am. Soc. Inform. Sci. Technol. 58(6), 802–822 (2007) 16. Peng, F., Shuurmans, D., Wang, S.: Augmenting naive Bayes classifiers with statistical language models. Inf. Retrieval J. 7(1), 317–345 (2004) 17. Sanderson, C., Guenter, S.: Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation. In: Proceedings of the International Conference on Empirical Methods in Natural Language Engineering, pp. 482–491 (2006) 18. Kjell, B.: Discrimination of authorship using visualization. Inf. Process. Manag. 30(1), 141– 150 (1994) 19. Forsyth, R., Holmes, D.: Feature-finding for text classification. Lit. Linguist. Comput. 11(4), 163–174 (1996)
20. Peng, F., Shuurmans, D., Keselj, V., Wang, S.: Language independent authorship attribution using character level language models. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pp. 267–274 (2003) 21. Keselj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. Pacific Association for Computational Linguistics, pp. 255–264 (2003) 22. Ouamour, S., Sayoud, H.: Authorship attribution of ancient texts written by ten arabic travelers using character N-Grams. CITS-2013, Athens, Greece, CITS (2013) 23. Jockers, M.L., Witten, D.M.: A comparative study of machine learning methods for authorship attribution. Lit. Linguist. Comput. 25(2), 215–223 (2010) 24. Shaker, K.: Investigating features and techniques for Arabic authorship attribution, PhD thesis Heriot-Watt University (2012) 25. Stamatatos, E.: Author identification using imbalanced and limited training texts, text-based Information Retrieval, pp. 237–241 (2007) 26. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. Trans. Circ. Syst. Video Technol. 14(1), 4–20 (2004) 27. Ouamour, S., Khennouf, S., Bourib, S., Hadjadj, H., Sayoud H.: Effect of the text size on stylometry-application on arabic religious texts. In: International Conference on Computer Science Applied Mathematics and Applications, pp 215–228, Vienna, Austria (2016)
How Good Is Your Model 'Really'? On 'Wildness' of the In-the-Wild Speech-Based Affect Recognisers Vedhas Pandit1(B), Maximilian Schmitt1, Nicholas Cummins1, Franz Graf2, Lucas Paletta2, and Björn Schuller1,3
1
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany [email protected] 2 Joanneum Research Forschungsgesellschaft mbH, Graz, Austria 3 Group on Language, Audio, and Music (GLAM), Imperial College London, London, UK
Abstract. We evaluate, for the first time, the generalisability of in-the-wild speech-based affect tracking models using the database used in the 'Affect Recognition' sub-challenge of the Audio/Visual Emotion Challenge and Workshop (AVEC 2017) – namely the 'Automatic Sentiment Analysis in the Wild (SEWA)' corpus – and the 'Graz Real-life Affect in the Street and Supermarket (GRAS2)' corpus. The GRAS2 corpus is the only corpus to date featuring audiovisual recordings and time-continuous affect labels of random participants recorded surreptitiously in a public place. The SEWA database was also collected in an in-the-wild paradigm, in that it features spontaneous affect behaviours and real-life acoustic disruptions due to connectivity and hardware problems. The SEWA participants, however, were well aware of being recorded throughout, and thus the data potentially suffers from the 'observer's paradox'. In this paper, we evaluate how a model trained on typical data suffering from the observer's paradox (SEWA) fares on real-life data that is relatively free from such psychological effects (GRAS2), and vice versa. Because of the drastically different recording conditions and recording equipment, the feature spaces of the two databases differ extremely. The in-the-wild nature of the real-life databases and the extreme disparity between the feature spaces are the key challenges tackled in this paper, a problem of high practical relevance. We extract bag-of-audio-words features using, for the very first time, a randomised, database-independent codebook. True to our hypothesis, the Support Vector Regression model trained on GRAS2 had better generalisability, as this model could reasonably predict the SEWA arousal labels.
Keywords: Affective speech analysis · Transfer learning · Observer's paradox · One-way mirror dilemma · Authentic emotions · In-the-wild
1
Introduction
Human speech is a complex signal, featuring a plethora of information beyond the spoken words. In addition to the linguistic content, a speech signal tells the listener a lot about the speaker – such as their age, gender, native language, motivations and emotions. It is important for a human–machine interaction system to recognise these contexts correctly, to be able to respond accordingly. Today, we are continuously surrounded by human–machine interfaces. A virtual assistant in a handheld device is no longer science fiction, but simply an everyday reality. There is, therefore, a growing interest in the field of affective computing in making machines 'understand' human speech in its entirety, i. e., including the featured emotions and contexts. Broadly speaking, there are three types of databases used in affect research. Early research utilised acted speech data, which typically featured highly exaggerated affect behaviours, far from natural ones (e. g., EmoDB [1,12]). In another data collection strategy, the participants are made to converse in a laboratory environment. While the behaviours collected are mostly natural and spontaneous, the collected data is typically clean and unaffected by real-life effects such as noise (e. g., RECOLA [16]). The third type, 'in-the-wild' databases, refers to data collected in non-laboratory, everyday, unpredictably noisy environments. However, the so-called 'in-the-wild' databases mostly feature recordings collected in identical real-life settings, with very similar acoustic disruptions. This has direct implications for the trained models, limiting their generalisability. Also, most of these databases suffer from the phenomenon called the 'observer's paradox' or 'one-way mirror dilemma' – where the participants are typically well aware of being recorded right from the beginning of the recordings – which affects the featured affect behaviours [19]. In this contribution, we test, for the first time, the hypothesis that a model trained on a closer-to-real-life database is likely to generalise better [14]. While there have been transfer learning studies on affect [2–4,11], there is hardly any research on the generalisability of time-continuous affect recognising models for real-life or in-the-wild datasets. To this end, we first introduce the two databases used in this study in Sect. 2. We describe our experiments in detail in Sect. 3. After this, we present our findings in Sect. 4, before we conclude the paper in Sect. 5.
2
Databases
To test which of the two affect recognising models generalises better – i. e., the one trained on a 'more' in-the-wild database or the one trained on a database collected under relatively restrained, 'laboratory'-like settings – we use two prominent benchmark databases, namely the 'Automatic Sentiment Analysis in the Wild' (SEWA) corpus used in the AVEC 2017 challenge and the 'Graz Real-life Affect in the Street and Supermarket' (GRAS2) corpus. The SEWA database features video chat recordings of the participants discussing the commercials they had just watched. The recordings were collected using
the standard webcams and computers from the participants' homes or offices. The data collection took place over the internet, using a video-chat interface specifically designed for this task. The recordings feature spontaneous affect behaviours, real-life noises and delays due to connectivity and hardware problems. The participants dominated the conversations more or less equally. The GRAS2 database features audiovisual recordings of conversations with unsuspecting participants, captured from a first-person point of view in a busy shopping mall. The participants were made aware of being recorded only half way through the conversations, and were requested to sign a consent form agreeing to release the recordings for research purposes. The database thus features spontaneous and 'more' authentic affective behaviours, as they are relatively more free of the observer's paradox. Because the conversations were totally spontaneous, their durations vary widely (standard deviation = 56.3 s). The extent to which the participants dominate the conversations, i. e., the relative durations of the subject's speech and the speech of the student research assistant collecting the data, also varies widely. Unfortunately, the student research assistants dominate many of the conversations. The sections of the recordings where the participants read the documents before signing the consent form hardly feature any subject speech. The recordings also contain dynamically varying noise, including impact sounds, bustle, background music, and background speech. Only 28 conversations are available. All these factors combine to make this database a lot more 'in-the-wild' and the affect tracking task a lot more challenging. The corpus was previously used in a research study establishing a correlation between eye contact and speech [6], and in another study on time-continuous authentic affect recognition in-the-wild [13].
3
Experimental Design
3.1 Data Splits
We split both the SEWA and the GRAS2 corpus into training, validation and test sets in a roughly 2:1:1 ratio, in terms of both the number of files in each split and the cumulative duration of the audio clips. We use the same splits as in the AVEC 2017 challenge [15] when running our experiments (Fig. 1) on the SEWA database. The splits are made such that a participant-independent model can be trained, i. e., no participant is present in more than one split. The splits on GRAS2 are likewise made such that each split features a different student assistant, i. e., no student assistant is present in more than one split. The statistics for the three splits are presented in Table 1.
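A participant-independent split of this kind can be produced with a grouped splitter; the sketch below is only illustrative (the actual SEWA splits are the fixed AVEC 2017 splits), and the variable names are placeholders.

```python
from sklearn.model_selection import GroupShuffleSplit

# Illustrative participant-independent 2:1:1 split; `clips` is a list of audio
# files and `groups` the corresponding participant (or student assistant) IDs.
def split_by_group(clips, groups, seed=0):
    outer = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    train_idx, rest_idx = next(outer.split(clips, groups=groups))
    rest = [clips[i] for i in rest_idx]
    rest_groups = [groups[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_idx, test_idx = next(inner.split(rest, groups=rest_groups))
    return ([clips[i] for i in train_idx],
            [rest[i] for i in val_idx],
            [rest[i] for i in test_idx])
```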
3.2 Feature Engineering
We need the features from the two databases to be compatible with one another; ideally, the two should share a common feature space. Because we are interested in predicting time-continuous signals of emotion dimensions, the features should also ideally capture the temporal dynamics of the varying low-level descriptor (LLD) space, and they should be robust to noise.
Fig. 1. Entire experimental design pipeline.

Table 1. Duration statistics for the SEWA and GRAS2 data splits.

| Duration (seconds)     | SEWA Train | SEWA Validation | SEWA Test | GRAS2 Train | GRAS2 Validation | GRAS2 Test |
|------------------------|------------|-----------------|-----------|-------------|------------------|------------|
| Total                  | 5608.02    | 2272.30         | 2807.42   | 2018.75     | 1000.45          | 998.02     |
| Max                    | 175.64     | 175.45          | 175.81    | 218.77      | 290.94           | 309.40     |
| Min                    | 46.68      | 97.43           | 174.9     | 71.77       | 100.31           | 86.40      |
| Mean                   | 164.94     | 162.31          | 175.46    | 126.17      | 166.74           | 166.34     |
| Std. Dev.              | 31.24      | 26.71           | 0.24      | 34.90       | 63.67            | 74.93      |
| Number of participants | 34         | 14              | 16        | 16          | 6                | 6          |
features should also ideally capture the temporal dynamics of the varying lowlevel descriptor (LLD) space. The features should ideally be robust to noise. We generate the bags of audio words (BoAW) features using our own openXBOW toolkit [17] by vector quantising the ‘enhanced Geneva Minimalistic Acoustic Parameter Set’ (eGeMAPS) [5] low level descriptors (LLDs) extracted using our openSMILE toolkit [7]. This feature set is quite popular in the affective computing field already; we have used these exact features for establishing a baseline model performance for the AVEC 2017 challenge as the challenge organisers. The eGeMAPS LLDs is a minimalistic set of acoustic parameters, particularly tailor-made for affective vocalisation and voice research, consisting of only 23 LLDs. To capture the temporal dynamics of the individual parameters and LLD types, we extract BoAW features based on these LLDs. The BoAW approach generates a sparse fixed length histogram representation of the quantised features in time, thus capturing the temporal dynamics of the LLD vectors, while remaining noise-robust due to its inherent sparsity and the quantisation step [13,17,18]. However, the eGeMAPS LLDs are drastically different for the two databases in terms of their value ranges. Because the critical statistics – such as the mean, the variance, the maximum and the minimum value – are radically different (some with even the opposite signs), the statistics computed on one database cannot be reliably be used to standardise or normalise the other database such that they share a common feature space. Furthermore, the codebook used in the AVEC 2017 challenge utilises a random sampling of the SEWA eGeMAPS LLD vectors. For transfer learning experiments however, we ideally should not gener-
the codebook by sampling only one of the two databases, as such a codebook is likely to represent one dataset better than the other. It is imperative to use an identical, completely data-independent codebook to vector quantise the two databases – especially when the ranges of the feature values are drastically different. Only then can we objectively assess the generalisability of the trained models, free from the effect of the codebook representing the temporal dynamics of one dataset better than the other. We thus generate a codebook of size 1000, independent of the two databases, consisting of 23-length LLD vectors. An array of shape 1000 × 23, populated with random samples from a normal distribution (mean = 0.5, standard deviation = 0.1), is used as the codebook matrix. We preprocess the LLDs by scaling and offsetting all of the data splits, using the offsets and scaling factors that normalise the respective training split to the range [0, 1]. We then vector quantise all of the LLDs against the randomised codebook with 10 soft assignments for every LLD. We compute the distribution of the assignments in a moving window of 6 s, with a hop size of 0.1 s – similar to how the AVEC 2017 features were generated [15].
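The following NumPy sketch illustrates the randomised-codebook BoAW computation described above. It is a simplification rather than the openXBOW implementation: the LLD frame rate, the normalisation statistics and the conversion of the window and hop sizes into frame counts are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(loc=0.5, scale=0.1, size=(1000, 23))  # data-independent codebook

def normalise(lld, lo, hi):
    """Scale/offset LLDs with training-split statistics so they lie in [0, 1]."""
    return (lld - lo) / (hi - lo + 1e-12)

def bag_of_audio_words(lld, codebook, n_soft=10, win_frames=60, hop_frames=1):
    """Soft-assign each LLD frame to its n_soft nearest codewords and count the
    assignments in a sliding window (60 frames for 6 s and a 1-frame hop for
    0.1 s, assuming one LLD frame every 0.1 s -- an assumption of this sketch)."""
    # Euclidean distance between every frame and every codeword
    d = np.linalg.norm(lld[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :n_soft]              # (frames, n_soft)
    histograms = []
    for start in range(0, len(lld) - win_frames + 1, hop_frames):
        hist = np.bincount(nearest[start:start + win_frames].ravel(),
                           minlength=len(codebook))
        histograms.append(hist)
    return np.asarray(histograms, dtype=float)

# toy usage with random "LLDs" already normalised to [0, 1]
boaw = bag_of_audio_words(rng.random((600, 23)), codebook)
print(boaw.shape)  # (541, 1000)
```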
3.3 Gold Standard Generation
We use the gold standard arousal and valence values of the AVEC 2017 challenge when training on the SEWA database [15]. We generate the gold standard for the GRAS2 database using the same algorithm as for SEWA. The gold standard used in our previous studies on GRAS2 differs only in that we previously did not compensate for annotator-specific mean annotation standard deviations [13]. We use the modified Evaluator Weighted Estimator (EWE) method to generate the gold standards, one per subject per emotion dimension. The goal of the EWE metric is to take into account the reliability of the individual annotators, signified by the weight r_k for every annotation y_k. This confidence value is computed by quantifying the extent to which the annotations by that annotator agree with the rest of the annotations. The gold standard y_EWE is defined as:

y_{EWE_n} = \frac{1}{\sum_{k=1}^{K} r_k} \sum_{k=1}^{K} r_k \, y_{n,k},   (1)

where y_{n,k} is an annotation by annotator k (k ∈ N, 1 ≤ k ≤ K) at instant n (n ∈ N, 1 ≤ n ≤ N) contributing to the annotation sequence y_k, and r_k is the corresponding annotator-specific weight. The lower bound for r_k is set to 0. In [8], the weight r_k is defined to be the normalised cross-correlation between y_k and the averaged annotation sequence \bar{y}_n. The gold standards used in both the AVEC 2017 baseline paper [15] and the GRAS2-based affect recognition study [13] redefined the weight r_k such that it is strongly influenced by the total number of annotations that y_k is in agreement with, and also by the extent to which they agree, by simply averaging the pairwise correlations. The weights are lower-bounded at 0 as usual, and are then normalised such that they sum to 1:

r_{k_i,k_j} = \frac{\sum_{n=1}^{N} (y_{n,k_i} - \mu_{k_i})(y_{n,k_j} - \mu_{k_j})}{\sqrt{\sum_{n=1}^{N} (y_{n,k_i} - \mu_{k_i})^2}\,\sqrt{\sum_{n=1}^{N} (y_{n,k_j} - \mu_{k_j})^2}}, \quad \text{where } \mu_k = \frac{1}{N}\sum_{n'=1}^{N} y_{n',k},   (2)

r_{k_i} = \begin{cases} \frac{1}{K}\sum_{k_j=1}^{K} r_{k_i,k_j} & \text{if } \sum_{k_j=1}^{K} r_{k_i,k_j} > 0 \\ 0 & \text{if } \sum_{k_j=1}^{K} r_{k_i,k_j} \le 0 \end{cases}, \qquad r_{k_i} \leftarrow \frac{r_{k_i}}{\sum_{k_j=1}^{K} r_{k_j}}.   (3)
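A compact NumPy sketch of Eqs. (1)–(3) is given below. It assumes an N × K matrix of annotator traces with no constant (zero-variance) tracks, and is meant only to make the weighting explicit, not to reproduce the exact AVEC tooling.

```python
import numpy as np

def ewe_gold_standard(annotations):
    """annotations: (N, K) array -- N time steps, K annotators."""
    n, k = annotations.shape
    r_pair = np.corrcoef(annotations.T)      # pairwise correlations, Eq. (2)
    r = r_pair.mean(axis=1)                  # average agreement, Eq. (3)
    r = np.clip(r, 0.0, None)                # lower-bound the weights at 0
    if r.sum() == 0:                         # degenerate case: fall back to a plain mean
        r = np.ones(k)
    r = r / r.sum()                          # normalise so the weights sum to 1
    # With normalised weights, the denominator of Eq. (1) equals 1.
    return annotations @ r

# toy example with three annotators rating four instants
tracks = np.array([[0.1, 0.2, 0.4, 0.3],
                   [0.0, 0.3, 0.5, 0.2],
                   [0.2, 0.1, 0.2, 0.4]]).T  # shape (4, 3)
print(ewe_gold_standard(tracks))
```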
3.4 Annotator Lag Compensation
To compensate for the reaction time of the annotators, we delay the feature vectors in time [10]. We use a delay value of 2.2 s, based on our previous grid-search analysis on the SEWA corpus [15]. In this study, we remove the repeating feature vectors at the beginning of every sample sequence that were introduced by the lag-compensating function used in AVEC 2017. We find that there is little to no difference in performance due to the removal of these erroneously repeated feature vectors. This is expected, since the number of removed features (22, for an annotator lag compensation of 2.2 s) is less than 2% of the total number of feature vectors for an average SEWA audio recording. Though it neither improves nor deteriorates the performance of the models, we note this addition to our preprocessing steps, in comparison with the AVEC 2017 workflow [15], for the sake of correctness and completeness.
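A minimal sketch of the 2.2 s lag compensation, assuming features and labels share a common frame rate of 10 frames per second (so the delay corresponds to 22 frames):

```python
def compensate_annotator_lag(features, labels, lag_s=2.2, fps=10):
    """Pair the feature frame at time t with the label at time t + lag,
    trimming the sequences instead of repeating frames at the start."""
    lag = int(round(lag_s * fps))          # 22 frames at 10 fps
    return features[:-lag], labels[lag:]
```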
3.5 Regression Models
For the new BoAW feature sets generated using the randomised codebook, we first establish baseline regression results by training support vector machine (SVM)-based regression models (SVR) using a linear kernel with complexity values C = 2^-15, 2^-14, ..., 2^0, just as was done when establishing the AVEC 2017 challenge baseline. We also experiment with additional C values in the range [10^-8, ..., 10^-5], as the GRAS2-trained arousal model was found to perform well for C ∈ [2^-15, 2^-7]. We also ran regression models using simple feedforward neural networks (FFNNs), and double-stacked and single-stacked recurrent neural networks (RNNs) with gated recurrent units (GRUs) in cascade with FFNNs. To train a GRU-based model, we used feature sequences of length 60, corresponding to 6 s. We experimented with several configurations for the network topologies (20 to 100 GRU nodes, 10- to 50-node FFNN layers), activation function permutations (selu, tanh, linear), feature lengths (60, 80), learning rates (0.001 to 0.01 in steps of 0.003), and optimisers (rmsprop, adam, adagrad, and adamax).
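As an illustration of the linear-kernel SVR grid, one could select the complexity value on a development set using the CCC [9] as the selection criterion. This is a sketch under assumptions: the exact toolchain is not specified here, and X_train, y_train, X_dev and y_dev stand in for the BoAW feature matrices and the gold-standard arousal labels.

```python
import numpy as np
from sklearn.svm import SVR

def ccc(x, y):
    """Concordance correlation coefficient [9], with population statistics."""
    mx, my = np.mean(x), np.mean(y)
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (np.var(x) + np.var(y) + (mx - my) ** 2)

def tune_linear_svr(X_train, y_train, X_dev, y_dev, log2_c_range=range(-15, 1)):
    best = (-np.inf, None, None)
    for log2_c in log2_c_range:                       # C = 2^-15 ... 2^0
        model = SVR(kernel="linear", C=2.0 ** log2_c).fit(X_train, y_train)
        score = ccc(model.predict(X_dev), y_dev)
        if score > best[0]:
            best = (score, log2_c, model)
    return best                                       # (dev CCC, log2 C, model)
```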
3.6 Post-processing
We post-process the predictions using the equation:

Y_{new} = \frac{\sigma_1}{\sigma_2}\left(Y_{orig} - \mu_2\right) + \mu_1,   (4)

where Y_orig is the primary prediction, Y_new is the post-processed prediction, and μ1, σ1, μ2, σ2 are the mean and standard deviation of the training label sequence and of the model's prediction on the training data, respectively [20].
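Eq. (4) amounts to matching the first two moments of the predictions to those of the training labels; a short sketch (array names are placeholders):

```python
import numpy as np

def match_moments(y_pred, train_labels, train_pred):
    mu1, sigma1 = np.mean(train_labels), np.std(train_labels)   # label statistics
    mu2, sigma2 = np.mean(train_pred), np.std(train_pred)       # prediction statistics
    return (sigma1 / sigma2) * (y_pred - mu2) + mu1             # Eq. (4)
```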
4
Results and Discussions
All of the models we trained (SVRs, GRU-RNNs and FFNNs) performed reasonably well as long as the test split and the training split came from the same database, with a concordance correlation coefficient (CCC) [9] close to 0.25 on average. Of these, only the SVR-based models trained on the GRAS2 arousal annotations could reasonably make predictions in the transfer learning experiments (Table 2). The models otherwise mostly fail to generalise to a different dataset, with CCC values close to zero. For the transfer learning experiments from SEWA to GRAS2, and vice versa, our key findings are as follows.
4.1 Neural Networks Tended to Overfit to the Primary Database
We observed that the neural network-based models tended to overfit to the database they were trained on. The predictions were reasonably good for the test and validation splits of the database that the training split came from. While the performance on the primary database also depends on the random initialisation of the weights and biases, the models invariably failed to make reasonable predictions on a different database (CCC close to zero).
4.2 Valence Tracking Learnings Were Not Generalisable Beyond the Database
Valence prediction is a particularly hard problem compared to arousal prediction [13,16,18]. We observed that the models could predict the valence dimension for the validation and test splits of the same database (CCC as high as 0.42), but the prediction models tended to overfit to that database. This observation held irrespective of the type of model used and the direction of transfer learning (i. e., whether SEWA to GRAS2 or GRAS2 to SEWA).
4.3 GRAS2-Trained SVR-Based Arousal Tracking Was Reasonably Generalised
Interestingly though, the SVR-based arousal prediction models trained on GRAS2 alone fared reasonably well on the SEWA database, with CCC values as high as 0.222 over the complete SEWA database – despite the SEWA database being twice the size of GRAS2. In the interest of reproducibility of the experiments presented in this paper, the complexity values and the corresponding performance values for the different models are indicated in Table 2. We note that, out of the three SEWA splits, the model performs the worst on its training data split, which is also the most diversified of the three splits (Table 1). Despite having a much smaller training set, the GRAS2-to-SEWA model transfer learning for arousal prediction worked reasonably well.
Table 2. Performance of the models in the transfer learning experiments for the arousal dimension. The models were trained only using the training split of the GRAS2 database, and were tested on the remaining data splits of GRAS2 and the entire SEWA German database. We note the performance on the individual data splits of the SEWA database, to get better understanding of the coincidental data disparities and similarities between the two databases, and how the performance varies across splits with change in the complexity values. Interestingly enough, the similar SVR-based models trained on SEWA did not perform well on GRAS2 database.

| C Value | Database | Phase      | Data split | CCC  | PCC  | RMSE |
|---------|----------|------------|------------|------|------|------|
| 10^-5   | GRAS2    | Training   | Training   | .501 | .501 | .137 |
|         |          | Validation | Validation | .363 | .370 | .144 |
|         |          | Testing    | Testing    | .280 | .320 | .152 |
|         | SEWA     | Testing    | Training   | .149 | .216 | .171 |
|         |          |            | Validation | .325 | .356 | .144 |
|         |          |            | Testing    | .197 | .230 | .132 |
|         |          |            | Entirety   | .223 | .263 | .144 |
| 2^-15   | GRAS2    | Training   | Training   | .582 | .582 | .125 |
|         |          | Validation | Validation | .382 | .386 | .140 |
|         |          | Testing    | Testing    | .266 | .303 | .149 |
|         | SEWA     | Testing    | Training   | .128 | .178 | .170 |
|         |          |            | Validation | .280 | .340 | .161 |
|         |          |            | Testing    | .188 | .250 | .144 |
|         |          |            | Entirety   | .191 | .252 | .162 |
| 2^-13   | GRAS2    | Training   | Training   | .691 | .691 | .108 |
|         |          | Validation | Validation | .350 | .353 | .144 |
|         |          | Testing    | Testing    | .241 | .256 | .143 |
|         | SEWA     | Testing    | Training   | .082 | .103 | .188 |
|         |          |            | Validation | .236 | .290 | .184 |
|         |          |            | Testing    | .155 | .191 | .160 |
|         |          |            | Entirety   | .156 | .193 | .180 |
| 2^-11   | GRAS2    | Training   | Training   | .778 | .778 | .091 |
|         |          | Validation | Validation | .331 | .341 | .152 |
|         |          | Testing    | Testing    | .228 | .235 | .144 |
|         | SEWA     | Testing    | Training   | .107 | .122 | .198 |
|         |          |            | Validation | .251 | .279 | .195 |
|         |          |            | Testing    | .171 | .191 | .169 |
|         |          |            | Entirety   | .175 | .196 | .190 |
| 2^-9    | GRAS2    | Training   | Training   | .834 | .834 | .079 |
|         |          | Validation | Validation | .248 | .265 | .170 |
|         |          | Testing    | Testing    | .180 | .183 | .145 |
|         | SEWA     | Testing    | Training   | .120 | .146 | .233 |
|         |          |            | Validation | .156 | .174 | .231 |
|         |          |            | Testing    | .208 | .239 | .186 |
|         |          |            | Entirety   | .156 | .181 | .221 |
SEWA-to-GRAS2 transfer learning, however, does not quite work (again, CCC close to zero), despite the training split having twice as much data to train the model on, with identical model parameters. We speculate that the SEWA database is not as in-the-wild as GRAS2. GRAS2 also features random background speech, bustle, impact sounds, background music, and even long non-speech sections. Emotion dimension labels exist even for these non-speech or rare-speech sections, which the model needs to learn – in itself a challenging task. This more in-the-wild nature of the data manifests itself in far more challenging training instances that help the model learn arousal prediction with more nuance.
5
Conclusions and Future Work
We present a first-of-its-kind transfer learning study on speech-based, time-continuous, in-the-wild affect recognising models. To this end, we used a novel BoAW approach that uses a data-independent randomised codebook. The GRAS2 database – featuring relatively more observer's-paradox-free affective behaviours and a lot more data diversity in terms of conversation durations, acoustic events, noise dynamics, and spontaneity of the featured affective behaviours – proved to be highly effective in training a more generalised arousal tracking model than the SEWA database, despite its smaller size. As for the valence dimension, neither of the databases was effective enough to train a well-generalised valence tracking model. Furthermore, none of our neural network-based models could predict the emotion dimensions (either arousal or valence) on a different database through transfer learning; all of these models were observed to perform well on unseen data from the databases they were trained on. The new BoAW paradigm of using data-independent randomised codebooks helps one project dissimilar databases onto a common normalised feature space while inherently capturing the temporal dynamics of the LLDs – a technique that can be further developed and fine-tuned. We intend to investigate the effect of different randomisation strategies (sampling from differently skewed, uniform, or different normal distributions), as well as of the codebook size and the number of assignments, on model performance. We would also like to extend this work by adding more in-the-wild databases. Our findings on the better generalisability of the GRAS2-trained arousal tracking model encourage us to use more such databases that are free from the observer's paradox. Unfortunately, no other observer's-paradox-free databases are publicly available today. We therefore plan to collect new data using a data collection strategy similar to the one used to build GRAS2. The next logical step is to add other prominent affect recognition databases – such as RECOLA [16]. This will culminate in an exhaustive study of affect-related databases and their effectiveness in training the most generalised, real-life, time-continuous affect recognisers.
Acknowledgments. This work was partly supported by the EU's Horizon 2020 Programme through the Innovative Action No. 645094 (SEWA), and the European Community's 7th Framework Program under Grant No. 288587 (MASELTOV).
References 1. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: Proceedings of the 9th EUROSPEECH, pp. 1517– 1520 (2005) 2. Coutinho, E., Deng, J., Schuller, B.: Transfer learning emotion manifestation across music and speech. In: Proceedings of the IJCNN, Beijing, China, pp. 3592–3598. IEEE (2014) 3. Deng, J., Xia, R., Zhang, Z., Liu, Y., Schuller, B.: Introducing shared-hiddenlayer autoencoders for transfer learning and their application in acoustic emotion recognition. In: Proceedings of the 39th ICASSP, Florence, Italy, pp. 4851–4855. IEEE (2014) 4. Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: Proceedings of the 5th HUMAINE Association Conference on ACII, Geneva, Switzerland, pp. 511–516. IEEE (2013) 5. Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2016) 6. Eyben, F., Weninger, F., Paletta, L., Schuller, B.: The acoustics of eye contact detecting visual attention from conversational audio cues. In: Proceedings of the 6th Workshop on Eye Gaze in Intelligent Human Machine Interaction: Gaze in Multimodal Interaction (GAZEIN) at 15th ICMI, Sydney, Australia, pp. 7–12. ACM (2013) 7. Eyben, F., Weninger, F., Groß, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM MM 2013, Barcelona, Spain, pp. 835–838. ACM (2013). (Honorable Mention (2nd place) in the ACM MM 2013 Open-source Software Competition, acceptance rate: 28%, >200 citations) 8. Grimm, M., Kroschel, K.: Evaluation of natural emotions using self assessment manikins. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 381–385 (2005) 9. Lawrence, I., Lin, K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1), 255–268 (1989) 10. Mariooryad, S., Busso, C.: Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In: Affective Computing and Intelligent Interaction (ACII), pp. 85–90 (2013) 11. Ng, H.W., Nguyen, V.D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. In: Proceedings of the 17th ICMI, pp. 443–449. ACM (2015) 12. Paeschke, A., Kienast, M., Sendlmeier, W.F.: F0-contours in emotional speech. In: Proceedings of the 14th International Congress of Phonetic Sciences, vol. 2, pp. 929–932 (1999) 13. Pandit, V., et al.: Tracking authentic and in-the-wild emotions using speech. In: Proceedings of the 1st ACII Asia 2018, Beijing, P. R. China. AAAC/IEEE (2018) 14. Pantic, M., Sebe, N., Cohn, J.F., Huang, T.: Affective multimodal human-computer interaction. In: Proceedings of the 13th ACM MM, Multimedia 2005, Singapore, pp. 669–676. ACM (2005)
15. Ringeval, F., et al.: AVEC 2017 - real-life depression, and affect recognition workshop and challenge. In: Ringeval, F., Valstar, M., Gratch, J., Schuller, B., Cowie, R., Pantic, M. (eds.) Proceedings of the 7th International Workshop on Audio/Visual Emotion Challenge (AVEC 2017) at 25th ACM MM, Mountain View, CA, pp. 3–9. ACM (2017). 6 p 16. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2013), Shanghai, P. R. China, pp. 1–8. IEEE (2013) 17. Schmitt, M., Schuller, B.: openXBOW - Introducing the passau open-source crossmodal bag-of-words toolkit. J. Mach. Learn. Res. 18, 3370–3374 (2017) 18. Schmitt, M., Ringeval, F., Schuller, B.: At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech. In: Proceedings of the 17th INTERSPEECH, San Francisco, CA, pp. 495–499. ISCA (2016) 19. Speer, S., Hutchby, I.: From ethics to analytics: aspects of participants’ orientations to the presence and relevance of recording devices. Sociology 37(2), 315–337 (2003) 20. Trigeorgis, G., et al.: Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In: Proceedings of the 41st ICASSP, Shanghai, P. R. China, pp. 5200–5204. IEEE (2016)
RAMAS: Russian Multimodal Corpus of Dyadic Interaction for Affective Computing Olga Perepelkina1,2(B) , Evdokia Kazimirova1 , and Maria Konstantinova1,2 1 Neurodata Lab LLC, Miami, FL, USA {o.perepelkina,e.kazimirova,m.konstantinova}@neurodatalab.com, [email protected] http://www.neurodatalab.com/en/ 2 Lomonosov Moscow State University, Moscow, Russia
Abstract. Emotion expression encompasses various types of information, including face and eye movement, voice and body motion. Emotions collected from real conversations are difficult to classify using one channel. That is why multimodal techniques have recently become more popular in automatic emotion recognition. Multimodal databases that include audio, video, 3D motion capture and physiology data are quite rare. We collected the Russian Acted Multimodal Affective Set (RAMAS) − the first multimodal corpus in the Russian language. Our database contains approximately 7 h of high-quality close-up video recordings of faces, speech, motion-capture data and such physiological signals as electrodermal activity and photoplethysmogram. The subjects were 10 actors who played out interactive dyadic scenarios. Each scenario involved one of the basic emotions: Anger, Sadness, Disgust, Happiness, Fear or Surprise, and such characteristics of social interaction as Domination and Submission. In order to note the emotions that the subjects really felt during the process, we asked them to fill in short questionnaires (self-reports) after each played scenario. The records were marked by 21 annotators (at least five annotators marked each scenario). We present our multimodal data collection, the annotation process, an inter-rater agreement analysis and a comparison between the self-reports and the received annotations. RAMAS is an open database that provides the research community with multimodal data on the interrelation of faces, speech, gestures and physiology. Such material is useful for various investigations and for the development of automatic affective systems. Keywords: Affective computing · Multimodal affect recognition · Multimodal database · Russian emotion database
1
Introduction
Emotions are difficult to classify by means of one channel, so multimodal techniques have recently become more popular in automatic emotion recognition.
There are several data corpora suitable for multimodal emotion recognition purposes. The USC CreativeIT database [20] may serve as an example here. It includes full-body motion capture, video and audio data. This database provides annotations for each fragment concerning valence, activation and dominance categories. The interactive emotional dyadic motion capture database (IEMOCAP) [9] contains audio-visual and motion data for faces and hands only, but not for the whole body. The MPI Emotional Body Expressions Database [29] was also collected by means of several channels, yet only the motion capture data are available to the research community. Finally, there are well-designed multimodal databases (e.g. the RECOLA dataset [23] and the GEMEP corpus [8]) that provide multichannel information but leave out motion data. Investigations of facial expression recognition can be classified into two parts: the detection of facial affect (human emotions) and the detection of facial muscle action (action units) [10]. Currently, a good classification accuracy for basic emotions (more than 90% [7,27]) has been achieved using face images. Speech is another essential channel for emotion recognition. Some emotions, such as sadness and fear, can be distinguished from an audio stream even better than from video [11]. The average recognition level across different studies varies from 45% to 90% and depends on the number of categories, the classifier type and the dataset type [14]. Physiological signals such as cardiovascular, respiratory and electrodermal measures are also successfully applied in emotion recognition. In several studies of biosignal-based affect classification, the recognition rate is more than 80% [6,15]. The analysis of body motion data for emotion recognition has become common only recently. Movement data give recognition rates comparable to facial expressions or speech in multimodal scenarios and also improve overall accuracy in multimodal systems when combined with other modalities [17]. In general, recognition based on several modalities gives better results than one-channel recognition [12,22]. Thus, a vast amount of research has been conducted, but the problem of emotion recognition remains challenging. As existing datasets have some limitations, we decided to collect a multimodal database in Russian with multiple channels, including motion and physiology data along with audio-visual data. The Russian Acted Multimodal Affective Set (RAMAS) expands and complements existing datasets for the needs of affective computing.
2
RAMAS Dataset
2.1 Dataset Collection
The Russian Acted Multimodal Affective Set (RAMAS) consists of multimodal recordings of improvised affective dyadic interactions in Russian (see example in “Fig. 1”). It was obtained in 2016–2017 by Neurodata Lab LLC [3]. Ten semi-professional actors (5 men and 5 women, 18–28 years old, native Russian speakers) participated in the data collection. Semi-professional actors are more
Fig. 1. Examples of RAMAS database. A. Screenshot from a close-up video. B. Screenshot from a full-scene video with depicted skeleton data.
suitable than professional actors for analyzing movements in emotional states, as professional theater actors may use stereotypical motion patterns [24,29]. The actors were given scenarios with descriptions of different situations, but no exact lines. They were encouraged to gesticulate and move, but they had to stay in a certain, specially marked part of the room to achieve stable close-up footage of their faces. In order to perform refined motion tracking, all the actors were dressed in tight black clothes. First of all, the participants contributed a sample of their neutral emotional state. Then they improvised for 30 to 60 seconds on each of the 13 topics. Interactions were conducted in mixed-gender dyads in which the participants were assigned to be either friends or colleagues. The scenarios implied the presence of one of the six basic emotions (Anger, Sadness, Disgust, Happiness, Fear, and Surprise) or the neutral state in each dialogue. In each scenario, one actor was instructed to play a dominant role and the other a submissive role, and these roles were balanced across scenarios. English translations of the scenarios can be found at [4]. Each scenario was played out from two to five times to achieve the best quality and the highest variety in emotional and behavioral expression. The roles were assigned to the actors in such a way that all states were evenly distributed between men and women. We also asked the subjects to fill in short questionnaires (self-reports) after each played scenario in order to note the emotions they really felt during the process. The actors received fixed payments for the production days and consented to the usage of all of the recordings for scientific and commercial purposes.
2.2 Apparatus and Recording Setup
Audio was recorded with Sennheiser EW 112-p G3 portable wireless microphone system (wav format, 32-bit, 48000 Hz). Microphones were placed on the participants’ necklines. General acoustic scene was obtained by stereo Zoom H5 recorder. Microsoft Kinect RGB-D sensor v. 2 [2] was used to gather 3D skeleton data, RGB and depth videos (of both actors simultaneously). A green background was used to ensure the motion capture and video quality enhancement. Close-up videos of each participant were recorded by means of two cameras (Canon HF G40 and Panasonic HC-V760). Photoplethysmogram (PPG) and
electrodermal activity (EDA) were registered with the Consensys GSR Development Kit [1]. As a result, the RAMAS database comprises approximately 7 h of synchronized multimodal information, including audio, 3D skeleton data, RGB and depth video, EDA, and PPG. We used the SSI software to synchronize the streams [30].
3
Post-processing and Annotation
3.1 Annotation
We asked 21 annotators (18–33 years old, 71% women) to evaluate the emotions in the received video-audio fragments. Each annotator worked with 150 video fragments (except for two annotators who had 150 fragments in sum), and at least five annotators marked each fragment. We asked all the applicants to take an emotional intelligence test [19,25], and only those who scored average or above average were picked to annotate the material. The Elan tool [26] from the Max Planck Institute for Psycholinguistics (the Netherlands) was used for emotion annotation. Oral and written instructions as well as templates of all the emotional states were provided. The task was to mark the beginning and the end of each emotion that seemed natural. The work of the annotators was monitored and paid for.
3.2 Inter-Annotator Agreement
We used Krippendorff's alpha statistic [18] to estimate the amount of inter-rater agreement in the RAMAS database. Alpha defines the two reliability scale points as 1.0 for perfect reliability and 0.0 for the absence of reliability, with alpha < 0 when disagreements are systematic and exceed what can be expected by chance. We chose Krippendorff's alpha because it is applicable to any number of coders, allows for reliability data with missing categories or scale points, and measures agreement for nominal, ordinal, interval, and ratio data [16]. Alpha (α) is calculated as follows:

\alpha = 1 - \frac{D_0}{D_e},   (1)
where D_0 is the disagreement observed and D_e is the disagreement expected by chance. The results of annotators who labelled the same video fragments (150 in most cases) were grouped, and alpha was computed for each group for each of the 9 emotional, social and neutral scales. Elan allows annotations of variable length (i.e. the annotator determines the starting and ending point of each emotion). Consequently, to compute reliability with Krippendorff's alpha, we split all annotations into one-second intervals. The average alpha for the RAMAS dataset is 0.44. The mean and median statistics for each scale are presented in Table 1.
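For reference, nominal Krippendorff's alpha over such one-second interval labels can be computed as in the sketch below; the coincidence-matrix formulation follows the general definition in [16,18], and the tiny example data are invented.

```python
import numpy as np
from itertools import permutations

def krippendorff_alpha_nominal(data):
    """data: list of units (one-second intervals), each a list of labels per
    coder, with None for missing ratings."""
    values = sorted({v for unit in data for v in unit if v is not None})
    index = {v: i for i, v in enumerate(values)}
    o = np.zeros((len(values), len(values)))            # coincidence matrix
    for unit in data:
        present = [index[v] for v in unit if v is not None]
        m = len(present)
        if m < 2:
            continue
        for a, b in permutations(present, 2):            # ordered value pairs
            o[a, b] += 1.0 / (m - 1)
    n_c = o.sum(axis=1)
    n = n_c.sum()
    d_obs = (o.sum() - np.trace(o)) / n                  # observed disagreement
    d_exp = (n ** 2 - (n_c ** 2).sum()) / (n * (n - 1))  # expected disagreement
    return 1.0 - d_obs / d_exp

units = [["Fear", "Fear", None], ["Joy", "Fear", "Joy"], ["Joy", "Joy", "Joy"]]
print(round(krippendorff_alpha_nominal(units), 2))       # ~0.53 for this toy data
```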
Table 1. Mean and median Krippendorff's alpha for each scale

| Scale      | Mean alpha | Median alpha |
|------------|------------|--------------|
| Disgust    | 0.54       | 0.66         |
| Happiness  | 0.58       | 0.6          |
| Anger      | 0.5        | 0.56         |
| Fear       | 0.47       | 0.48         |
| Domination | 0.45       | 0.46         |
| Submission | 0.46       | 0.44         |
| Surprise   | 0.41       | 0.38         |
| Sadness    | 0.35       | 0.31         |
| Neutral    | 0.22       | 0.07         |
3.3 Self-report Analysis
The real emotions of the actors were collected by means of self-reports given after the last take of each played scenario. The actors were asked to fill in short questionnaires and evaluate their state during the scenario (Angry, Sad, Disgusted, Happy, Scared, and Surprised) on a 5-point Likert scale (1 = did not experience the emotion, 5 = experienced it a lot). They also answered a question about the complexity of the scenario played (1 = very easy, 5 = very difficult). We analyzed the self-reports trying to answer the following questions: (1) Are there any differences between the emotions that the actors experienced across all the scenarios? (2) Are there any differences in the complexity of the scenarios? If yes, which kinds of emotions (types of scenarios) were more difficult to play? (3) What are the relations between played and experienced emotions? Did the actors really feel the same emotions they played? Question #1: The differences between experienced emotions. First, we tested the diversity of emotions each actor experienced during the experiment. We wanted to find out which emotions they experienced more often, regardless of the type of emotion in the scenario. The answers from the self-reports were compared with pairwise Wilcoxon rank sum tests. There were no significant differences between the experienced emotions. That means the actors experienced a balanced overall amount of all the emotions across all sessions. Since the number of emotions in the scenarios was balanced, this also corresponds to the experienced emotions. Question #2: Complexity of the scenarios. We analyzed which kinds of scenarios were more difficult to play according to the actors' self-reports. For this purpose, a logistic regression model was constructed, with the complexity evaluations as the dependent variable and the type of scenario as the predictor, and the comparisons of interest were tested using contrasts (lsmeans). The disgust scenarios were more difficult than the anger (p = 0.002, z = −3.773) and happiness (p = 0.018, z = 3.191) scenarios, and the fear scenarios were more difficult than the anger scenarios (p = 0.014, z = −3.264).
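A hedged sketch of the Question #1 comparison is given below; the `reports` dictionary is synthetic stand-in data, and the Bonferroni correction is an assumption of the sketch (the study does not state which correction, if any, was applied).

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu   # Wilcoxon rank-sum test

emotions = ["Anger", "Sadness", "Disgust", "Happiness", "Fear", "Surprise"]
rng = np.random.default_rng(1)
reports = {e: rng.integers(1, 6, size=40) for e in emotions}  # synthetic 1-5 scores

pairs = list(combinations(emotions, 2))                       # 15 pairwise tests
for a, b in pairs:
    stat, p = mannwhitneyu(reports[a], reports[b], alternative="two-sided")
    p_adj = min(1.0, p * len(pairs))                          # Bonferroni correction
    print(f"{a} vs {b}: U = {stat:.1f}, corrected p = {p_adj:.3f}")
```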
Then we studied the effect the domination–submission type had on the evaluated complexity of the scenario. A logistic regression model and contrasts (lsmeans) revealed that there were no differences between these types of scenarios. Question #3: Relations between played and experienced emotions. First, we studied how the intensity of the experienced emotions depended on individual differences between the actors. We computed the sum of all emotion scores in each answer (the intensity of emotions, ranging from 6 to 30 per answer, M = 10.5, SD = 2.6 over all answers). Then a generalized linear model was created with the intensity as the response and the actors as predictors, and tested with a type-II ANOVA. The model was significant (p < 0.001), which means that the intensity of the experienced emotions varied from actor to actor. Since the emotion intensity depended on the actor, we normalized the evaluations for further analysis: we divided the score of each emotion by the sum of all emotion scores in the answer. The normalized scores of the experienced emotions were compared to the emotions that the actors had to play according to the scenarios. We analyzed the relation between played and experienced emotions using proportional odds logistic regression, with the normalized questionnaire score as the response and a logical variable (reflecting whether the played emotion matched the felt emotion) as the predictor. A type-II ANOVA revealed that this predictor was significant, that is: the actors tended to evaluate the emotion they had just played as their most intense feeling. In other words, the actors reported experiencing the same emotions they had played out in the scenario (see Fig. 2).
Fig. 2. Played and experienced emotions. False – the emotion in the scenario did not match the emotion in the self-report, true – they coincided. Plot with mean values and bootstrap estimated confidence intervals (CI).
The properties of the database are summarized in Table 3. The RAMAS database is a novel contribution to the affective computing area, since it contains multimodal recordings of the six basic emotions and two social categories in the Russian language.
4
Discussion
We collected the first emotional multimodal database in the Russian language. Semi-professional actors played out prepared scenarios and expressed basic emotions (Anger, Sadness, Disgust, Happiness, Fear and Surprise), as well as two social interaction characteristics – Domination and Submission. Audio, close-up and whole-scene videos, motion capture and physiology data were collected. Twenty-one annotators marked the emotions in the received videos; at least five annotators marked each video. The analysis of the annotations revealed that the RAMAS database has moderate inter-rater agreement (Krippendorff's alpha = 0.44). Among all the scales except the neutral condition, the smallest inter-rater agreement was observed for sadness (0.35), while the largest agreement was observed for happiness (0.58). After playing out each scenario, all the actors answered several short questions about the emotions they experienced. The analysis of these answers revealed that the actors experienced a balanced overall amount of all emotions across all sessions (Table 2).
Table 2. Scripts, videos and scenarios in the RAMAS database

| Expression                        | Written scripts | Number of videos | Length, minutes |
|-----------------------------------|-----------------|------------------|-----------------|
| Emotions                          |                 |                  |                 |
| Disgust                           | 4               | 80               | 56.6            |
| Happiness                         | 4               | 64               | 42.2            |
| Anger                             | 5               | 62               | 45.5            |
| Fear                              | 5               | 94               | 65.8            |
| Surprise                          | 5               | 70               | 44.6            |
| Sadness                           | 5               | 84               | 63.8            |
| Neutral (speaker + listener)      | 6 + 6           | 63 + 64          | 37.8 + 38.6     |
| Social scales                     |                 |                  |                 |
| Domination (emotional + neutral)  | 14 + 6          | 227 + 63         | 158.8 + 37.8    |
| Submission (emotional + neutral)  | 14 + 6          | 227 + 64         | 158.8 + 38.6    |
The analysis of the question about the complexity of the played emotion revealed that the actors had more difficulties with the disgust scenarios compared to the anger and happiness scenarios, and that the fear scenarios were more difficult for them than the anger scenarios. There were no differences between the complexity of playing the dominative vs. the submissive types of the scenarios.
Table 3. Basic properties of the RAMAS database

| Number of videos                             | 581                  |
| General length of database, minutes/hours    | 395/6.6              |
| Number of actors                             | 10 (5 men, 5 women)  |
| Age of actors                                | 18–28 years          |
| Language of videos                           | Russian              |
| Video length, seconds (min/max/average)      | 9/96/41              |
| Number of annotators                         | 21 (6 men, 15 women) |
| Inter-rater agreement (Krippendorff's alpha) | 0.44                 |
5
Conclusion
RAMAS is a unique play-acted multimodal corpus in the Russian language. The database is open and provides the research community with multimodal data covering the face, speech, gesture and physiology modalities. Such material is useful for various investigations and for the development of automatic affective systems. It is also applicable to psychological, psychophysiological and linguistic studies. Access to the database is available at [5]. Acknowledgements. Supported by Neurodata Lab LLC. The authors would like to thank Elena Arkova for finding the actors and helping with the scenarios and the experimental procedure, and Irina Vetrova for evaluating the emotional intelligence of the annotators with the MSCEIT v2.0 test.
References 1. Consensys GSR development kit. http://www.shimmersensing.com/products/gsroptical-pulse-development-kit 2. Kinect v. 2. http://www.microsoft.com/en-us/kinectforwindows 3. Neurodata Lab LLC. http://www.neurodatalab.com 4. RAMAS scripts. http://neurodatalab.com/upload/technologies files/scenarios/ Scripts RAMAS.pdf 5. RAMAS database (2016). http://neurodatalab.com/en/projects/RAMAS
6. Anderson, A., Hsiao, T., Metsis, V.: Classification of emotional arousal during multimedia exposure. In: Proceedings of the 10th International Conference on Pervasive Technologies Related to Assistive Environments, pp. 181–184. ACM (2017) 7. Ayvaz, U., G¨ ur¨ uler, H., Devrim, M.O.: Use of facial emotion recognition in elearning systems. Inf. Technol. Learn. Tools 60(4), 95–104 (2017) 8. B¨ anziger, T., Pirker, H., Scherer, K.: GEMEP-GEneva multimodal emotion portrayals: a corpus for the study of multimodal emotional expressions. In: Proceedings of LREC, vol. 6, pp. 15–19 (2006) 9. Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335 (2008) 10. Chaw, T.V., Khor, S.W., Lau, P.Y.: Facial expression recognition using correlation of eyes regions. In: The FICT Colloquium 2016, p. 34, December 2016 11. De Silva, L.C., Miyasato, T., Nakatsu, R.: Facial emotion recognition using multimodal information. In: Proceedings of 1997 International Conference on Information, Communications and Signal Processing, ICICS 1997, vol. 1, pp. 397–401. IEEE (1997) 12. D’mello, S.K., Kory, J.: A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. 47(3), 43:1–43:36 (2015). http://doi.acm.org/10.1145/2682899 13. Douglas, M.: Purity and danger: an analysis of pollution and taboo London (1966) 14. El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011) 15. Gouizi, K., Bereksi Reguig, F., Maaoui, C.: Emotion recognition from physiological signals. J. Med. Eng. Technol. 35(6–7), 300–307 (2011) 16. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1(1), 77–89 (2007) 17. Karg, M., Samadani, A.A., Gorbet, R., K¨ uhnlenz, K., Hoey, J., Kuli´c, D.: Body movements for affective expression: a survey of automatic recognition and generation. IEEE Trans. Affect. Comput. 4(4), 341–359 (2013) 18. Krippendorff, K.: Estimating the reliability, systematic error and random error of interval data. Educ. Psychol. Meas. 30(1), 61–70 (1970) 19. Mayer, J.D., Salovey, P., Caruso, D.R., Sitarenios, G.: Measuring emotional intelligence with the MSCEIT V2. 0. Emotion 3(1), 97 (2003) 20. Metallinou, A., Lee, C.C., Busso, C., Carnicke, S., Narayanan, S.: The USC creativeIT database: a multimodal database of theatrical improvisation. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, p. 55 (2010) 21. Rachman, S.: Anxiety. Psychology Press Ltd., Publishers, East Sussex (1998) 22. Ranganathan, H., Chakraborty, S., Panchanathan, S.: Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9, March 2016. https://doi.org/ 10.1109/WACV.2016.7477679 23. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013) 24. Russell, J.A., Fern´ andez-Dols, J.M.: The Psychology of Facial Expression. Cambridge University Press, Cambridge (1997)
25. Sergienko, E.G., Vetrova, I.I., Volochkov, A.A., Popov, A.Y.: Adaptation of J. Mayer P. Salovey and D. Caruso emotional intelligence test on russian-speaking sample. Psikhologicheskii Zhurnal 31(1), 55–73 (2010) 26. Sloetjes, H., Wittenburg, P.: Annotation by category: ELAN and ISO DCR. In: LREC (2008) 27. Tarnowski, P., Kolodziej, M., Majkowski, A., Rak, R.J.: Emotion recognition using facial expressions. Procedia Comput. Sci. 108, 1175–1184 (2017) 28. Tomkins, S.: Affect Imagery Consciousness: Volume II: The Negative Affects. Springer, New York (1963) 29. Volkova, E., De La Rosa, S., B¨ ulthoff, H.H., Mohler, B.: The MPI emotional body expressions database for narrative scenarios. PloS one 9(12), e113647 (2014) 30. Wagner, J., Lingenfelser, F., Baur, T., Damian, I., Kistler, F., Andr´e, E.: The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 831–834. ACM (2013)
Investigating Word Segmentation Techniques for German Using Finite-State Transducers
Gábor Pintér, Mira Schielke, and Rico Petrick
Linguwerk GmbH, 01069 Dresden, Germany
{gabor.pinter,mira.schielke,rico.petrick}@linguwerk.com
Abstract. Word segmentation plays an important role in speech recognition as a text pre-processing step that helps decrease out-of-vocabulary items and lower language model perplexity. Segmentation is applied mainly to agglutinative languages, but other morphologically rich languages, such as German, can also benefit from this technique. Using a relatively small, manually collected broadcast corpus of 134k tokens, the current study investigates how Finite-State Transducers (FSTs) can be applied to perform word segmentation in German. It is shown that FSTs incorporating word-formation rules can reach high segmentation performance, with 0.97 precision and a 0.93 recall rate. It is also shown that FSTs incorporating n-gram models of manually segmented data can reach even higher performance, with precision and recall rates of 0.97. This result is remarkable considering that the bottom-up approach performs on par with the expert system without requiring explicit knowledge about morphological categories or word-formation rules.
Keywords: Word segmentation · Text processing · Morphology
1 Introduction
Agglutinative languages, such as Turkish or Japanese, are commonly reported to be challenging for automatic speech recognition (ASR), partly because the vocabulary of these languages cannot be effectively accounted for by a simple enumeration of words. Listing is impractical due to the highly productive derivational and inflectional morphology. This morphological characteristic may pose problems for speech recognition, as the large number of word types can lead to high out-of-vocabulary rates in the pronunciation lexicon and high perplexities in language models. German is not an agglutinative language, but its relatively complex inflectional system and its productive compounding characteristics raise problems similar to those of agglutinative languages [4–6,13,16]. For example, German adjective modifiers can have different endings according to the gender, number and case of the nouns they modify (e.g., das kalt-e Bier 'cold beer [nom]', dem kalt-en Bier 'cold beer [dat]', kalt-es Wasser 'cold water [acc]'). Calculating
only with eleven forms per adjective (null, -e -en -em -er -es -ste -sten -stem -ster -stes) would require adding each adjective eleven times to the lexicon. Compounding can also considerably increase the vocabulary size in German, as spelling conventions require most compounds to be written as single words, without any hints of morpheme boundaries. This practice results in such super-long word formations as the infamous Donaudampfschifffahrtsgesellschaft 'Danube Steamboat Shipping Company'.
An apparent solution to these problems is to split up word forms, that is, to introduce some kind of word segmentation step that provides input for lexicon and language model related tasks. There are several word segmentation techniques and tools available, ranging from morphological analyzers to completely data-driven, unsupervised segmentation techniques. General-purpose morphological analyzers, such as Tagh [7] for German or ChaSen [11] for Japanese, provide full-fledged, linguistically accurate morphological analysis, in which morpheme boundaries can be used as splitting points. Although it is not uncommon for studies to implement custom, morphology-based segmentation tools [2,14], the costs associated with the development and maintenance of general morphological analyzers are prohibitive. Languages without appropriate morphological analyzers can be processed by self- or semi-supervised, data-driven algorithms that identify sub-word units automatically, without relying on morphological information. Besides some sporadic, heuristically formulated attempts [10,17], Morfessor [3] has to be highlighted as an established data-driven segmentation tool, frequently occurring in studies concerned with sub-word models of speech recognition [18–20]. While data-driven tools are extremely convenient and their performance tends to improve with more data, they can produce unexpected errors, and their behavior is difficult to control.
The current study aims to briefly overview how Finite-State Transducers (FSTs) can be used for word segmentation, and to provide a simple performance measure for the techniques introduced—using German data. FSTs can function as a convenient mechanism to segment words, and are often used in morphological analyzers. But FSTs can also operate using bottom-up information, for example in the form of n-gram models. This study introduces and compares two top-down and a range of bottom-up FST models for word segmentation. As preliminary experiments show, more morphological knowledge leads to better segmentation performance, but self-supervised approaches—with no morphological knowledge—can perform on par with expert systems.
2 Word Segmentation with Transducers
2.1 Morphological Analysis as Segmentation
The simplest word segmentation transducer can be constructed similarly to a two-level morphological parser [8], except that instead of underlying morphemes and features the output contains only the split input. Input and output labels share the same set of characters with extra segment boundary symbols (e.g. ‘+’) on the output side. The segmentation transducer is defined as a closure over all
acceptable segments. Figure 1 demonstrates a sample transducer that splits up the input compound zeitraum ‘time period’ into its components: zeit+raum.1
Fig. 1. A sample word segmenter FST.
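The closure idea behind Fig. 1 can be emulated outside the FST framework. The following plain-Python sketch is not the authors' implementation; the tiny lexicon is invented purely for illustration. It enumerates every way of covering a word with items from a segment set, which is exactly what the closure over acceptable segments does.

```python
# Plain-Python sketch of a segmenter defined as a closure over segments.
# The lexicon below is illustrative only.

def segmentations(word, lexicon):
    """Enumerate every way to cover `word` with items from `lexicon`."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results

lexicon = {"zeit", "raum", "zeitraum"}
print(segmentations("zeitraum", lexicon))
# [['zeit', 'raum'], ['zeitraum']] -- any concatenation of lexicon items is
# accepted, which is the over-generation problem discussed next.
```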
A transducer with a simple closure over all lexical items, however, is not an effective segmenter, because it accepts any sequence of segments in any order. For example, the transducer above also accepts nonsense words like zeitzeit or raumraumraum. This problem of over-generation can be addressed by incorporating word-formation constraints into the transducer. A widely utilized technique is to formulate constraints over morphological categories, such as prefixes and suffixes. Figure 2 displays a transducer that accepts prefixes only at the beginning of words and suffixes only at the end. For example, the prefix ab- attaches only to the left side of verbs, the suffix -ung only to their right side (e.g. ab+schaff+ung).
Fig. 2. Segmenter FST with morphological knowledge about prefixes and suffixes.
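The positional constraint of Fig. 2 can likewise be sketched without an FST library. The category lists below are assumptions made only for the example; the point is the prefix/stem/suffix ordering, not the actual German affix inventory.

```python
# Sketch of the Fig. 2 constraint: prefixes may only open a word, suffixes may
# only close it, and at least one stem must occur in between.

PREFIXES = {"ab", "un"}
STEMS = {"schaff", "zeit", "raum"}
SUFFIXES = {"ung", "en"}

def parses(word, state="start"):
    """Yield segmentations that respect the prefix/stem/suffix ordering."""
    if word == "":
        if state in ("stem", "suffix"):      # a bare prefix is not a word
            yield []
        return
    for i in range(1, len(word) + 1):
        seg, rest = word[:i], word[i:]
        if state == "start" and seg in PREFIXES:
            for tail in parses(rest, "start"):
                yield [seg] + tail
        if state in ("start", "stem") and seg in STEMS:
            for tail in parses(rest, "stem"):
                yield [seg] + tail
        if state in ("stem", "suffix") and seg in SUFFIXES:
            for tail in parses(rest, "suffix"):
                yield [seg] + tail

print(list(parses("abschaffung")))   # [['ab', 'schaff', 'ung']]
print(list(parses("ungab")))         # [] -- a suffix cannot open a word
```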
This naive prefix/suffix-only approach also leaves plenty of room for over-generation. The system can be greatly improved by incorporating more fine-grained word-formation rules, taking affix types, part-of-speech categories and other subcategorization features into consideration. Figure 3 represents a more sophisticated attempt at a morphological approach using various derivational and inflectional suffix types. Although discovering and implementing word-formation rules is a tedious task, it can lead to remarkable segmentation performance, as demonstrated by German morphological analyzers such as Tagh [7].
1 Words are lowercased for clarity. In German, nouns start with capital letters, so the segmentation would more correctly be: Zeitraum → Zeit+Raum.
Fig. 3. Excerpt of a segmenter FST with expert morphological knowledge.
2.2 Supervised Word Segmentation with N-Grams
While morphological analyzers are obvious choices for segmenting words, the analysis they provide is not necessarily optimal for further processing. For instance, word stems combined with morphological features, instead of the written forms, do not provide optimal input for grapheme-to-phoneme algorithms (e.g. wirfst → werfen<V><2><Sg>). Also, too short morphemes can be sub-optimal for speech recognition tasks. These, along with similar constraints, can easily result in disagreements with the morphological analysis. Data-driven segmentation techniques can remedy this problem by providing a means of learning arbitrary segmentation patterns. One way of doing this is by training n-gram models on segmented data. The idea of using n-gram-based segmentation as a text pre-processing step is an established method for Asian languages [9], but it has also been applied to German. Incorporation of n-gram models into segmentation FSTs is not a complicated task: FST-based language models are commonly used in various speech and language processing tasks [12,15]. A notable problem concerning the combination of segmentation and n-gram FSTs is that FST-based segmenters typically operate on characters, while n-gram models are defined over words or morphemes. This mismatch can be easily remedied by rewriting character sequences to morpheme labels in segmenters, as demonstrated in Fig. 4. As a first approximation, n-gram information can be integrated into the segmentation process in two steps. First, a lattice of possible segmentations is created; second, this lattice is re-scored with an n-gram FST. The two-step approach, however, is slow and cumbersome. A more elegant approach is to merge the segmenter and the n-gram transducers into one FST. The merged—or composed—FST preserves the overall structure and weights of the n-gram
Fig. 4. Word segmenter from Fig. 1 with morpheme-level output labels.
Fig. 5. Fragment of a transducer n-gram model with arcs for an, ab and abend. Epsilon output labels are omitted for clarity.
Fig. 6. Determinized and weight-pushed version of transducer in Fig. 5.
transducer. Figure 5 displays a fragment of an n-gram FST whose input morpheme arcs were replaced by characters. Making the resulting transducer deterministic and sorting it by input label are useful optimization steps as they help reduce model size and enable faster search of arcs. Figure 6 represents an optimized version of the FST of Fig. 5. A disadvantage of these models is that they require custom search and composition algorithms, as their treatment of back-off and epsilon arcs is different from standard FST-based n-gram models.
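The effect of composing the segmenter with an n-gram model can be imitated in plain Python by scoring candidate segmentations with a morpheme-level bigram. The probabilities below are invented for illustration; in the experiments they come from n-gram models trained on the segmented folds.

```python
# Enumerate candidate segmentations and pick the one with the best bigram
# log-probability; a stand-in for segmenter/n-gram FST composition.
import math

BIGRAM_LOGP = {                      # log P(next | previous), toy values
    ("<s>", "zeit"): math.log(0.02),
    ("zeit", "raum"): math.log(0.10),
    ("raum", "</s>"): math.log(0.30),
    ("<s>", "zeitraum"): math.log(0.001),
    ("zeitraum", "</s>"): math.log(0.30),
}
FLOOR = math.log(1e-8)               # crude back-off for unseen bigrams

def score(segments):
    seq = ["<s>"] + segments + ["</s>"]
    return sum(BIGRAM_LOGP.get(pair, FLOOR) for pair in zip(seq, seq[1:]))

candidates = [["zeit", "raum"], ["zeitraum"]]
print(max(candidates, key=score))    # ['zeit', 'raum'] under these toy scores
```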
3 Experiment
A series of experiments was conducted to compare the performance of top-down with bottom-up approaches to FST-based segmentation. The top-down approach was represented by two FST models that implemented different amounts—Naive
and Expert levels—of morphological knowledge. The bottom-up approach was associated with transducers that were based on n-gram models. A relatively small (134k) broadcast news corpus was used in a 10-fold cross-validation setup to evaluate segmentation performance. The folds were analyzed for perplexity and OOV rates as well as precision, recall and f-measure. In preparation of those calculations, n-gram models with Katz smoothing were trained using 9 folds out of 10. The quality measures were calculated against the retained folds.
3.1 Corpus Data and Segmentation
As there is no standardized way to segment German text, there is also no standardized segmented corpus available. For development and testing purposes, German news broadcast text was collected from the Deutsche Welle news portal www.dw.de between early 2017 and early 2018. The texts, extracted from 207 news reports, were manually normalized and segmented. After normalization, each file contained on average 646.7 tokens. Segmentation involved only the splitting up of words; no morphological categories or features were added. Some examples from the corpus are: Woche-n-arbeit-s-zeit 'hours worked per week', Zahl-reich-e Häuser sind zer-stör-t 'several houses are destroyed'. Admittedly, this manual segmentation diverged from traditional morphological analyses. For example, in order to keep the lexical model simple, words were kept together if segmentation would have produced an alternative pronunciation, such as with Häuser *→ Haus+er.
3.2 Perplexity and OOV Rates
The corpus had a relatively small size of circa 134k tokens after text normalization. Segmentation increased the token count to 198k, while it almost halved the type count. As expected, the segmented corpus had a lower perplexity of 14.1, compared to 21.4 for the original text (Table 1). Perplexity values were calculated using 3-gram language models with Katz smoothing. As shown in Fig. 7, word segmentation achieved a considerable decrease in perplexity on unseen data: from 219.98 to 79.69 on average. OOV type and token ratios were also calculated for the unseen folds. The weighted average of OOV tokens was 7.47%, which dropped to 1.89% after segmentation. A similar decrease was observed with types: from 20.88% to 9.30% on average. Values for each fold are shown in Fig. 8.

Table 1. Text-normalized and segmented news broadcast data.
Unit       Token counts         Type counts         Perplexity
word       133,664 (100.00%)    18,131 (100.00%)    21.377
morpheme   197,536 (147.79%)    9,934 (54.79%)      14.057
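The OOV and perplexity figures above follow the usual definitions; the minimal sketch below (with placeholder token lists and log-probabilities, not the actual SRILM/OpenGrm output) shows how such values are obtained.

```python
# OOV ratios and perplexity from token lists and per-token log-probabilities.
import math

def oov_rates(train_tokens, test_tokens):
    vocab = set(train_tokens)
    unseen = [t for t in test_tokens if t not in vocab]
    token_oov = len(unseen) / len(test_tokens)
    type_oov = len(set(unseen)) / len(set(test_tokens))
    return token_oov, type_oov

def perplexity(logprobs):
    """logprobs: natural-log probabilities assigned to each test token."""
    return math.exp(-sum(logprobs) / len(logprobs))

print(oov_rates(["zeit", "raum", "zeit"], ["zeit", "haus", "haus"]))
# (0.666..., 0.5): 2 of 3 tokens and 1 of 2 types are out of vocabulary
print(perplexity([math.log(1 / 8)] * 4))   # 8.0 for a uniform 8-word model
```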
Fig. 7. Perplexity values in unseen folds with segmented and unsegmented text.
Fig. 8. Out-of-vocabulary ratios for tokens (left) and for types (right) in unseen folds.
3.3 Segmentation Models
A series of FST-based word segmenters was created following the concepts outlined in Sect. 2. A Naive model was created with an FST structure relying only on three morpheme categories: prefixes, suffixes and stems (cf. Fig. 2). Weights were set to a constant value for all segments to prefer longer chunks. The Expert model implemented a thorough, but non-exhaustive set of morphological rules (see Fig. 3). The weights were defined manually, based on experimentation. Both Naive and Expert models used around 80% of the corpus as a development set. Other sources, such as affix dictionaries and word lists were also used to define morpheme classes, transducer structure and weights.
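The Naive model's constant weight per segment amounts to preferring the parse with the fewest (i.e. longest) chunks. A small dynamic program reproduces this behaviour; the lexicon and cost are illustrative, not the actual Naive model.

```python
# Cheapest segmentation when every segment carries the same constant cost.

def cheapest_split(word, lexicon, cost_per_segment=1.0):
    INF = float("inf")
    best = [INF] * (len(word) + 1)   # best[i] = cost of splitting word[:i]
    back = [None] * (len(word) + 1)
    best[0] = 0.0
    for j in range(1, len(word) + 1):
        for i in range(j):
            if word[i:j] in lexicon and best[i] + cost_per_segment < best[j]:
                best[j] = best[i] + cost_per_segment
                back[j] = i
    if best[-1] == INF:
        return None                  # not parsable: keep the word unsplit
    segments, j = [], len(word)
    while j > 0:
        segments.append(word[back[j]:j])
        j = back[j]
    return segments[::-1]

lexicon = {"zeit", "raum", "zeitraum", "arbeit", "s"}   # "s" as linking element
print(cheapest_split("zeitraum", lexicon))      # ['zeitraum'] (one segment wins)
print(cheapest_split("arbeitszeit", lexicon))   # ['arbeit', 's', 'zeit']
```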
In addition to the two top-down approaches, five data-driven models were created using 1- to 5-gram language models with Katz smoothing. In preparation of these models, transducer-based n-gram models were first trained using the normalized and segmented text of the training folds. Next, word and morpheme labels in the n-gram transducer were replaced by character sequences on the input side (cf. Fig. 4). Finally, the transducers were determinized, minimized, and the weights were pushed forward for faster performance (cf. Fig. 6). A special, non-epsilon symbol was used as the back-off arc label. All transducers and necessary tools were developed using OpenFst [1] and OpenGrm [15].
3.4 Results
Recall, precision and f-measure values were calculated to evaluate segmentation performance. The unseen data folds from the cross-validation setup were used as test sets for the n-gram models. For the Naive and Expert models, the separation of seen and unseen data was not consistent, as parts of the corpus were used—besides other sources—to manually discover morphological generalizations. For easier comparison, the same "unseen" folds were used for all segmentation models. Table 2 summarizes the means of the performance metrics over the test sets. A visual presentation of precision and recall values with medians is given in Fig. 9.

Table 2. Segmentation performance: mean values over "unseen" folds.
            Naive    Expert   1-gram   2-gram   3-gram   4-gram   5-gram
Recall      0.9140   0.9324   0.9410   0.9684   0.9686   0.9686   0.9686
Precision   0.8739   0.9656   0.9468   0.9673   0.9661   0.9657   0.9657
f-Measure   0.8935   0.9487   0.9438   0.9678   0.9673   0.9671   0.9671

Fig. 9. Segmentation performance: recall (left) and precision (right).

3.5 Discussion
In terms of f-measures, the best segmentation performance was achieved by the 2-gram model. This result, however, is not significantly different from other
higher order n-gram models. Unquestionably the Naive approach had the worst performance among the compared models. This result is not surprising given its over-simplified morphological model. Incorporating more sophisticated morphological knowledge proved to be useful as demonstrated by the performance improvements of the Expert model. Of course the question is if such expert systems are worth developing as n-gram models without morphological knowledge can deliver similar performance. A closer look at the errors may influence the interpretation of the seemingly outstanding results. Almost half of OOV words in the unseen folds were named entities in non-affixed forms. These unsplit OOV items did not contribute to the evaluation as non-parsable input words were treated as single units.2 Thus neither the reference nor the hypotheses had morpheme boundaries. Provided that words used for training are segmented correctly, the seen data together with non-splittable OOV items can account for the seemingly impressive results. The low error rates are attributable to the low number of multi-segment OOV items.
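The precision and recall values of Table 2 can be read as boundary-level scores. The sketch below shows one common way of computing them; the paper does not spell out the exact matching scheme, so the boundary-set formulation here is an assumption.

```python
# Boundary precision/recall/F for a single word: hypothesised split points are
# compared with the reference split points (character offsets inside the word).

def boundaries(segments):
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def precision_recall_f(ref_segs, hyp_segs):
    ref, hyp = boundaries(ref_segs), boundaries(hyp_segs)
    tp = len(ref & hyp)
    precision = tp / len(hyp) if hyp else 1.0
    recall = tp / len(ref) if ref else 1.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(precision_recall_f(["woche", "n", "arbeit", "s", "zeit"],
                         ["wochen", "arbeit", "s", "zeit"]))
# (1.0, 0.75, 0.857...): all hypothesised cuts are correct, one cut is missed
```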
4 Conclusion
The goal of this article was to present a brief overview and a few examples of how FSTs can be used for word segmentation. The introduced top-down and bottom-up approaches, while performing well in the experiments, provide only a limited insight into what FSTs are capable of. For example, top-down models can easily be augmented with stochastic elements; or, inversely, the n-gram approach can integrate morphological classes. It is also possible to detect word-embedded OOV tokens with fall-back arcs in combination with confidence measures. Orthogonal to these technical improvements, another straightforward extension of this research would involve the evaluation of segmentation models in context. The presented low perplexity and OOV rates may imply better ASR performance, but the actual effect on recognition accuracy needs to be verified through experimentation. Although the current literature does not provide a conclusive answer, it seems that segmentation may lead to better ASR performance, but this gain may decrease as the vocabulary size increases [19]. While we cannot answer questions related to speech recognition performance at present, we believe that our work provides a useful basis for further studies concerning word segmentation using finite-state techniques.
2 However, a few OOV words were falsely segmented into—typically short—morphemes, leading to errors (e.g. Tories → Tor+ie+s).
References
1. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: a general and efficient weighted finite-state transducer library. In: Holub, J., Žďárek, J. (eds.) CIAA 2007. LNCS, vol. 4783, pp. 11–23. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76336-9_3
2. Arisoy, E., Saraclar, M.: Compositional neural network language models for agglutinative languages. In: Interspeech 2016, pp. 3494–3498 (2016)
3. Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: ACL Workshop on Morphological and Phonological Learning, pp. 21–30 (2002)
4. El-Desoky, A., Shaik, A., Schlüter, R., Ney, H.: Sub-lexical language models for German LVCSR. In: Spoken Language Technology Workshop, pp. 159–164 (2010)
5. El-Desoky, A., Shaik, A., Schlüter, R., Ney, H.: Morpheme level feature-based language models for German LVCSR. In: Interspeech 2012, pp. 170–173 (2012)
6. Geutner, P.: Using morphology towards better large-vocabulary speech recognition systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 445–448 (1995)
7. Geyken, A., Hanneforth, T.: TAGH: a complete morphology for German based on weighted finite state automata. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds.) FSMNLP 2005. LNCS (LNAI), vol. 4002, pp. 55–66. Springer, Heidelberg (2006). https://doi.org/10.1007/11780885_7
8. Jurafsky, D., Martin, J.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc., Upper Saddle River (2009)
9. Kang, S.-S., Hwang, K.-B.: A language independent n-gram model for word segmentation. In: Sattar, A., Kang, B. (eds.) AI 2006. LNCS (LNAI), vol. 4304, pp. 557–565. Springer, Heidelberg (2006). https://doi.org/10.1007/11941439_60
10. Larson, M., Willett, D., Köhler, J., Rigoll, G.: Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches. In: Interspeech 2000, pp. 945–948 (2000)
11. Matsumoto, Y.: Easy to use practical freeware for natural language processing: morphological analysis system ChaSen. IPSJ Mag. 41(11), 1208–1214 (2000)
12. Mohri, M.: Finite-state transducers in language and speech processing. Comput. Linguist. 23(2), 269–311 (1997)
13. Nußbaum-Thom, M., El-Desoky, A., Schlüter, R., Ney, H.: Compound word recombination for German LVCSR. In: Interspeech 2011, pp. 1449–1452 (2011)
14. Renshaw, D., Hall, K.: Long short-term memory language models with additive morphological features for automatic speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5246–5250 (2015)
15. Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., Tai, T.: The OpenGrm open-source finite-state grammar software libraries. In: ACL 2012 System Demonstrations, pp. 61–66 (2012)
16. Shaik, A., El-Desoky, A., Schlüter, R., Ney, H.: Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR. In: Interspeech 2013, pp. 3404–3408 (2013)
17. Shamraev, N., Batalshchikov, A., Zulkarneev, M., Repalov, S., Shirokova, A.: Weighted finite-state transducer approach to German compound words reconstruction for speech recognition. In: AINL-ISMW FRUCT, pp. 96–101 (2015)
18. Smit, P., Virpioja, S., Kurimo, M.: Improved subword modeling for WFST-based speech recognition. In: Interspeech 2017, pp. 2551–2555 (2017)
19. Tachbelie, M., Abate, S., Menzel, W.: Using morphemes in language modeling and automatic speech recognition of Amharic. Nat. Lang. Eng. 20, 235–259 (2012)
20. Zablotskiy, S., Minker, W.: Sub-word language modeling for Russian LVCSR. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 413–421. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23132-7_51
A Comparison of Language Model Training Techniques in a Continuous Speech Recognition System for Serbian
Branislav Popović1,3,4, Edvin Pakoci1,2, and Darko Pekar1,2
1 Department for Power, Electronic and Telecommunication Engineering, Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia
[email protected]
2 AlfaNum Speech Technologies, Bulevar Vojvode Stepe 40, 21000 Novi Sad, Serbia
3 Department for Music Production and Sound Design, Academy of Arts, Alfa BK University, Nemanjina 28, 11000 Belgrade, Serbia
4 Computer Programming Agency Code85 Odžaci, Ive Andrića 1A, 25250 Odžaci, Serbia
Abstract. In this paper, a number of language model training techniques will be examined and utilized in a large vocabulary continuous speech recognition system for the Serbian language (more than 120000 words), namely Mikolov and Yandex RNNLM, TensorFlow based GPU approaches and CUED-RNNLM approach. The baseline acoustic model is a chain sub-sampled time delayed neural network, trained using cross-entropy training and a sequence-level objective function on a database of about 200 h of speech. The baseline language model is a 3-gram model trained on the training part of the database transcriptions and the Serbian journalistic corpus (about 600000 utterances), using the SRILM toolkit and the Kneser-Ney smoothing method, with a pruning value of 10−7 (previous best). The results are analyzed in terms of word and character error rates and the perplexity of a given language model on training and validation sets. Relative improvement of 22.4% (best word error rate of 7.25%) is obtained in comparison to the baseline language model. Keywords: Language modeling
· RNNLM · LSTM · LVCSR
1 Introduction
Language modeling is an essential component of natural language processing and automatic speech recognition systems. In many applications, a good language model can even overcome flaws of an acoustic model by providing the data necessary to recognize natural sentences. It has been shown that language models may come ever closer to human language understanding [1], allowing their application in different domains and for a range of machine learning problems. For the last several decades, statistical language modeling has been based on relatively simple, yet highly effective and widely successful n-grams, i.e., frequencies of word
sequences of up to given length n [2]. This approach, however, has a few well-known issues, such as data sparsity (n-gram approach usually requires smoothing [3]), as well as statistical dependence on a very limited number (n-1) of preceding words (longer contexts are ignored and it is difficult to train them on a limited amount of data). Several attempts have been made in order to improve n-gram results. Nevertheless, they usually bring more complexity, and more importantly, they could beat n-grams only when the amount of training data is limited – in case of larger datasets, n-grams usually came on top. One of those examples is the use of neural networks for language model training (NNLMs). Recurrent neural network based language models (RNNLMs) have been utilized in recent years to resolve issues concerning confusing and difficult implementation and to reduce the computational complexity. Implementation of these networks is much simpler. More importantly, they are able to resolve two major n-gram model issues – they project each word into a compact continuous vector space, which could be described using only a limited set of parameters, and their recurrent connections are able to model longer contexts – sequences of rather arbitrary length, to be precise. Several experiments have already shown their superiority in relation to both n-grams [4] and feed-forward NNLMs [5]. Nonetheless, RNNLMs are very computationally demanding, which can result in low training speeds, especially when using a lot of training data. The focus of this paper is to compare several language model training approaches on the most widely used textual database for Serbian, as well as to compare the results with the previously obtained n-gram based language model, using a fixed acoustic model trained on the same audio database from our previous research [6, 7]. These approaches include Mikolov RNNLM toolkit (a CPU-based implementation) [8], Yandex RNNLM toolkit (a faster RNNLM toolkit variant which uses a Huffman binary tree [9, 10] or an approximation of the Mikolov’s softmax activation function at the output layer with noise contrastive estimation (NCE)) [11], as well as different TensorFlow-based GPU approaches – vanilla, LSTM and fast (pruned) LSTM approach [12], and finally, the CUED-RNNLM toolkit (another GPU-based approach that involves an efficient GPU implementation with an improved training criterion) [13]. The proposed language models have been evaluated and compared in terms of their perplexity (PPL) on the training and validation sets used for language model training, as well as the word error rate (WER) on the given test set, obtained by lattice rescoring (either regular or pruned), or the N-best list rescoring. All the above mention approaches (except CUED-RNNLM) have already been implemented within the Kaldi speech recognition toolkit [14]. Therefore, they do not require any external libraries other than CUDA and TensorFlow (where applicable). Finite state transducers (FSTs) representing language model differences have been used, as well as n-gram approximation techniques for lattice rescoring [15], in order to reduce the amount of computation needed and to prevent lattice explosion.
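The lattice and N-best rescoring mentioned above can be summarised in a few lines. The sketch below is only a schematic view: the hypotheses, scores, interpolation weight and the toy language model are invented for illustration, whereas in the experiments the scores come from Kaldi lattices and the trained RNNLMs.

```python
# Schematic N-best rescoring: swap the old LM score for a new one and re-rank.

def rescore_nbest(nbest, new_lm, lm_weight=10.0):
    """nbest: list of (words, acoustic_logp, old_lm_logp) tuples."""
    rescored = []
    for words, ac_logp, _old_lm_logp in nbest:
        total = ac_logp + lm_weight * new_lm(words)
        rescored.append((total, words))
    return max(rescored)[1]

def toy_lm(words):                    # stand-in for an RNNLM log-probability
    return 0.0 if words == ["dobar", "dan"] else -2.0

nbest = [
    (["dobar", "dan"], -120.0, -8.0),
    (["dobar", "van"], -119.0, -9.5),
]
print(rescore_nbest(nbest, toy_lm))   # ['dobar', 'dan']: the stronger LM score
                                      # outweighs the slightly worse acoustics
```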
2 Theoretical Background For n-gram language models, the probability of each word depends only on up to n previous words. The probability of a word sequence is estimated according to the number of appearances in the training corpus. In the case of highly inflective and morphologically rich languages, this approximation is highly inadequate, because there are many sequences that will never occur (in Serbian, there are 7 cases, 2 numbers, 3 genders, 14 verb forms and several dialects). The baseline language model used in this paper is a 3-gram language model, trained on the training part of the database transcriptions (148558 utterances, see Sect. 3.1), and the additional part coming from the Serbian journalistic corpus for more realistic probability estimation (442000 utterances). The model contains 121197 unigrams, 1279389 bigrams and 357721 trigrams. The Kneser-Ney smoothing method was applied with a pruning value of 10−7. The baseline WER was 9.34%, and the baseline CER 2.46%. In our previous research, it was determined that short words, e.g. prepositions, particles and conjunctions, and vowels, poorly covered by the language model, contribute very significantly to the total number of word errors (number of insertions, deletions and substitutions in LVCSR system). It was also concluded that a more suitable language model could be used in order to resolve most of the issues [6]. Bearing that in mind, several language model training techniques and multiple training configurations have been examined and will be briefly described in the rest of this section. 2.1
TensorFlow-Based GPU Approaches
Three TensorFlow-based GPU approaches have been examined. The first one was the vanilla RNNLM approach (a simple, regular neural network, with a single hidden layer trained using the standard, i.e., vanilla backpropagation algorithm). However, two major issues have been reported concerning these networks - the vanishing gradient problem (the network is unable to learn long-term dependencies), and the exploding gradient problem (the possibility of overflow). Therefore, the second approach, more robust against the problems of long-term dependency, uses somewhat more complicated long short-term memory (LSTM) network. This is a sequence to sequence approach, using one word at a time to produce probabilities for the next word in a sentence. The third approach is also a LSTM approach, but a pruning algorithm and a modified softmax function, i.e., a function able to train a self-normalized network, where sum of outputs is automatically close to zero, are used in order speed up the computation [15]. 2.2
Mikolov RNNLM Toolkit
The Mikolov RNNLM toolkit utilizes a recurrent neural network (RNN) architecture, which consists of the input layer, one hidden layer and the output layer [8]. In the training phase, words are subjected to the input layer in 1-of-N representation and further concatenated with the previous state of the hidden layer. The standard sigmoid
activation function is used for the neurons in the hidden layer. The output layer represents the probability of the current word in relation to the previous word and the state of the hidden layer for the previous time step. The standard stochastic gradient descent algorithm is used for the RNNLM training. The recurrent weights are calculated by using the so-called truncated backpropagation through time (the network is unfolded for the specified amount of time steps). The training is conducted iteratively (11 to 13 iterations in the case of our experiments). One part of the training set was used for the actual training, while the rest (30000 utterances, about 5% of them) was used for validation purposes (the same configuration was used for each of our experiments). 2.3
Yandex RNNLM Toolkit
This is a faster RNNLM implementation. The topology consists of one input layer (fed by the full history vector, obtained by concatenation of a given word in 1-of-N representation and a continuous vector for the remaining context), one hidden layer that computes another representation using the sigmoid activation function, and one output layer, producing the RNNLM probabilities. Instead of the explicit computation of the output layer normalization term, i.e., the softmax activation function, either a Huffman binary tree [9], that assign short binary codes to the most frequent words, or NCE [11] is used, i.e., a nonlinear logistic regression to discriminate among the observed data and some noise distribution is performed, allowing the efficient implementation during both the training and the testing phase. In the case of our experiments, the –nce option (the number of noise samples) was set to 20 (the recommended value). Around 50% faster training is obtained on average, compared to the Mikolov RNNLM toolkit. 2.4
CUED-RNNLM Toolkit
This is a FRNNLM approach [13], i.e., a RNNLM approach with a full output layer instead of the class based RNNLM. Instead of the conventional objective function based on the cross-entropy criterion, improved training criteria have been implemented, namely variance regularization, which explicitly adds the variance of the normalization term to the standard objective function, and the previously described NCE, where each word is assumed to be generated by both data and noise distributions.
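The shape of the NCE criterion described above can be written down compactly. The NumPy sketch below is not the CUED-RNNLM implementation; all numbers are arbitrary, and the unnormalised scores stand in for the network's output-layer activations.

```python
# Schematic NCE loss: the observed word's unnormalised score is contrasted
# with the scores of k words sampled from a noise distribution.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(target_score, noise_scores, target_noise_p, noise_noise_p, k):
    """Scores are unnormalised log-scores s(w|h); *_noise_p are probabilities
    of the same words under the noise distribution."""
    pos = np.log(sigmoid(target_score - np.log(k * target_noise_p)))
    neg = np.sum(np.log(1.0 - sigmoid(noise_scores - np.log(k * noise_noise_p))))
    return -(pos + neg)               # negative log-likelihood to minimise

# one training example with k = 2 noise samples (numbers are arbitrary)
print(nce_loss(target_score=2.0,
               noise_scores=np.array([-1.0, 0.5]),
               target_noise_p=1e-4,
               noise_noise_p=np.array([1e-3, 1e-3]),
               k=2))
```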
3 Experimental Setup
3.1 Acoustic Model
The baseline acoustic model is a chain sub-sampled time-delay neural network (TDNN), trained using cross-entropy training and a sequence-level objective function, i.e., the log-probability of the correct phone sequence. The training procedure consists of pre-DNN training and DNN training phase. The pre-DNN training phase [3] consists of feature extraction (14 Mel-frequency cepstral coefficients and 3 additional features for pitch: the probability of voicing (POV) - a warped normalized cross correlation function (NCCF) feature, the log-pitch value with POV-weighted mean subtraction
over a 1.5 s window, and delta-pitch - delta feature computed on raw log pitch, plus the first and second order time derivatives of all static features), monophone model training (1000 Gaussians, 40 iterations), several triphone model trainings (first pass - 9000 Gaussians and 1800 states, 35 iterations, second pass - 25000 Gaussians, 3000 states, 35 iterations), and speaker adaptive training (25000 Gaussians, 3000 states, 35 iterations, feature space maximum likelihood linear regression (fMLLR) using diagonal matrices). The DNN phase [7] uses high-resolution features (40 high-resolution mel-frequency cepstral coefficients, calculated on 30 ms frames with 10 ms frame shift, plus the pitch features). The network consists of 8 hidden layers and 625 neurons per hidden layer. In order to reduce the amount of computation, the topology is simplified. Instead of the standard 3-state left-to-right topology, the chain topology can be traversed in a single frame, i.e., the most hidden layers at the output of the neural network have to be evaluated only on every 3rd frame. The "−1,0,1" layer splicing configuration is used for the initial layers (describing 3 consecutive frames), and "−3,0,3" was used for the most hidden layers (describing 3 frames, separated by 3 frames from each other). The TDNN was trained in 4 epochs, i.e., 60 iterations, on a CUDA-enabled GeForce GTX 1080 Ti GPU.
3.2 Data Preparation
The database is comprised of 2 data sets, described in our previous research (see e.g. [3, 6, 7]). The larger one is a set of audio books, read by 132 professional male and female speakers (74 males and 58 females) in a studio environment (mostly high quality audio). This is a set of larger and more complex utterances. The total duration of the first set is 154 h and 3 min. The second one consists of domain-oriented speech, mostly short utterances recorded over various mobile phone devices. This part of the database was added in order to increase the total amount of data and to improve the recognition accuracy in voice assistant type applications. The sentences were spoken by 169 male and 181 female speakers and the total duration was 60 h and 57 min. The training part of the database comprises 197 h of speech (including silence), divided into training (95%) and cross-validation (5%) parts. 18 h of speech (including silence) was selected for testing purposes. Both data sets were recorded in mono PCM format, sampled at 16000 Hz, using 16 bits per sample.
4 Experimental Results In Tables 1 to 6, the results are presented for all the above mentioned toolkits and various language model configurations, and a test vocabulary of around 121000 words. The columns are given in the following order: rescoring type (abbreviated as RT, either language model rescoring of lattices or N-best list rescoring), n-gram order (e.g. if n-gram order is 4, any history that shares the last 3 words would be merged into a single state), number of most frequent words in the shortlist (while the rest of words are grouped together and their probabilities are distributed uniformly according to the respective unigram counts), number of classes and hidden layers (where applicable),
word (WER) and character (CER) error rates, and the perplexity (PPL) for the given training and validation sets. Other parameters (not given in the table) were set to their default values in Kaldi and TensorFlow. The size of the network, i.e., the number of hidden layers and neurons, was tailored in all experiments to obtain the optimal trade-off between training speed and accuracy. In some of the experiments (Tables 4 and 5), N-best rescoring was used instead of lattice rescoring (the number of hypotheses was set to 1000, so the results were quite similar, but the training was much faster).

Table 1. LSTM TensorFlow-based GPU configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        2       200      7.48  2.02  75.072     114.098
Lattice  4       40000   –        2       200      7.31  1.99  75.214     114.289
Lattice  3       57464   –        2       200      7.51  2.02  78.893     124.185
Lattice  4       57464   –        2       200      7.25  1.99  78.820     124.905
Lattice  3       77803   –        2       200      7.35  1.99  81.340     130.842
Lattice  4       77803   –        2       200      7.27  1.98  81.354     130.658
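The WER and CER columns follow the standard edit-distance definition. The following sketch illustrates the computation; the example word sequences are invented and are not taken from the test set.

```python
# Word or character error rate as Levenshtein distance divided by the
# reference length (CER is the same computation over characters).

def error_rate(ref, hyp):
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

ref = "danas je lep dan".split()
hyp = "danas je led dan".split()
print(error_rate(ref, hyp))                   # 0.25 -> 25% WER
print(error_rate(list("lep"), list("led")))   # 0.333... -> CER over characters
```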
In Tables 1 to 3, the results are given for various TensorFlow-based GPU approaches, i.e., LSTM (Table 1), fast (pruned) LSTM (Table 2) and vanilla approach (Table 3). The number of words in the training vocabulary was set to either 40000 (approximately one third of the input lexicon), 57464 (words appearing 5 or more times in the training set) or 77803 (words that appear 3 or more times in the training set), respectively. Other parameters were set to their default values (200 neurons in the hidden layer, 2 layers in the case of LSTM and fast (pruned) LSTM training, and a single layer in the case of vanilla training).

Table 2. Fast (pruned) LSTM TensorFlow-based GPU configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        2       200      7.64  2.05  71.972     123.010
Lattice  4       40000   –        2       200      7.47  2.03  72.017     122.664
Lattice  3       57464   –        2       200      7.65  2.05  76.646     135.569
Lattice  4       57464   –        2       200      7.44  2.01  76.291     135.343
Lattice  3       77803   –        2       200      7.60  2.04  81.263     143.401
Lattice  4       77803   –        2       200      7.58  2.03  80.462     142.515
The average PPL on the validation set was 123.163 for the LSTM case, 133.750 for the fast LSTM case and 422.256 for the vanilla case. Those values correspond well to the average word error rate (LSTM 7.36%, fast LSTM 7.56%, vanilla 8.99%). In Table 4, the results are presented for the Mikolov RNNLM implementation, using the default number of 300 neurons in the hidden layer. The number of classes in
Table 3. Vanilla TensorFlow-based GPU configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        1       200      8.91  2.39  350.179    372.627
Lattice  4       40000   –        1       200      8.93  2.38  386.209    406.665
Lattice  3       57464   –        1       200      8.97  2.41  420.131    443.908
Lattice  4       57464   –        1       200      9.20  2.44  439.877    465.647
Lattice  3       77803   –        1       200      9.00  2.40  377.774    409.681
Lattice  4       77803   –        1       200      8.91  2.38  413.213    435.018
Tables 4 and 5 (the Yandex implementation) was increased with the number of words (400, 450 and 500 classes were examined). The average PPL was 110.242 for the Mikolov RNNLM and 139.618 for the Yandex RNNLM, and the average WER was 7.48% (Mikolov) and 7.55% (Yandex).
Table 4. Mikolov RNNLM configurations.
RT      n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
N-best  3       40000   400      1       300      7.41  2.00  –          116.131
N-best  4       40000   400      1       300      7.53  2.04  –          105.498
N-best  3       57464   450      1       300      7.42  2.01  –          112.605
N-best  4       57464   450      1       300      7.45  2.04  –          104.060
N-best  3       77803   500      1       300      7.59  2.05  –          120.344
N-best  4       77803   500      1       300      7.47  2.03  –          102.813
In Table 6, results are given for the CUED-RNNLM implementation. The average PPL on the validation set was 163.205 and the average WER 7.62%. Concerning their respective training times, the TensorFlow versions completed in about an hour, Mikolov RNNLM trainings took approximately 20 h, and Yandex RNNLM trainings finished in about 10 h, as well as the CUED-RNNLM trainings.
Table 5. Yandex RNNLM configurations.
RT      n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
N-best  3       40000   400      1       300      7.52  2.10  –          134.434
N-best  4       40000   400      1       300      7.72  2.11  –          155.795
N-best  3       57464   450      1       300      7.53  2.07  –          135.226
N-best  4       57464   450      1       300      7.55  2.07  –          133.465
N-best  3       77803   500      1       300      7.54  2.06  –          148.749
N-best  4       77803   500      1       300      7.46  2.06  –          130.039
The best result (7.25% WER, 22.4% relative improvement, i.e., the proposed configuration in Figs. 1 and 2) was obtained for the case of LSTM training, for 57464 words and a 4-gram language model. The number of substitutions was the most prominent (8080 vs. 708 insertions and 2720 deletions). Although the influence of short words and vowels on the final WER is somewhat decreased (more errors are caused by an improper case), the distribution of error, given in terms of the number of unique deletions/insertions/substitutions per total number of deletions/insertions/substitutions, remains almost the same.

Table 6. CUED-RNNLM configurations.
RT       n-gram  Words   Classes  Layers  Neurons  WER   CER   PPL train  PPL valid
Lattice  3       40000   –        1       200      7.67  2.07  114.104    181.733
Lattice  4       40000   –        1       200      7.58  2.04  115.085    183.289
Lattice  3       57464   –        1       200      7.66  2.07  124.129    158.669
Lattice  4       57464   –        1       200      7.59  2.03  124.201    158.745
Lattice  3       77803   –        1       200      7.68  2.07  129.459    149.567
Lattice  4       77803   –        1       200      7.53  2.03  127.410    147.229
The comparison between the results obtained using the baseline (SRILM) language model and the proposed (best) configuration is presented in Fig. 1. In Fig. 2, the distribution of error for the baseline and the proposed configuration, i.e., the percentage of unique deletions/insertions/substitutions in the total number of deletions/insertions/substitutions, is shown. Figure 2 suggests that no prominent new instances of deletions, insertions or substitutions emerged that had not been observed before, but, on the other hand, there were also no new instances of errors in general. Also, the relative number of deletions and insertions is slightly reduced in favor of the number of substitutions.
Fig. 1. A comparison between the baseline and the proposed (best) configuration in terms of the number of deletions, insertions and substitutions.
Fig. 2. The distribution of error for the baseline and the proposed configuration (unique deletions/insertions/substitutions per total number of deletions/insertions/substitutions [%]).
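The quantity plotted in Fig. 2 can be reproduced from per-type error lists as shown below. The error lists here are invented placeholders, not the actual system output.

```python
# Share of unique deletions/insertions/substitutions among all error
# instances of that type (the quantity shown in Fig. 2).

def unique_share(errors):
    return len(set(errors)) / len(errors) if errors else 0.0

deletions     = ["je", "je", "i", "u", "je"]
insertions    = ["i", "i", "a"]
substitutions = [("lep", "led"), ("dan", "van"), ("lep", "led")]

for name, errs in [("deletions", deletions),
                   ("insertions", insertions),
                   ("substitutions", substitutions)]:
    print(name, round(unique_share(errs), 2))
# deletions 0.6, insertions 0.67, substitutions 0.67
```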
5 Conclusion
Several language modeling approaches have been analyzed in this paper. A significant improvement (more than 20% relative) has been obtained in terms of both WER and CER in comparison to the baseline language model, which confirms the hypothesis given in our previous research [6]. However, the distribution of errors among different word forms has remained more or less the same. Bearing in mind the inflective nature of the language and the fact that most of the errors are caused by an improper grammatical case (CER was relatively low in all of the experiments), the class n-gram approach is the next logical step in the development of our LVCSR system.
Acknowledgments. The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project "Development of Dialogue Systems for Serbian and Other South Slavic Languages", EUREKA project DANSPLAT, "A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region", id E! 9944, and the Provincial Secretariat for Higher Education and Scientific Research, within the project "Central Audio-Library of the University of Novi Sad", No. 114-451-2570/2016-02.
References
1. Goodman, J.T.: A bit of progress in language modeling, extended version. Microsoft Research, Technical report MSR-TR-2001-72 (2001)
2. Rosenfeld, R.: Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278 (2000)
3. Pakoci, E., Popović, B., Pekar, D.: Language model optimization for a deep neural network based speech recognition system for Serbian. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 483–492. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_48
4. Mulder, W.D., Bethard, S., Moens, M.F.: A survey on the application of recurrent neural networks to statistical language modeling. Comput. Speech Lang. 30(1), 61–98 (2015)
5. Mikolov, T., Kombrink, S., Burget, L., Černocký, J.H., Khudanpur, S.: Extensions of recurrent neural network language model. In: Proceedings of ICASSP, pp. 5528–5531. IEEE (2011)
6. Popović, B., Pakoci, E., Pekar, D.: End-to-end large vocabulary speech recognition for the Serbian language. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 343–352. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66429-3_33
7. Pakoci, E., Popović, B., Pekar, D.: Fast sequence-trained deep neural network models for Serbian speech recognition. In: 11th Digital Speech and Image Processing, DOGS, Novi Sad, Serbia, pp. 25–28 (2017)
8. Mikolov, T., Kombrink, S., Deoras, A., Burget, L., Černocký, J.H.: RNNLM - recurrent neural network language modeling toolkit. In: Proceedings of ASRU Workshop (2011)
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space, arXiv:1301.3781 (2013)
10. Niu, F., Recht, B., Ré, C., Wright, S.J.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, Chicago, pp. 693–701 (2011)
11. Chen, X., Liu, X., Gales, M.J.F., Woodland, P.C.: Recurrent neural network language model training with noise contrastive estimation for speech recognition. In: Proceedings of ICASSP, pp. 5411–5415. IEEE (2015)
12. Abadi, M.: TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467 (2016)
13. Chen, X., Liu, X., Qian, Y., Gales, M.J.F., Woodland, P.C.: CUED-RNNLM – an open-source toolkit for efficient training and evaluation of recurrent neural network language models. In: Proceedings of ICASSP, pp. 6000–6004. IEEE (2015)
14. Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4. IEEE Signal Processing Society (2011)
15. Xu, H., et al.: A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition (2017)
Perceptual-Auditory Evaluation of the Aggressive Speech Behavior: Gender Aspect (on the Basis of Russian and Spanish Languages) Rodmonga Potapova1 , Liliya Komalova1,2(&) and Vsevolod Potapov3
,
1
Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, Ostozhenka Street 38, 119034 Moscow, Russia {RKPotapova,GenuinePR}@yandex.ru 2 Department of Linguistics, Institute of Scientific Information for Social Sciences of the Russian Academy of Sciences, Nakhimovsky Prospect 51/21, 117997 Moscow, Russia 3 Centre of New Technologies for Humanities, Lomonosov Moscow State University, Leninskije Gory 1, 119991 Moscow, Russia [email protected]
Abstract. The purpose of the research was to identify prosodic features of perceived images of a “male aggressor” (10 Russian samples, 10 Spanish samples) and a “female aggressor” (10 Russian samples, 10 Spanish samples) reconstructed by Russian male (n = 13) and female (n = 42) listeners in the course of the perceptual-auditory experiment. All listeners reported that the speech of all Russian and Spanish subjects (informants) was perceived as if they were externalizing an offensive type of aggression during the aggression escalation period. It is characterized by strong negative emotions, differing in male and female subjects’ (informants’) groups. Qualitative analysis of the prevailing speech prosodic tendencies revealed that in the studied conditions, female listeners estimated the male subjects’ (informants’) voice intensity in the aggressor image as stronger in comparison with the female subjects’ (informants’) voice intensity. Male listeners found female voice pitch in the aggressor image higher than the male aggressor voice pitch. Male and female listeners perceived the Spanish subjects’ (informants’) speech tempo as faster than the Russian subjects’ (informants’) speech tempo. Female listeners considered the perceptualauditory aggressor images to have clear speech rhythm, while male listeners perceived speech rhythm of the aggressor image as irregular (with no gender difference of a speaker). The obtained results are compared with the previous findings on the material of British and American English. Keywords: Aggressive speech behavior Speech prosody Gender Speech perception Emotions Speech rhythm Speech pauses Speech breathing Speech timbre Speech melodic pattern
1 Introduction In the realm of pragmalinguistics, the effects of antisocial speech behavior in general have long been studied. However, less well understood are the perceptional and cognitive mechanisms of aggressive speech behavior that influence bystanders. In the aggressive interaction “bystanders are of particular interest as they have the potential to amend the situation by intervening” [2]. Moreover, “the mere perception of another’s behavior automatically increases the likelihood of engaging in that behavior oneself” [5]. Emotionally colored speech behavior of a speaker can intervene in the decision making towards sharing his / her ideas and follow them. Analyzing political communication, Nau and Stewart emphasize that “verbally aggressive political speakers are perceived as less communicatively appropriate and credible than nonaggressive speakers, and are less likely to win agreement with their messages” [23]. Allan points out the difference between how the insulter, the victim and the onlooker / overhearer (side participant) perceive a certain kind of behavior as insulting, arguing that “verbal insult depends in large part on the language used because the insult arises from its perlocutionary effect” [1]. Berrocal says “victimhood”, as a social role, is constructed in discourse. The author examines a display of parliamentary discourse, which presents the violator as a victim of conspiracy, calling for sympathy and providing selfjustification, on the one hand, and using verbal attacks to undermine and disqualify a number of overt and covert enemies, and highlights the importance of the discourse analysis in case of detecting the real aggressor and victim [3]. The topicality of the speech perception in the frame of the aggressive speech act analysis cannot be overestimated, for the individual-subjective approach prevails over the formalized-objective procedure in the process of interpersonal communication. Based on many years of practical and research experience, Potapova, Potapov, Lebedeva and Agibalova argue that “the human ear is able to distinguish the speaker’s emotions, even in the absence of any indication of this on the part of semantics, vocabulary and grammar” [31, p. 128]. Analyzing speech indicators the listener gets informed about the emotional change from neutral (safe mode) to aggressive (alert mode) mood of the speaker that helps the percipient tune up with the speaker and prepare a reciprocal communicative response. Anyway, the fact of speech aggression recognition signifies the potential threat to the recipient’s psychological integrity. Extreme aggressive behavior and psychophysiological deviations accompanied with aggressive outcomes are well distinguishable and described in modern pathopsychology [8, 20], psychiatry of deviant behavior [4, 19, 35, 36], and sociology of crowds [9, 34]. The aim of this paper is to bring some clarity to debates about the so called “everyday aggression”. Enikolopov, Kuznetsova and Chudova [7] introduce the concept as the aggression manifested by law-abiding, mentally healthy, educated citizens. The consequences of such aggression are hardly recognized even by victims themselves. Damages are invisible for legal prosecution and cannot be avoided in everyday life. Harmful psychological damage of everyday aggression to emotional, cognitive and axiological spheres is quite real and manifests itself both on individual and social levels of interaction [7, p. 5]. Among possible destructive consequences of
this phenomenon are individual and social adaptability declines, emotional destabilization, and inefficiency in solving problems. Due to the fact that the recognition and evaluation of everyday aggression is made by recipients (victims and bystanders) on the basis of subjective (speech, visual, tactile) perception, the experimental approach involving audio-perceptual analysis will make it possible to sketch the perceived image of the emotional-modal complex “aggression”. Communicative behavior is gender identified [11, 12, 22, 24, 25, 32]. According to the fact that prosody also indicates social and national identities [6, 18, 21], the gender factor can be considered as influential on prosodic behavior. For example, Darania and Darani [13] argue that, “the paralinguistic cue of gender can play an influential role in same-sex and cross-sex talks especially in societies where men and women are viewed differently” [13, p. 427]. That’s why prosodic parameters are usually used in various approaches to solve author gender identification issues (see, for example [33]).
2 Method and Procedure The perceptual-auditory method (analysis) supposes assessment of the spoken language materialized in physical form of specially selected recordings by means of special questionnaires. The analyzed subject (speech samples) must reflect selected parameters which are investigated. The methodology assumes generalization of the test evaluations obtained from the homogeneous group, the interpretation of the identified patterns and trends, and validation of the data for statistical significance [15, p. 85]. The purpose of the research was to identify differential prosodic features of the “male aggressor” and “female aggressor” images reconstructed by male and female listeners in the course of the perceptual-auditory experiment. The group of listeners consisted of 55 individuals (42 females and 13 males) – native Russian speakers aged 19 to 23. The experimental material consisted of a dataset of authentic monologue speech samples (N = 40) of speakers perceived as “aggressors”: male Russian native speakers (n = 10); female Russian native speakers (n = 10); male native speakers of Castilian Spanish (n = 10); female native speakers of Castilian Spanish (n = 10). The experimental dataset was constructed at the previous stage of the research (see [27, 28]). The procedure involved 32 Russian native listeners – graduates of the Moscow State Linguistic University, advanced Spanish speakers. The subjects characterized the selected speech stimuli as samples with aggressive speech behavior (98%). All the listeners (N = 55) gave written consent to participate in the experiment before the experiment started. They were willing to stop the experiment at any time they wished. The experimental task consisted of the auditory test. After listening to each speech sample as many times as required, the listeners were asked to answer questions of special questionnaires (as described in [29]). Those tasks were performed strictly individually. In the experiment the listeners played the role of bystanders passively perceiving verbal realization of an aggressive act.
3 Results
On the basis of the data obtained during the perceptual-auditory analysis in this research, the following perceived images of male and female "aggressors" were identified (Table 1).
Table 1. Perceptual-auditory aggressor images (average normalized evaluations).
Parameter               Feature      Female aggressor      Male aggressor
                                     Russian   Spanish     Russian   Spanish
Voice pitch             low          6,6       5,1         12        20
                        medium       31        28          37        27
                        high         19        22          7,3       7,3
Voice intensity         weak         1,6       2           3,2       3,8
                        moderate     27        31          27        28
                        strong       27        22          26        23
Voice timbre (by pair)  clear        30        28          20        10
                        muffled      13        12          20        26
                        limpid       20        19          20        9,3
                        hoarse       14        12          17        24
                        sing-song    7,2       10          7,2       4,5
                        sharp        31        22          31        27
                        soft         7,4       11          5,6       5
                        rough        23        13          26        23
                        pleasant     7,5       8,5         9,3       6,4
                        unpleasant   19        15          19        17
Speech melodic pattern  smooth       9,4       14          11        5,7
                        irregular    43        38          43        45
                        monotonous   3,1       2,1         1,8       3,3
Speech tempo            slow         9,3       0,8         4,6       0,7
                        moderate     26        17          34        17
                        fast         21        37          17        37
Speech rhythm           clear        36        36          33        30
                        irregular    20        18          23        24
Speech pauses           short        33        38          29        39
                        medium       14        11          18        11
                        long         2,1       1,2         3,4       1,6
Speech breathing        normal       22        34          27        20
                        irregular    24        16          19        23
                        discomfort   11        5,3         9,8       12
All listeners reported that the speech of all Russian and Spanish subjects (informants) was perceived as if they were externalizing an offensive type of aggression during the aggression escalation period. The Russian female aggressor image is characterized by strong negative emotions such as rage and hatred mixed with such emotional-modal states as indignation and
grievance experienced by the subjects (informants). Their speech was characterized by an irregular melodic pattern (sharp changes at the level of melodic registers, with alternating peaks and recessions), a clear rhythm and a moderate-average speech tempo. The voice pitch values are medium; the voice intensity changes from moderate to strong. Speech pauses are of minimum duration (short pauses). Speech breathing is irregular. The voice timbre is marked only with negative nuances (muffled, hoarse, sharp, rough, gruff, and unpleasant voice). The Russian male aggressor image is characterized by strong negative emotions of anger and rage in combination with the negative emotional-modal state of anxiety experienced by the subjects. Their speech is also characterized by an irregular melodic pattern, clear rhythm, moderate-average speech tempo, short pauses and normal breathing. Dynamic features of the voice do not deviate from the average values. The voice timbre is marked only with negative nuances (muffled, hoarse, sharp, rough, and unpleasant voice). The Spanish female aggressor image is described as radiating a strong negative emotion of anger and the emotional-modal states of indignation and grievance. Dynamic features of the voice do not deviate from the average values. The melodic pattern is perceived as irregular, the rhythm as clear, the speech tempo as fast, pauses as short, and speech breathing as normal. The voice timbre is marked only with negative nuances (sharp, rough, unpleasant voice). The Spanish male aggressor image has similar characteristics: the subjects (informants) are perceived as experiencing strong negative emotions of anger and rage in combination with the negative emotional-modal state of grievance. Dynamic voice features do not deviate from the average values. The melodic pattern is perceived as irregular, the rhythm as clear, the speech tempo as fast, pauses as short, and speech breathing as irregular. The voice timbre is marked only with negative nuances (muffled, hoarse, sharp, rough, and unpleasant voice). The listeners assessed the Russian and Spanish male and female aggressor images at 5–6 points on the 10-point scale, which correlated with the escalation period on the conflict development scale [10]. The analyzed speech samples could therefore be placed between two extremes: (1) the transition from the inception of the conflict to escalation, and (2) the transition from escalation to the peak of aggression. These findings differ from what we obtained in a previous analysis of British and American English samples with the same experimental procedure: under the same experimental conditions, the British recordings were marked as a transition from the inception of the conflict to escalation (escalation features prevailing), and the American recordings as a transition from escalation to the peak of conflict interaction (features of the conflict peak prevailing) [26, pp. 150–151]. The Page test and the Wilcoxon signed-rank test were used to measure the reliability of the revealed tendencies in the listeners' evaluations for each parameter and language separately, and the Page test was also conducted to measure the reliability of the revealed differences between the evaluations of female and male subjects. All measures were statistically valid (q ≤ 0,05). Table 1 shows the average normalized evaluations.
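The statistical procedure mentioned above can be reproduced with standard tools. The sketch below is only schematic: the ratings are synthetic stand-ins, and the data layout (listeners as rows, ordered conditions as columns) is an assumption, not the authors' actual design.

```python
import numpy as np
from scipy.stats import page_trend_test, wilcoxon

rng = np.random.default_rng(0)

# Synthetic stand-in evaluations: one row per listener, one column per
# ordered condition (e.g., hypothesized increasing degree of aggression).
female_subjects = rng.uniform(1, 10, size=(20, 4))
male_subjects = rng.uniform(1, 10, size=(20, 4))

# Page's trend test: is there a monotonic trend across the ordered conditions?
page = page_trend_test(female_subjects)
print(f"Page L = {page.statistic:.1f}, p = {page.pvalue:.4f}")

# Wilcoxon signed-rank test: paired comparison of the evaluations given to
# female vs. male subjects by the same listeners.
w_stat, w_p = wilcoxon(female_subjects.mean(axis=1), male_subjects.mean(axis=1))
print(f"Wilcoxon W = {w_stat:.1f}, p = {w_p:.4f}")
```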
If we take the trend indicators as estimated by the mixed listeners’ group (N = 55) and compare them with the evaluations given by the females’ group (n = 42) and the
males' group (n = 13), and then compare the evaluations of the two gender groups with each other, it is possible to reveal deviations from the mixed-group norm correlated with the factor "gender of the recipient". The qualitative analysis of the prevailing tendencies (Table 2) revealed that, under the studied conditions, female listeners estimated the male subjects' (informants') voice intensity in the aggressor image as stronger than the female subjects' (informants') voice intensity. Male listeners found the female voice pitch in the aggressor image higher than the male aggressor voice pitch. Male and female listeners alike perceived the Spanish subjects' (informants') speech tempo as faster than the Russian subjects' (informants') speech tempo. Female listeners attributed a clear speech rhythm to the perceptual-auditory aggressor image, while male listeners perceived the speech rhythm as irregular (with no gender difference of the speaker).
Table 2. Results of the qualitative analysis of the perceptual-auditory evaluations of prevailing tendencies.
Language  Listeners         Subjects         Voice pitch   Voice intensity  Speech tempo  Speech rhythm  Speech pauses  Speech breathing
Russian   Female listeners  Female subjects  Medium        Moderate         Moderate      Clear          Short          Irregular
Russian   Female listeners  Male subjects    Medium        Strong           Moderate      Clear          Short          Normal
Russian   Male listeners    Female subjects  Medium, high  Moderate         Moderate      Irregular      Short          Normal
Russian   Male listeners    Male subjects    Medium        Moderate         Moderate      Irregular      Medium         Normal
Spanish   Female listeners  Female subjects  Medium        Moderate         Fast          Clear          Short          Normal
Spanish   Female listeners  Male subjects    Low           Strong           Fast          Clear          Short          Irregular
Spanish   Male listeners    Female subjects  High          Moderate         Fast          Irregular      Short          Normal
Spanish   Male listeners    Male subjects    Medium        Moderate         Fast          Irregular      Short          Normal
4 Conclusions and Discussion
One can assume that a clear speech rhythm (in both men and women) signals the speaker's confidence in what he / she is saying, the desire to convey the meaning of the message to the recipient in full, and the intended influence of his / her speech, which is in alignment with the intention of an offensive type of aggression to impose the perpetrator's opinion / point of view on the listener.
For all speech samples, female and male listeners reported the melodic pattern as being perceived as irregular, regardless of whether it concerned female or male informant speech, in Russian or in Spanish, which indicates a state of emotional imbalance and the presence of so-called "mixed" feelings [16, p. 123]. The presence of speech pauses of minimum duration indicates the dominant status of the speaker, who does not give the listener a chance to express his / her opinion and to consider his / her answer. In the described communication conditions, this manner of speaking makes the listener focus all his / her attention on the perceived information and remain in the position of an object (possibly, a "victim of aggression"). Speech breathing in the perceived aggressor images is mainly characterized as normal, which indicates a relatively balanced emotional state [17, 37] and correlates with the evaluation of the awareness (and therefore the intention) of the speaker's aggressive actions. Irregular breathing may indicate forcing the air through the narrowed larynx, which, in turn, is also regarded as a "gesture of aggressiveness / negative axiological evaluation" (by S.V. Kodzasov). The voice in the perceptual-auditory aggressor image is characterized by the dominance of negative timbral nuances (hollowness, hoarseness, sharpness, roughness, unpleasantness). According to Potapova and Potapov, a husky, hoarse phonation type may signal deeply felt emotions in many cultures. A sharp drop of the voice pitch in the presence of creaky phonation in the communication of Russian and Spanish male subjects (informants) signals a possible intention to humiliate the communication partner, to hurt and discredit him / her in the eyes of others [30, p. 297]. When considering prosodic techniques of expressive interaction and their functions, Kodzasov points out that a gravel voice is a gesture of negative evaluation [14, p. 196] of the communicative situation / speech message / interlocutor. Presumably, the gender peculiarities in the female and male aggressor images are explained primarily by the stereotypical expectation of aggressive behavior from men rather than from women, which, in turn, leads to a greater perceived expressiveness of negative timbral nuances of the voice and a tendency to perceive the voice as stronger in the perceptual-auditory male aggressor image. The same pattern is likely to appear in relation to gender-specific perception of the speaker by a listener of the opposite sex. Female listeners describe the male aggressor image using prosodic features that create the perceived image of a stronger subject who is more confident in his speech behavior.
Acknowledgements. The research is carried out with the support of the Russian Science Foundation (RSF) as part of the project № 18-18-00477.
References 1. Allan, K.: The pragmeme of insult and some allopracts. In: Allan, K., Capone, A., Kecskes, I. (eds.) Pragmemes and Theories of Language Use, vol. 9. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-43491-9_4 2. Allison, K.R., Bussey, K.: Cyber-bystanding in context: a review of the literature on witnesses’ responses to cyberbullying. Child. Youth Serv. Rev. 65, 183–194 (2016). https:// doi.org/10.1016/j.childyouth.2016.03.026
3. Berrocal, M.: ‘Victim playing’ as a form of verbal aggression in the Czech parliament. J. Lang. Aggress. Confl. 5(1), 81–107 (2017). https://doi.org/10.1075/jlac.5.1.04ber 4. Cabrera, O.A., Adler, A.B., Bliese, P.D.: Growth mixture modeling of post-combat aggression: application to soldiers deployed to Iraq. Psychiatry Res. 246, 539–544 (2016). https://doi.org/10.1016/j.psychres.2016.10.035 5. Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception – behavior link and social interaction. J. Pers. Soc. Psychol. 76(6), 893–910 (1999). https://doi.org/10.1037/0022-3514. 76.6.893 6. Coates, J.: Women, Men and Language: A Sociolinguistic Account of Sex Differences in Language. Longman, London (1993) 7. Enikolopov, S.N., Kuznecova, Y.M., Chudova, N.V.: Aggression in Everyday Live [Agressiya v obydennoj zhizni]. Politicheskaya ehnciklopediya, Moscow (2014). (in Russian) 8. Frederiksen, K.S., Waldemar, G.: Aggression, agitation, hyperactivity, and irritability. In: Verdelho, A., Gonçalves-Pereira, M. (eds.) Neuropsychiatric Symptoms of Cognitive Impairment and Dementia. NSND, pp. 199–236. Springer, Cham (2017). https://doi.org/10. 1007/978-3-319-39138-0_9 9. Gerritsen, C., van Breda, W.R.J.: Simulation-based prediction and analysis of collective emotional states. In: Meiselwitz, G. (ed.) SCSM 2015. LNCS, vol. 9182, pp. 118–126. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20367-6_13 10. Glasl, F.: Selbsthilfe in Konflikten. Konzepte, Übungen, praktische Methoden. Freies Geistesleben, Stuttgart (2002). (in German) 11. Göçtüa, R., Kir, M.: Gender studies in English, Turkish and Georgian languages in terms of grammatical, semantic and pragmatic levels. Procedia Soc. Behav. Sci. 158, 282–287 (2014) 12. Goroshko, O.: Differentiation in male and female speech style. Open Society Institute, Budapest (1999). (in Russian) 13. Darania, L.H., Darani, H.H.: Language and gender: a prosodic study of Iranian’s talks. Procedia Soc. Behav. Sci. 70, 423–429 (2013). https://doi.org/10.1016/j.sbspro.2013.01.080 14. Kodzasov, S.V.: Research in Russian prosody. LRC Publishing House, Moscow (2009). (in Russian) 15. Komalova, L.R.: Aggressogen discourse: The multilingual aggression verbalization typology. Sputnik+, Moscow (2017). http://elibrary.ru/item.asp?id=28993951. (in Russian) 16. Komalova, L.R.: The auditory-perceptual profile (image) of an aggressor. Vestnik Mosc. State Linguist. Univ. 7(746), 116–126 (2016). http://libranet.linguanet.ru/prk/Vest/746-7n. pdf. (in Russian) 17. Krivnova, O.F.: Speech breathing factor in the intonational-pausal speech articulation. In: Vinogradov, V.A. (ed.) Linguistic Polyphony, pp. 424–444. LRC Publishing House, Moscow (2007). (in Russian) 18. Labov, W.: The interaction of sex and social class in the course of linguistic change. Lang. Cariation Change 2(2), 205–254 (1990) 19. LaMotte, A.D., et al.: Sleep problems and physical pain as moderators of the relationship between PTSD symptoms and aggression in returning veterans. Psychol. Trauma Theor. Res. Pract. Policy 9(1), 113–116 (2017). https://doi.org/10.1037/tra0000178 20. Mathes, B.M., Portero, A.K., Gibby, B.A., King, S.L., Raines, A.M., Schmidt, N.B.: Interpersonal trauma and hoarding: the mediating role of aggression. J. Affect. Disord. 227, 512–516 (2018). https://doi.org/10.1016/j.jad.2017.11.062
21. Milroy, L., Milroy, L.: Mechanisms of change in urban dialects: the role of class, social network and gender. Int. J. Appl. Linguist. 3(1), 57–77 (1997) 22. Murashova, L.P., Pravikova, L.V.: Critical analysis of gender studies in western linguistics. Language and Culture 1(33), 33–42 (2016). https://doi.org/10.17223/19996195/33/3, https:// elibrary.ru/item.asp?id=25693295. (in Russian) 23. Nau, Ch., Stewart, C.O.: Effects of verbal aggression and party identification bias on perceptions of political speakers. J. Lang. Soc. Psychol. 33(5), 526–536 (2014). https://doi. org/10.1177/0261927x13512486 24. Khalida, N., Sholpan, Z., Bauyrzhan, B., Ainash, B.: Language and gender in political discourse (Mass media interviews). Procedia Soc. Behav. Sci. 70, 417–422 (2013). 10.1016/j.sbspro.2013.01.079 25. Potapov, V.: Multilevel strategy in linguistic gendorology. Voprosy Jazykoznanija (Top. Stud. Lang.) 1, 103–130 (2002). http://www.ruslang.ru/doc/voprosy/voprosy2002-1. pdf. (in Russian) 26. Potapov, V., Potapova, R., Komalova, L.: The perceived speech prosodic image of “aggressor”: dialog communication gender features. In: Masalóva, S., Polyakov, V., Solovyev, V. (eds.) Cognitive Modeling: The V International Forum on Cognitive Modeling. Part 1: Cognitive Modeling in Linguistics: Proceedings of the XVIII International Conference «Cognitive Modeling in Linguistics. CML-2017», pp. 147–154. Science and Studies Foundation (2017). https://elibrary.ru/item.asp?id=32559206 27. Potapova, R., Komalova, L.: Auditory-perceptual recognition of the emotional state of aggression. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 89–95. Springer, Cham (2015). https://doi.org/10.1007/978-3-31923132-7_11 28. Potapova, R.K., Komalova, L.R.: Gender-based perception of verbal realization of the emotional state of aggression. Human being: image and essence. Humanit. Aspects 26, 169– 180 (2015). https://elibrary.ru/item.asp?id=25224664. (In Russian) 29. Potapova, R., Potapov, V.: Kommunikative Sprechtaetigkeit: Russland u. Deutchland im Vergleich. Boehlau Verlag, Koeln; Weimar; Wien (2011). (In German) 30. Potapova, R.K., Potapov, V.V.: Speech Communication: From the Sound to the Utterance. LRC Publishing House, Moscow (2012). (in Russian) 31. Potapova, R.K., Potapov, V.V., Lebedeva, N.N., Agibalova, T.V.: Interdisciplinarity in Researching of Speech Polyinformativity. LRC Publishing House, Moscow (2015). (in Russian) 32. Samara, A., Smith, K., Brown, H., Wonnacott, E.: Acquiring variation in an artificial language: children and adults are sensitive to socially conditioned linguistic variation. Cogn. Psychol. 94, 85–114 (2017). https://doi.org/10.1016/j.cogpsych.2017.02.004 33. Sboev, A., Moloshnikov, I., Gudovshikh, D., Selivanov, A., Rybka, R., Litvinova, T.: Automatic gender identification of author of Russian text by machine learning and neural net algorithms in case of gender deception. Procedia Comput. Sci. 123, 417–423 (2018). https://doi.org/10.1016/j.procs.2018.01.064 34. Smokowski, P.R., Guo, S.Y., Evans, C.B.R., Wu, Q., Rose, R.A., Bacallao, M., Cotter, K.L.: Risk and protective factors across multiple microsystems associated with internalizing symptoms and aggressive behavior in rural adolescents: modeling longitudinal trajectories from the rural adaptation project. Am. J. Orthopsychiatr. 87(1), 94–108 (2017). https://doi. org/10.1037/ort0000163
35. Urben, S., Habersaat, S., Pihet, S., Suter, M., de Ridder, J., Stephan, P.: Specific contributions of age of onset, callous-unemotional traits and impulsivity to reactive and proactive aggression in youths with conduct disorders. Psychiatr. Q. 89(1), 1–10 (2018). https://doi.org/10.1007/s11126-017-9506-y 36. Zapolski, T.C.B., Banks, D.E., Lau, K.S.L., Aalsma, M.C.: Perceived police injustice, moral disengagement, and aggression among juvenile offenders: utilizing the general strain theory model. Child Psychiatr. Hum. Dev. 49(2), 290–297 (2018). https://doi.org/10.1007/s10578017-0750-z 37. Zlatoustova, L.V.: Some comments on the speech breathing. In: Zvegintsev, V.A. (ed.) Studies on the Speech Information, vol. 2, Moscow (1968). (in Russian)
Main Determinants of the Acmeologic Personality Profiling
Rodmonga Potapova1(✉) and Vsevolod Potapov2
1 Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University, Ostozhenka 38, Moscow 119034, Russia
[email protected]
2 Faculty of Philology, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow 119991, Russia
[email protected]
Abstract. This investigation aims at establishing a set of voice and speech personality identification features that predict some phonation and articulation gestures in regard to lexical-semantic, phonological, anthropometrical, acoustical, physiological, psychological, emotional and intellectual peculiarities of the "electronic personality" on the Internet and other automated digital communication means and devices. This problem is very significant for forensic investigations in the domain of speech communication. It is proposed to undertake special studies to find solutions to these problems in the field of forensic phonetics. In forensic application of the "electronic personality", it is necessary to be able to specify a temporal dynamics factor for a decision concerning acmeologic quantitative and qualitative changes of the personality in time.
Keywords: Social-Network Discourse · Relevant personality features · Forensic emotional personality profiling in dynamics · Acmeologic personality profiling · Perceptual-auditory analysis
1 Introduction
The development of the social-network discourse (SND) investigations on the Internet involves studying the mechanism of dependence between the acoustic prosodic-semantic interpretation of the speech utterance by the speaker and processing of the discourse construction by the listener considering such factors as: the cognitive-verbal base of the communicants' idiosyncratic peculiarities; the multimodal (verbal, paraverbal, non-verbal, extra-verbal) structure of coding (stimulus generation) and decoding (reaction to stimulus) of communication process items by the communicants [10, 11, 15–18]; the multi-level (phonological-phonetic, syntactic-semantic and pragmalinguistic) structure of verbal coding (speech stimulus) and decoding (speech reaction to the stimulus) of the process by the communicants; paraverbal (emotional, emotional-modal and connotative) components of the speech stimulus and speech reaction to this stimulus (in the communication act); extraverbal (situational, individual – idiosyncratic, idiolectal [6], sociolectal, etc.) constituents of the speech stimulus and speech reaction
to the utterance taking into account the role of presupposition, as well as the recipient's previous experience in a particular subject area [15]. Experimental research and modeling of acmeologic variability of spoken social network discourse (SND) forms an important direction of modern communicative variantology [23], and forensic sciences require information about the multimodal individual dynamics of personalities or personality communities [9, 18]. Of particular importance is the above direction in connection with the SND functioning in the information and communication space of the Internet [9, 15, 16, 18]. Analysis of the deep mechanism of the prosodic-semantic variability of the verbal response to the stimulus between communicants within the spoken discourse in respect to the SND requires knowledge in various fields of speechology (spoken language sciences): in general, private and experimental phonetics, cognitive and communicative linguistics, speech acoustics, auditory perception, mathematical statistics, forensic linguistics, etc. [7]. The solution of the problem taking into account the multiversatile variability of the analyzed object includes the following: search for the pronouncing invariant and variants of the prosodic-semantic interpretation of the speech stimulus-utterance in the SND; determining the interaction of the various factors listed above in the process of SND construction; the degree of influence of the above factors on the final verbal product of the SND in the communication act; identification of the prosodic-semantic dominant within the SND; determining the acceptable range of variation of prosodic-semantic variforms – alloprosodosemants; determining speaker profiling, verification and identification; investigation of the variability of the speech and voice characteristics of the communicants, personality dynamics of the individual "portrait" in time with regard to the acmeologic method [15, 16, 18]. The research on the prosodic-semantic variability of multilevel verbal, para-, non- and extraverbal components of spoken discourse involves various modern methods of analysis, synthesis and modeling of sounding speech: acoustic, perceptual-auditory, associative, prosodic-semantic [7]. Acmeologic profiling of communicants on the Internet and in other automated means of communication includes first of all interdisciplinary researches: "detailed phonetic and linguistic description of the verbal behavior of an… individual, …careful analysis of dialectal and sociolectal features, speech defects, age, "voice quality," … a combination of traditional phonetic analysis, techniques, including analytical listening by a phonetician, and modern signal processing techniques…" [4: 80-99].
2 Conceptual Background of the Personality Profiling
Speech activities in the format of spoken social-network discourse (SND) – in particular, based on various modern IP-telephony facilities on the Internet, – can be presented taking into account the following level-by-level components: incentive level: external impact; motive; intent; communicative intention; formation level: the sense-forming phase; deep formation of the space-concept scheme; time (linear) development of the spatial-conceptual scheme of the utterance; formulation level: formulating phrase (choice of words); process of grammatical structuring; realization level: articulatory gestures (articulation); voice modulation (phonation); coarticulation transformations;
acoustic level: transformation of articulatory gestures at the output of the speech-formation system into a sound (acoustic) wave; auditory detection, auditory control and recognition of perceived acoustic stimuli; interpretation level: transformation of acoustic stimuli into verbal images, semantic content realization [9, 10, 12]. In accordance with the expanded understanding of the object of research in speechology, the following techniques and methods can be mentioned: cognitive-communicative analysis of the text; indirect checking of models and hypotheses, for example, by studying speech errors, linguistic reactions, etc.; neurophysiological methods; bioelectric methods; registration and analysis of articulation, for example, using computed tomography, etc. [9]. The study of the realization and monitoring of motor programs should be related to the information processing system in the central and peripheral nervous systems. It is likely that in the central nervous system there is no functional center that would specialize in processing verbal information exclusively. Neural networks processing verbal information also include all functions.
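For readers who prefer a compact overview, the level-by-level components listed at the beginning of this section can be summarized as an ordered structure; the sketch below is purely illustrative, and the identifier names are ours, not the authors'.

```python
from enum import Enum, auto

class SpeechActivityLevel(Enum):
    """Levels of spoken SND production/perception, in processing order."""
    INCENTIVE = auto()       # external impact, motive, intent, communicative intention
    FORMATION = auto()       # sense-forming phase, space-concept scheme of the utterance
    FORMULATION = auto()     # choice of words, grammatical structuring
    REALIZATION = auto()     # articulation, phonation, coarticulation transformations
    ACOUSTIC = auto()        # articulatory gestures transformed into the sound wave
    INTERPRETATION = auto()  # acoustic stimuli decoded into verbal images and meaning

pipeline = list(SpeechActivityLevel)  # order from intent to interpretation
```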
3 SND Communication Analysis Considering Human Speech Functions
In investigations [8, 13–16, 18], the concept of SND was first substantiated based on its definition as a special electronic macro-polylogue with regard to a number of categories of form, content and functional weight. An example is one of the form categories on the basis of the "univector – polyvector" opposition. This opposition is correlated not only on the basis of location of communicative interaction vectors on the Internet, but also on the SND participants' interaction configuration, which is directly dependent on the number of communicants on the Internet. When examining the speech behavior of the speaker in the SND format using, for example, IP-telephony, we proceed from the following postulate: human speech is both a symptom and a signal in relation to the real world: a symptom as a direct psychophysiological response to external stimuli and a signal as a sign language response of neuropsychological nature to stimuli of a more complex behavioral level in the communication act [11]. In this regard, speech is presented as a poly-informative and multifunctional phenomenon. The development of this issue assumes special importance in the study of spoken-speech communication with the help of IP-telephony: to solve a number of problems of forensic examination. "… The latter circumstance naturally led to the fact that phonograms of conversations by the channels of cellular communication and the Internet became objects of investigation for experts in forensic examination of sound recordings" [5: 129, 20]. The pronunciation of the speaker includes a set of specific properties of this individual manifested in formation of the sound flow in the speech apparatus and conditioned by the peculiarities of its structure, the features of the pronunciation-auditory skills, the specifics of thinking, and the formulation of thoughts with the help of linguistic means [10–12]. The speech "portrait" of the speaker includes verbal, paraverbal, non-verbal and extraverbal features. Verbal components refer to such aspects as the language used in the communication process (native, non-native, dialect, vernacular, sociolect, etc.)
For each speaker, an inventory of stable phonetic features is characteristic: pronouncing variants of phonemes, variants of intonemes, etc. Verbal speech features make it possible to determine such components of the speech portrait as nationality, places of the speaker's long residence, level of education, social status, economic status, upbringing, level of language proficiency, profession, level of intellectual skills, etc. It is thought that extraverbal features correspond with anthropometric (structure of the speech apparatus, body weight, height) [4], physiological (gender, age, norm/pathology), psychological (type of higher nervous activity (HNA) [22], emotional-volitional regulation), intellectual (specific thinking, cognitive level) aspects. Accordingly, it is possible to distinguish relatively stable speech extraverbal features in the speaker's speech portrait. Both verbal and extraverbal features have their own acoustic correlates that make it possible to recreate the "portrait" of the speaker. For example, gender and age can be characterized by some acoustic parameters [1]. There are various data for native Russian speakers. According to observations [12], the average value of the pitch frequency dynamics for males aged from 20 to 80 years increases (≈ 100–130 Hz). For females aged 20 to 80 years, the reverse difference in value is observed (≈ 220–180 Hz) [3, 10, 13–18, 21]. Proceeding from the basic premise, according to which the human speech is individually organized on the basis of phonation and articulatory gestures in direct connection with the socially-conditioned phonological representation of the utterance and its lexical and semantic features, it is proposed to conduct an express-analysis of the speaker's speech portrait taking into account the following stages: formation of the databases for correlates of anthropometric features; acoustic correlates of physiological features; acoustic correlates of psychological and emotional-psychological features; acoustic correlates of intellectual features [13–15, 17, 18]. Thus, the acoustic-linguistic algorithm of the speaker identification analysis is constructed taking into account the following stages: acoustic; anatomical-physiological aimed at decoding of the speech signal; socio-psychological aimed at decoding of the speech signal; intellectual-semantic decoded for the speech signal. In this regard, all the tasks can be conditionally characterized as tasks of compiling an individual portrait of the speaker, to which phonation (voice), articulatory segment (motor), prosodic (suprasegment) correlates of the speaker's speech should be attributed. Speech characteristics of the speaker are divided into controlled (external) and uncontrolled (internal) ones. Some experts identify potentially controlled features. The degree of control depends on two factors [2, 11]: the speaker's ability to use auditory and proprioceptive forms of feedback in the implementation of the articulatory program; from his/her perceptual ability to use auditory forms of information to detect auditory differences. Therefore, information about the speaker is hidden in the speech signal, is correlated with his/her anatomical features and is stored at the neuronal level by the muscular speech patterns correlating with the speaker's physique [2, 3, 8].
4 Preliminary Results of the Investigation
When developing expert methods for speaker profiling by speech on the Internet, the following conditions for the speech signal realization are taken into account: speech
should be natural and be varied as much as possible relative to the speakers (interspeaker discrepancies), but rather homogeneous relative to each speaker (intraspeaker discrepancies); at the initial stage of development, the speech should not be influenced by noise, interference, etc., and should include special characteristics of transmission along the technical path; no distortion of the voice is allowed [20]. Particularly informative for speaker attribution by speech is the range of the pitch frequency (ΔF0), which includes, first of all, such parameters as the pitch frequency range width (ΔF0) and its register (very high, high, medium, lower medium, low, very low), which correlates with the following individual characteristics of the speaker: biological differentiation by gender, age, physique; and psychological differences in the speaker's behavior; idiosyncratic (individual) features at the biological, psychological and regional-social levels [6, 8, 10, 14, 17, 18, 21]. Individual features of the speaker are traditionally divided into two groups: acquired and non-acquired. Acquired features include such specific speech features that are formed under the influence of the external conditions of the speaker's life. Among the latter is primarily the process of language acquisition, and then its application in spoken and written communication. In this case, a special role is played by the dialect used by the immediate environment of the individual, especially when, during the phase of speech acquisition, which corresponds approximately to the time of schooling (age up to 18), the speaker lived in various dialectal societies. This includes the social conditions that define the so-called sociolect. The acquired features also include speech features resulting from various harmful factors, for example, smoking, alcohol and drug intoxication [1, 19]. Non-acquired features are correlated with organic-genetic data based on the anatomical and neurophysiological components of the speech apparatus. The latter include the size and spatial configuration (the so-called cavitary configuration) of the neck-laryngeal, nasal and pharyngeal tracts, the mobility and size of the tongue, and in particular the number of boundary conditions depending on the voice formation (the term of mathematics), as well as age and gender. The pitch frequency can vary depending on such factors as loud speaking (for example, in a state of excitement, in noisy conditions (Lombard effect), etc.) In these cases, the pitch frequency changes upwards, and this should be taken into account when describing the speaker. At certain stages of mental illness, the voice can be not only lower, but also much more monotonous (for example, in a state of depression in manic-depressive patients). In speaker attribution by voice, along with the above characteristics, of great importance is information on the voice quality. In this case, features specific to the speaker are found. First of all, one should mention such a qualitative attribute as hoarseness. Here most informative is not this feature in itself, but rather its distribution in the speech flow: this phenomenon can occur where the voice for purely linguistic reasons is lowered, i.e. at the end of sentences and other syntactic or semantic units. In a number of speakers with low voices or voice pathology (for example, due to inflammation of the larynx, a tumor or nodes in the larynx, etc.), this symptom may appear in various other positions of the speech flow [8].
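As an illustration of how the pitch-range parameters described above might be operationalized, the sketch below computes the range width ΔF0 and assigns one of the register labels from the text; the numeric register boundaries are assumptions chosen for the example, not values given by the authors.

```python
def f0_register(f0_values_hz):
    """Classify a speaker's pitch register from F0 samples (Hz).

    The register labels follow the text; the numeric boundaries below are
    illustrative assumptions only.
    """
    f0_mean = sum(f0_values_hz) / len(f0_values_hz)
    delta_f0 = max(f0_values_hz) - min(f0_values_hz)  # pitch frequency range width
    boundaries = [
        (120, "very low"), (150, "low"), (180, "lower medium"),
        (220, "medium"), (300, "high"),
    ]
    register = next((name for limit, name in boundaries if f0_mean < limit),
                    "very high")
    return register, delta_f0

print(f0_register([95, 110, 130, 160]))  # -> ('low', 65) for a male-like sample
```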
In the speaker attribution process, the rate of speech formation is also informative. The average speech rate for all languages is about 4.5–5 syllables per second. Extreme values are 3.2–7.5 syllables. Higher rate
leads to incomplete articulation or complete loss of sounds, syllables and even whole words. As an example, the following requirements should be given that characterize the speaker's portrait by voice and speech: physical: gender, age, height, weight; civil status: parents, their mother tongue, origin, social status, etc.; linguistic: native/non-native, literary/non-literary, regional/dialectal language; educational: length of study (primary, secondary, higher, etc.); geographical: place of long-term residence (if there are some, then indicate periods of residence); professional: work by profession/not by profession; auditory: state of hearing, presence/absence of pathology; medical: chronic/non-chronic diseases; voice: trained, singing, smoking, stressful, etc. voice; musical: musical information, etc.; hobby: sports profile, musical profile, etc. Thus, in the SND the number of communicants' characteristics determined by acoustic data in IP telephony, can include the following characteristics: social: by level of education; social status; sphere of activity; physical characteristics; emotional characteristics; regional characteristics: place of birth; place of long residence; nationality; additional information; psychological characteristics: mental pathologies; HNA type; character traits; types of intoxication (alcohol, drug); pronouncing characteristics: spontaneous speech; quasi-spontaneous speech; prepared/unprepared reading of text; emotional characteristics: positive, negative, etc. [1, 3, 4, 6, 9, 10, 15, 19, 21, 22]. As an example of personality profiling on the Internet is presented a speaker voice sample recorded hourly and analyzed by expert listeners on the basis of special instructions. It was found that the dynamics of prosodic features (pitch, tempo, and loudness variations) is a reliable diagnostic tool of acmeologic speaker profiling with regard to emotional state changes of this personality from a normal emotional state to agitation, anger, fury, etc. This experiment deals with speakers' emotional state acmeologic profiling during hourly communication on the Internet by means of perceptual-auditory analysis (step by step every ten minutes). The sentences used in the experiment were taken from speech communication dialogues on the Internet: a group of communicants (n = 10), male voices, speakers of 18–25 years old; a group of professional listeners (n = 10). The sentences were taken from a pre-election campaign debate in Russia on the Internet. The listeners were asked to evaluate the pitch, tempo, and loudness dynamics of all voice stimuli. The responses across all the stimuli examples are summarized as mean data. The Figs. 1, 2 and 3 show the dynamics of mean pitch, loudness, and speech rate data during one-hour recording. The experiment involved the perceptual-auditory evaluation of such voice features of the listener as: pitch (very low, low, lower medium, medium, high, very high); speech rate (very slow/slower, slow, slowed, moderate, fast, very fast); loudness (subaudible, very slow, low, middle loud, loud, very loud). All conversations data were recorded, copied to a CD and sent to listeners with instructions to define pitch, speech rate, and loudness characteristics of every subject.
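The requirements listed above amount to a structured speaker profile; a minimal sketch of such a record is given below, with field names paraphrasing the listed attributes (the selection and naming are illustrative assumptions, not a schema used by the authors).

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerPortrait:
    """Illustrative container for the portrait attributes listed above."""
    gender: str = ""
    age: int = 0
    height_cm: int = 0
    weight_kg: int = 0
    native_language: str = ""
    language_variety: str = ""          # literary / non-literary, regional / dialectal
    education_years: int = 0
    long_term_residence: list = field(default_factory=list)
    profession: str = ""
    hearing_pathology: bool = False
    chronic_diseases: list = field(default_factory=list)
    voice_notes: str = ""               # trained, singing, smoking, stressful, ...
    hobbies: list = field(default_factory=list)

portrait = SpeakerPortrait(gender="male", age=24, native_language="Russian")
```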
Fig. 1. Mean perceptual-auditory data of the pitch evaluation and data scope zones (ω = F0max – F0min) regarding hourly dynamics of the speech characterization features (in ten-minute steps).
Fig. 2. Mean perceptual-auditory data of the speech rate evaluation and data scope zones (ω = tmax – tmin) regarding hourly dynamics of the speech characterization features (in ten-minute steps).
Fig. 3. Mean perceptual-auditory data of the loudness evaluation and data scope zones (ω = Imax – Imin) regarding hourly dynamics of the speech characterization features (in ten-minute steps).
It can be concluded that the acoustic stimuli taken every ten minutes provided enough information to distinguish speakers whose speech showed no aggressive behavior dynamics from those whose speech did. It is known that pitch, intensity, and speaking rate, as well as the corresponding perceptual-auditory features, are affected, e.g., by aggressive emotions. The acoustic correlates of this aggressive behavior pose some challenges in defining the real emotional state, and the acoustic measurements could not always be reliably interpreted with regard to the acoustic speech signals. The optimization of the experimental method lies in combining perceptual-auditory and acoustic analysis on the basis of fundamental interdisciplinary speech research, with regard to acmeologic personality profiling on the Internet. As an example of an emotional personality profiling vector, a speaker voice sample is presented which was recorded during one hour and analyzed by expert listeners on the basis of special instructions. It was found that the dynamics of prosodic features (pitch, speech rate, and loudness variations) is a robust diagnostic tool for acmeologic speaker profiling with regard to changes in the personality's emotional state in connection with psychological, social, physical and other factors.
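The quantities plotted in Figs. 1, 2 and 3 (mean perceptual ratings and scope zones ω = max − min per ten-minute step) can be computed as sketched below; the ratings here are hypothetical stand-ins on an ordinal scale, not the study's data.

```python
from statistics import mean

# Hypothetical perceptual ratings (e.g., 1 = very low ... 6 = very high),
# one list of listener ratings per ten-minute step of the recording.
ratings_per_step = {
    0: [3, 3, 4, 2], 10: [4, 4, 3, 4], 20: [4, 5, 5, 4],
    30: [5, 5, 4, 5], 40: [5, 6, 5, 5], 50: [6, 5, 6, 6],
}

def profile_dynamics(ratings_per_step):
    """Mean rating and scope zone (omega = max - min) per ten-minute step."""
    return {
        minute: {"mean": mean(vals), "omega": max(vals) - min(vals)}
        for minute, vals in ratings_per_step.items()
    }

for minute, stats in profile_dynamics(ratings_per_step).items():
    print(f"{minute:2d} min: mean = {stats['mean']:.2f}, omega = {stats['omega']}")
```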
5 Conclusion
Thus, the study of the speech variability process on the Internet is a task of immense complexity connected, on the one hand, with the articulatory-acoustic specifics of spoken speech and its perceptual-auditory and acoustic characteristics, and, on the other hand,
with the specifics of constructing any utterance taking into account the prosodic-semantic variability of the speech product itself. At the same time, the transmission of a high-quality speech signal with regard to IP telephony (more precisely, Voice over IP (VoIP), etc.), due to the specificity of encoding, compression and packaging of the speech signal into IP packets, may be a kind of obstacle to the successful solution of the task [5]. An analog voice signal digitized by the PCM method and compressed by codecs to eliminate redundancy undoubtedly undergoes certain changes at the output. As the results of the preliminary study [20] have shown, the prospect of using special software for establishing acoustic and perceptual-auditory equivalence with some degree of probability is quite promising for solving the problems of "electronic personality" profiling in the Internet information and communication environment. The above-described correlations between the SND characteristics of speakers on the Internet and their speech reactions to communication stimuli in Internet dynamics make it possible to undertake further research in the field of acmeologic personality profiling on the Internet and other speech communication transmission devices.
Acknowledgements. This research is supported by the Russian Science Foundation, Project № 18-18-00477.
References 1. Braun, A.: Sprechstimmlage und regionale Umgangssprache. In: Braun A. (ed.). Beitraege zu Linguistik und Phonetik. Festschrift fuer Ioachim Goeschel zum 70 Geburtstag, pp. 453– 463. Stuttgart (2001) 2. Brown, R.: Auditory Speaker Recognition. Helmut Buske Verlag, Hamburg (1987) 3. Brown, W.S., Morris, R., Hollien, H., Howell, H.F.: Speaking fundamental frequency characteristics as a function of age and professional singing. J. Voice 3, 310–313 (1991) 4. Kuenzel, H.J.: How well does average fundamental frequency correlate with speaker height and weight? Phonetica 46, 117–125 (1989) 5. Mikhailov, V.G.: Features of the formation and analysis of voice signals transmitted by means of IP-telephony. Theor. Pract. Forensic Exam. 3(7), 129–140 (2007). (in Russia) 6. Oksaar, E.: Idiolekt als Grundlage der variationsorientierten Linguistik. Sociolinguistica 14, 37–41 (2000) 7. Potapova, R.K.: Linguistic and paralinguistic functions of prosody (On the experience of searching for prosodo-semanthema). In: Kedrova, G.E., Potapov, V.V. (eds.) Language and Speech: Problems and Solutions, pp. 117–137. MAKS Press, Moscow (2004) (in Russia) 8. Potapova, R.K.: Some observations on artificially modified speech. In: Ideas and Methods of Experimental Speech Study. I.P. Pavlov Institute of Physiology (Russian Academy of Sciences), State University, Sankt-Petersburg, pp. 124–135 (2008). (in Russia) 9. Potapova, R.K.: Speech: Communication, Information, Cybernetics, 4th edn. Book house “Librocom”, Moscow (2015). (in Russia) 10. Potapova, R.K., Potapov, V.V.: Language, Speech, Personality. Languages of Slavic Culture, Moscow (2006). (in Russia) 11. Potapova, R., Potapov, V.: Kommunikative Sprechtaetigkeit. Russland und Deutschland im Vergleich. Boehlau Verlag, Koeln (2011) 12. Potapova, R.K., Potapov, V.V.: Speech Communication: From Sound to Utterance. Languages of Slavic Cultures, Moscow (2012). (in Russia)
13. Potapova, R., Potapov, V.: Auditory and visual recognition of emotional behaviour of foreign language subjects (by native and non-native speakers). In: Železný, M., Habernal, I., Ronzhin, A. (eds.) SPECOM 2013. LNCS (LNAI), vol. 8113, pp. 62–69. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-01931-4_9 14. Potapova, R., Potapov, V.: Cognitive mechanism of semantic content decoding of spoken discourse in noise. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS (LNAI), vol. 9319, pp. 153–160. Springer, Cham (2015). https://doi.org/10.1007/9783-319-23132-7_19 15. Potapova, R., Potapov, V.: On individual polyinformativity of speech and voice regarding speakers auditive attribution (forensic phonetic aspect). In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 507–514. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_61 16. Potapova, R., Potapov, V.: Polybasic attribution of social network discourse. In: Ronzhin, A., Potapova, R., Németh, G. (eds.) SPECOM 2016. LNCS (LNAI), vol. 9811, pp. 539–546. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43958-7_65 17. Potapova, R., Potapov, V.: Cognitive entropy in the perceptual-auditory evaluation of emotional modal states of foreign language communication partner. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 253–261. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-66429-3_24 18. Potapova, R., Potapov, V.: Human as acmeologic entity in social network discourse (multidimensional approach). In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNCS (LNAI), vol. 10458, pp. 407–416. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-66429-3_40 19. Potapova, R.K., Potapov, V.V., Lebedeva, N.N., Agibalova, T.V.: Interdisciplinarity in the Study of Speech Polyinformativity. Languages of Slavic Culture, Moscow (2015). (in Russian) 20. Potapova, R., Sobakin, A., Maslov, A.: On the possibility of the skype channel speaker identification (on the basis of acoustic parameters). In: Ronzhin, A., Potapova, R., Delic, V. (eds.) SPECOM 2014. LNCS (LNAI), vol. 8773, pp. 329–336. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11581-8_41 21. Ryan, E.B., Capadano, I., Harry, L.: Age perceptions and evaluative reactions toward adult speakers. J. Gerontol. 33, 98–102 (1978) 22. Sharp, D.: Personality Types: Jung’s Model of Typology. Inner City Books, Toronto (1987) 23. Titscher, S., Meyer, M., Vetter, E., Wodak, R.: Methods of Text and Discourse Analysis. Sage, London (2000)
Studying Mutual Phonetic Influence with a Web-Based Spoken Dialogue System
Eran Raveh1,2(B), Ingmar Steiner1,2,3, Iona Gessinger1,2, and Bernd Möbius1
1 Language Science and Technology, Saarland University, Saarbrücken, Germany
2 Multimodal Computing and Interaction, Saarland University, Saarbrücken, Germany
[email protected]
3 German Research Center for Artificial Intelligence (DFKI GmbH), Saarbrücken, Germany
Abstract. This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user's speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.
Keywords: Spoken dialogue systems · Phonetic convergence · Human-computer interfaces
1 Introduction
With expanding research on, and growing use of, spoken dialogue systems (SDSs), a main challenge in the development of human-computer interaction (HCI) systems of this kind is making them as close as possible to human-human interaction (HHI) in terms of naturalness, fluency, and efficiency. One aspect of such HHIs is the relationship of mutual influences between the interlocutors. Influence here means changes in one interlocutor's conversational behavior triggered by the behavior of the other interlocutor. We refer to changes that make the interlocutors' behaviors more similar as convergence. Convergence can
occur in different modalities and with respect to various aspects of the conversation, like eye gaze, gestures, lexical choices, body language, and more. In this paper, we concentrate on phonetic-level influences, i.e., phonetic convergence. More specifically, we examine pronunciation variations over the course of HCIs. As speech is the principal modality used for interacting with SDSs, we believe it is an especially important modality to study in the field of HCI. Simulating and triggering convergence on the phonetic level, as found in HHI, may contribute a lot to the naturalness of dialogues of humans with computers. SDSs with such personalized speech style are expected to offer more natural and efficient interactions, and move one more step away from the interface metaphor [5] toward the human metaphor [3]. The novel system introduced in Sect. 3 tracks the states of segment-level phonetic features during the dialogue. All of the analyses are automated and run in real time. This not only saves time and manual work typically needed in convergence studies, but also makes the system more suitable for integration into other applications. In Sect. 4, we use this newly introduced system with recordings collected as part of a shadowing experiment to examine the relationship of mutual influences between a (simulated) user and the system. Using these signals, the system provides both visual and numerical evidence of the mutual influences between the interlocutors over the course of the interaction. The system itself will be made freely available under an open-source license.
2 Background and Related Work
Integrating support for changes in the speech signal into computer systems may enhance HCI and provide improved tools for studying convergence in HCI. [18] discusses the advantages of systems that dynamically adapt their speech output to that of the user, and the challenges involved in developing and using these systems.
2.1 Phonetic Convergence
According to [19], phonetic convergence is defined as an increase in segmental and suprasegmental similarity between two interlocutors (e.g., [27]). In contrast to entrainment, we use the term convergence to describe dynamic, mutual, and non-imposing changes. Phonetic convergence has been found to various extent in conversational settings [13]. There is evidence for phonetic convergence being both an internal mechanism [21] and socially motivated [9]. Previous studies of phonetic convergence in spontaneous dyadic conversations have focused on speech rate [26], timing-related phenomena [23], pitch [8], intensity [12], and perceived attractiveness [16]. Phonetic convergence is often examined in the scope of shadowing experiments, in which the participants are asked to produce certain utterances after hearing them produced in some stimuli (e.g., [7]). This is typically done with single target words embedded in a carrier sentence. The experiment showcasing our system in Sect. 4 uses whole sentences as stimuli, in which the target features are embedded, making it a semi-conversational HCI setting.
2.2 Adaptive Spoken Dialogue Systems
Various studies have investigated entrainment and priming in SDSs, aiming to better understand HCI dynamics and improve task-completion performance. [15], for example, focused on dynamic entrainment and adaptation on the lexical level. Others, like [17], concentrated on word frequency. [20] examined changes in both lexical choice and word frequency. While these studies addressed the changes in experimental, scripted scenarios, the theoretical foundations for studying these changes in spontaneous dialogue exist as well [2]. [6] provide examples of online adaptation for dialogue policies and belief tracking. It is important to note that while all of the studies mentioned above examine various aspects of dialogues, none of those are related to speech – the primary modality used to interact with SDSs. Studying convergence of speech in an HCI context is made possible with more natural synthesis technology, which gives fine-grained control over parameters of the system’s spoken output. Many systems that deal with adaptation of speech-related features focus on prosodic characteristics like intonation or speech rate. [10] sheds light on acoustic-prosodic entrainment in both HHI and HCI via the use of interactive avatars. [1] found that users’ speech rate can be manipulated using a simulated SDS. Similar results were found when intensity changes in children’s interaction with synthesized text-to-speech (TTS) output were examined [4]. All of the above provide solid ground for further investigation of phonetic convergence in HCI using SDSs.
3 System
The system introduced here is an end-to-end, web-based SDS with a focus on phonetic convergence and its analysis over the course of the interaction. Besides placing convergence in the spotlight, it is designed to be flexible and to meet the researcher’s needs by offering a wide range of customizations (see Sect. 3.2). Its online access via a web browser makes it scalable and simple for the end-user to operate. The system’s architecture and functionality are described in Sect. 3.1, its graphical user interface (GUI) and operation in Sect. 3.3, and an example of its utilization is demonstrated in Sect. 4. Ultimately, it offers an experimentation platform for studying phonetic convergence, with emphasis on the following: Temporal analysis offering real-time visualization of the interlocutors’ relations with respect to selected phonetic features over the course of the interaction. Customizability allowing the user to experiment with different scenarios by configuring parameters and definitions in many of the system’s components. Online scalability connecting multiple web clients to a server, allowing users to use it anywhere without preceding installation and configurations, and helping experimenters to collect and replay acquired data.
3.1 Architecture
As the system aims to offer a customizable playground for experimenting and studying phonetic convergence in HCI, a key aspect of its architecture is the separation between client-side, server-side, and external resources (see Fig. 1). All of the resources and configuration files needed for designing the interaction are located on the server. Running the client and server on different machines allows users to interact with the system using a web browser alone.
Fig. 1. An overview of the system architecture. The background colors distinguish client components, server components, and external resources that can be customized. (Color figure online)
Fig. 2. The architecture of the dialogue system component. The ASP module (dashed line) between the ASR and TTS modules is responsible for performing additional speech processing required for analyzing the phonetic changes. Though additional links between the ASP module and other modules (like NLG for example) could be made, those are beyond the scope of this work.
As shown in Fig. 2, the dialogue system component consists of typical SDS modules such as natural language understanding (NLU) and a dialogue manager (DM), but also contains an additional speech processing (ASP) module [24]. This module is responsible for processing the audio and extracting the features required by the convergence model. While the NLU component uses merely the transcription provided by the ASR, the ASP module analyzes the speech signal
itself. More specifically, it tracks occurrences of the defined features and passes their measured values to the convergence model, which, in turn, forwards the tracked feature parameters to the TTS synthesis component.
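The following Python sketch illustrates, under assumed interfaces, how an ASP-like module could sit between recognition and synthesis. FeatureMeasurement, add_exemplar, current_targets, and set_feature_targets are hypothetical names invented for this sketch, not the system's actual API.

from dataclasses import dataclass

@dataclass
class FeatureMeasurement:
    feature: str      # e.g. "E: vs. e:"
    turn: int
    value: float      # e.g. a formant value in Hz

class ASPModule:
    def __init__(self, convergence_model, tts):
        self.model = convergence_model
        self.tts = tts

    def on_user_turn(self, audio, segments, turn):
        """Called with the raw audio and the ASR segment alignment."""
        for seg in segments:
            value = self.extract_feature_value(audio, seg)  # acoustic analysis
            if value is not None:
                self.model.add_exemplar(
                    FeatureMeasurement(seg.feature, turn, value))  # update pool
        # Forward the (possibly shifted) feature targets to the synthesizer.
        self.tts.set_feature_targets(self.model.current_targets())

    def extract_feature_value(self, audio, seg):
        # Placeholder: formant or duration measurement for the segment.
        ...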
3.2 Models and Customizations
The computational model for phonetic convergence used in the system is described in [25]. Different phonetic convergence behavioral patterns that were observed in HHI and HCI experiments can be simulated by combinations of the model's parameters presented in Table 1. All of the parameters can be modified in the system's configuration file.

Table 1. Summary of the computational model's parameters in their order of application in the convergence pipeline. Parameters marked with an asterisk '*' are defined for each feature independently.

Parameter            Description
allowed range*       allowed value range for new instances
history size         maximum number of exemplars in the pool
update frequency     frequency at which the feature's value is recalculated
calculation method*  method used to calculate the pool value
convergence rate     weight given to the pool value when recalculating
convergence limit*   maximum degree of convergence allowed
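As an illustration only, a configuration entry for a single feature could look like the following Python mapping. The key names mirror Table 1, but the values and the syntax are invented for this sketch and do not reproduce the system's actual configuration file.

# Hypothetical per-feature configuration, keys mirroring Table 1.
FEATURE_CONFIG = {
    "E:_vs_e:": {
        "allowed_range": (1800, 2600),  # Hz (e.g., F2); plausible-value filter
        "history_size": 50,             # max exemplars kept in the pool
        "update_frequency": 5,          # recalculate after every 5 new instances
        "calculation_method": "mean",   # how the pool value is computed
        "convergence_rate": 0.3,        # weight of the pool value when recalculating
        "convergence_limit": 0.8,       # max degree of convergence allowed
    }
}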
The entire convergence process is based on the tracked phonetic features that are considered "convergeable", i.e., prone to variation, and is triggered whenever the ASR component detects a segment containing a phoneme associated with one or more of these features. Each feature is defined by a key-value map in which the parameters from Table 1 are configured. A classifier can be associated with each feature to provide real-time predictions of both the user's and the system's realizations of that feature, as demonstrated in Fig. 3. With this information available, more meaningful insights can be gained into the dynamics of phonetic changes in the dialogue.

The dialogue domain is specified in an XML-based file; more details on the domain file format can be found in [14]. This format makes it easy to define new scenarios for the system, such as a task-specific dialogue, a general-purpose chat, or an experimental setup.

Speech processing is another central aspect of the system. Different models can be plugged in, e.g., to improve performance, to change the language of the ASR module, or to change the output voice of the TTS module.
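A minimal sketch of how such a feature definition could drive an exemplar pool is shown below. It follows the mechanism described in the text (bounded history, periodic recalculation, weighted update, convergence limit) but is an illustration under assumed names, not the published model of [25].

from collections import deque
from statistics import mean, median

class FeaturePool:
    def __init__(self, cfg, initial_value):
        self.cfg = cfg
        self.initial = initial_value          # system's own starting realization
        self.current = initial_value          # value currently sent to the TTS
        self.pool = deque(maxlen=cfg["history_size"])
        self.seen = 0

    def add_exemplar(self, value):
        lo, hi = self.cfg["allowed_range"]
        if not (lo <= value <= hi):           # discard implausible measurements
            return
        self.pool.append(value)
        self.seen += 1
        if self.seen % self.cfg["update_frequency"] == 0:
            self.update()

    def update(self):
        calc = mean if self.cfg["calculation_method"] == "mean" else median
        pool_value = calc(self.pool)
        rate = self.cfg["convergence_rate"]
        candidate = (1 - rate) * self.current + rate * pool_value
        # Never move further from the initial value than the convergence limit
        # allows (a fraction of the distance toward the pool value).
        limit = self.cfg["convergence_limit"] * (pool_value - self.initial)
        if abs(candidate - self.initial) > abs(limit):
            candidate = self.initial + limit
        self.current = candidate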
3.3 Graphical User Interface
The system’s GUI consists of three main areas:
Fig. 3. A screenshot of the plot area showing the states of the feature [E:] vs. [e:] (in 2-dimensional formant space) during an interaction. The system’s internal convergence model (orange, bottom right) gradually adapts to the user’s (blue, upper left) detected realizations. A prediction of the feature’s current realization is given for both interlocutors. The annotation box marks the turn in which the system has aggregated enough evidence from the user’s utterances and changes its pronunciation from [E:] (its initial state) to [e:] (the user’s preferred variation). (Color figure online)
In the chat area, the interaction between the user and the system is shown in a chat-like representation. Each turn's utterance appears inside a chat bubble, with different colors and orientations for the user and the system. The turns are also numbered, to better track the dialogue progress and the analysis shown by the plots in the graph area. It is also possible to replay the utterance of a turn by clicking the "Play" button in its corresponding bubble.

In the interaction area, the user can interact with the system via written or spoken input. Text-based interactions progress through the dialogue (if applicable) and trigger any subsequent domain model, but do not affect the convergence-related models, since there is no audio input to process. Spoken input can be provided either by speaking into the microphone or via audio files with pre-recorded speech. The latter option is especially useful for simulating specific user input, or for reproducing a previous experiment, as done in Sect. 4.

In the graph area, each of the tracked features is visualized in a separate plot, and new data points are added whenever a new instance of the feature is detected. Hovering over a data point in a graph reveals additional information, such as the turn in which it was added, or the realized variant of the feature produced in that turn as predicted by its classifier. These dynamic, interactive plots make it possible to shed light on how the interlocutors influence each other, whether or not they are aware of it, throughout their exchanges. Figure 3 shows such a graph with several accumulated data points.
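The kind of per-feature plot shown in the graph area can be approximated offline with a few lines of matplotlib. The sketch below assumes (turn, F1, F2) tuples as input and is illustrative only, not the GUI's actual plotting code.

import matplotlib.pyplot as plt

def plot_feature(user_points, system_points, annotated_turn=None):
    """Each argument is a list of (turn, F1, F2) tuples for one feature."""
    fig, ax = plt.subplots()
    ax.scatter([p[2] for p in user_points], [p[1] for p in user_points],
               label="user", color="tab:blue")
    ax.scatter([p[2] for p in system_points], [p[1] for p in system_points],
               label="system", color="tab:orange")
    for turn, f1, f2 in user_points + system_points:
        ax.annotate(str(turn), (f2, f1), fontsize=8)   # turn numbers as labels
    if annotated_turn is not None:
        ax.set_title(f"Pronunciation change detected at turn {annotated_turn}")
    ax.set_xlabel("F2 (Hz)")
    ax.set_ylabel("F1 (Hz)")
    ax.invert_xaxis()
    ax.invert_yaxis()   # conventional vowel-chart orientation
    ax.legend()
    return fig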
4 Showcase: Examining Convergence Behaviors
To demonstrate a possible use of the system, we simulated the shadowing experiment detailed in [7] and used the system's analyses to examine the types of convergence behavior participants show with respect to the features examined in the experiment (see Table 2). This experiment is designed to trigger phonetic convergence by confronting the participants with stimuli in which certain phonetic features are realized in a manner different from their own realizations. The simulation was carried out by building a domain file encoding the experimental procedure, including the transitions between the experiment's phases as well as the flow within each phase. This automates the procedure and adapts it to the participant's pace. Participants were simulated by playing back their recorded speech from the original experiment in the same order. Using the system for this purpose results in an automated, reproducible execution, with additional insights such as classification of feature realizations and dynamic visualizations in the GUI. The classifiers were trained offline on the data points acquired from analyzing the stimuli. However, the system also supports incremental, online re-training whenever requested by the user, for example after every update of the convergence model. For the demonstration presented here, a sequential minimal optimization (SMO) [22] implementation of the support vector machine (SVM) classifier was used for training. Each turn's number and prediction are added as an interactive annotation to the dynamic graph of the relevant features, as shown in Fig. 3. Finally, using the system, the experiment is transformed into an automated dialogue scenario, which enhances its HCI nature.
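As a hedged example of such a realization classifier, the scikit-learn pipeline below trains an SVM on hypothetical F1/F2 measurements; scikit-learn's SVC is backed by libsvm's SMO-type solver and stands in here for the SMO implementation cited above [22]. The data values are invented for illustration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Offline training data: hypothetical formant measurements of the stimuli,
# labeled with the produced variant.
X = np.array([[460, 2200], [470, 2150], [580, 1900], [590, 1950]])  # [F1, F2] in Hz
y = np.array(["e:", "e:", "E:", "E:"])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

# Online use: predict the realized variant of a newly detected instance.
print(clf.predict([[475, 2180]]))   # expected: ['e:']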
Table 2. Examples of stimuli sentences, each containing one target feature.

Sentence                                                        Feature
War das Gerät sehr teuer? (Was the device very expensive?)      [E:] vs. [e:] in word-medial ä
Ich bin süchtig nach Schokolade. (I am addicted to chocolate.)  [Iç] vs. [Ik] in word-final -ig
Wir besuchen euch bald wieder. (We will visit you again soon.)  [n̩] vs. [@n] in word-final -en

4.1 Finding Behavioral Patterns
In this section, we focus on the validation of the feature [E:] vs. [e:] as a representative example of the system's phonetic adaptation capability. Although the classified realization is binary ([E:] or [e:]), the underlying representation used by the model is gradual. Both of these views on the feature can be seen in the graph area, as shown in Fig. 3. The degree of convergence was examined per utterance in the shadowing phase of the experiment. Three main groups emerged, each with a different
behavior: one group of participants showing little to no tendency to converge (changes in ≤10% of their utterances), a second group with varying degrees of convergence (10% to 90%), and a third group of participants who were very sensitive to the stimuli's variation (≥90%). We refer to these groups as Low, Mid, and High, respectively. The feature's classifier was determined on the fly, so that the prediction for each utterance was made based on the type of the stimulus to which the participant was listening. As Table 3 shows, the Low and High groups are both of substantial size, indicating that these two distinct behaviors exist in the data and can be spotted by the system. In addition, we validated the separation between these behaviors. To this end, we regarded the shadowing phase as an annotation task in which the annotators are the predictors of the user's and the system's realizations. Note that 100% similarity would mean complete convergence to every stimulus, which cannot be reasonably expected (cf. [7]). The Cohen's kappa (κ) values¹ of the Low group are expected to be the lowest, as a lesser degree of convergence was found among these participants. By the same logic, the High group is expected to have the highest agreement, and the Mid group to have values between the two. Indeed, this hypothesis holds: weak agreement was found in the Low group, strong agreement in the High group, and a value close to 0 (indicating no consistent behavior) for the Mid group.
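The grouping and the agreement check can be sketched as follows, assuming per-participant convergence ratios and per-utterance realization predictions. The thresholds follow the text, while cohen_kappa_score from scikit-learn stands in for the irr R package used in the original analysis; the example data are hypothetical.

from sklearn.metrics import cohen_kappa_score

def assign_group(converged_ratio):
    """converged_ratio: fraction of a participant's shadowing utterances
    whose realization changed toward the stimulus."""
    if converged_ratio <= 0.10:
        return "Low"
    if converged_ratio >= 0.90:
        return "High"
    return "Mid"

def group_agreement(user_labels, system_labels):
    """Treat the predicted user realizations and the stimulus/system
    realizations per utterance as two annotators and compute Cohen's kappa."""
    return cohen_kappa_score(user_labels, system_labels)

# Hypothetical usage for one participant:
user = ["e:", "e:", "E:", "e:", "E:"]      # predicted user realizations
system = ["e:", "e:", "E:", "E:", "E:"]    # stimulus realizations
print(assign_group(0.8), group_agreement(user, system))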
5 Conclusion and Future Work
We have introduced a system with an integrated spoken dialogue system (SDS) that can track and analyze mutual influence on the phonetic level during an interaction, based on an internal convergence model. This combines work done in the fields of phonetic convergence and adaptive SDSs, and contributes to the understanding of power relations between human and computer interlocutors. Many aspects of the system are customizable, which makes it flexible in terms of the scenarios it can support. The system can also run on a separate server, which makes it easier to scale its online use. To showcase its capabilities, we simulated a replication of a shadowing experiment that examined phonetic convergence with respect to certain segment-level phonetic features. Three main user behaviors were found with respect to the participants' tendency to change their pronunciation based on the system's stimulus input. This sheds light on possible relations and dynamics between a user and a system in HCI. Running the experiment in this way not only saved time by automating the annotation and phonetic analysis, but also offered additional insights such as visualization and on-the-fly classification. We believe this shows that phonetic convergence can be studied using our SDS, and that it is a step toward personalized, phonetically aware SDSs, which will enable more natural and efficient interaction.
¹ As calculated by the kappa2 function of the irr R package (v0.84), https://cran.r-project.org/package=irr.
Table 3. A summary of the measures for similarity and agreement between the predictor annotations of user and model productions in the shadowing phase.

Group   Similarity (%)   Agreement (κ)   Size (%)
Low
Mid
High
All