We introduce the Varied Emotion in Syntactically Uniform Speech (VESUS) repository as a new resource for the speech community. VESUS is a lexically controlled database, in which a semantically neutral script is portrayed with different emotional inflections. In total, VESUS contains over 250 distinct phrases, each read by ten actors in five emotional states. We use crowdsourcing to obtain ten human ratings for the perceived emotional content of each utterance. Our unique database construction enables a multitude of scientific and technical explorations. To jumpstart this effort, we provide benchmark performance on three distinct emotion recognition tasks using VESUS: longitudinal speaker analysis, extrapolation across syntactic complexity, and generalization to a new speaker.
The National Speech Corpus (NSC) is the first large-scale Singapore English corpus spearheaded by the Info-communications and Media Development Authority of Singapore. It aims to become an important source of open speech data for automatic speech recognition (ASR) research and speech-related applications. The first release of the corpus features more than 2000 hours of orthographically transcribed read speech data designed with the inclusion of locally relevant words. It is available for public and commercial use upon request at “www.imda.gov.sg/nationalspeechcorpus”, under the Singapore Open Data License. An accompanying lexicon is currently in the works and will be published soon. In addition, another 1000 hours of conversational speech data will be made available in the near future under the second release of NSC. This paper reports on the development and collection process of the read speech and conversational speech corpora.
There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Foundation, presents significant challenges to the speech community. The collection consists of unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech - all still open problems for speech recognition systems. Transcription is challenging even for skilled human annotators. This paper proposes that the community focus on the MALACH corpus to develop speech recognition systems that are more robust with respect to accents, disfluencies and emotional speech. To reduce the barrier to entry, a lexicon and training and testing setups have been created, and baseline results using current deep learning technologies are presented. The metadata has just been released by LDC (LDC2019S11). It is hoped that this resource will enable the community to build on top of these baselines so that the extremely important information in these and related oral histories becomes accessible to a wider audience.
This paper introduces a speech database for analyzing children’s speech. The proposed database is recorded in Kannada (one of the South Indian languages) from children aged 2.5 to 6.5 years. The database is named the National Institute of Technology Karnataka Kids’ Speech Corpus (NITK Kids’ Speech Corpus). The relevant design considerations for the database collection are discussed in detail. The corpus is divided into four age groups, with an interval of 1 year between consecutive groups. It includes nearly 10 hours of speech recordings from 160 children; for each age group, data were recorded from 40 children (20 male and 20 female). Further, the effect of developmental changes on speech from 2.5 to 6.5 years is analyzed using pitch and formant analysis. Some potential applications of the NITK Kids’ Speech Corpus, such as systematic study of children’s language learning ability, phonological process analysis, and children’s speech recognition, are discussed.
We study the problem of evaluating automatic speech recognition (ASR) systems that target dialectal speech input. A major challenge in this case is that the orthography of dialects is typically not standardized. From an ASR evaluation perspective, this means that there is no clear gold standard for the expected output, and several possible outputs could be considered correct according to different human annotators, which makes standard word error rate (WER) inadequate as an evaluation metric. Specifically targeting the case of Arabic dialects, which are also morphologically rich and complex, we propose a number of alternative WER-based metrics that vary in terms of text representation, including different degrees of morphological abstraction and spelling normalization. We evaluate the efficacy of these metrics by comparing their correlation with human judgments on a validation set of 1,000 utterances. Our results show that the use of morphological abstraction and spelling normalization yields metrics with higher correlation with human judgment. We released the code and the datasets to the research community.
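For illustration, a minimal Python sketch of how such a metric can be assembled: a standard word-level edit-distance WER computed after a spelling-normalization step. The normalization rules and function names here are placeholder assumptions, not the paper’s actual metrics.

```python
# Minimal sketch (not the paper's code): WER computed after a hypothetical
# spelling-normalization step, showing how alternative WER-based metrics can
# be built by swapping the text representation.

def normalize(text: str) -> str:
    # Placeholder normalization: collapse a few Arabic spelling variants
    # (alef/taa-marbuta forms); real dialect normalization is much richer.
    table = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه"})
    return text.translate(table).lower()

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def normalized_wer(ref: str, hyp: str) -> float:
    # The metric variant: score the normalized representations instead.
    return wer(normalize(ref), normalize(hyp))
```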
Speech data found in the wild hold many advantages over artificially constructed speech corpora in terms of ecological validity and cultural worth. Perhaps most importantly, there is a lot of it. However, the combination of great quantity, noisiness and variation poses a challenge for its access and processing. Generally speaking, automatic approaches to tackle the problem require good labels for training, while manual approaches require time. In this study, we provide further evidence for a semi-supervised, human-in-the-loop framework that previously has shown promising results for browsing and annotating large quantities of found audio data quickly. The findings of this study show that a 100-hour long subset of the Fearless Steps corpus can be annotated for speech activity in less than 45 minutes, a fraction of the time it would take traditional annotation methods, without a loss in performance.
In this paper, we describe a novel approach for generating unsupervised humor labels using time-aligned user comments, and predicting humor using audio information alone. We collected 241 videos of comedy movies and gameplay videos from one of the largest Chinese video-sharing websites. We generate unsupervised humor labels from laughing comments, and find high agreement between these labels and human annotations. From these unsupervised labels, we build deep learning models using speech and text features, which obtain an AUC of 0.751 in predicting humor on a manually annotated test set. To our knowledge, this is the first study predicting perceived humor in large-scale audio data.
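As a rough illustration of how weak labels of this kind can be derived, the following sketch marks fixed-length audio windows as humorous when enough time-aligned laughing comments fall into them. The laughter markers, window length, and threshold are assumptions for illustration, not the paper’s exact rule.

```python
# Illustrative sketch only: derive weak humor labels for fixed-length audio
# windows from time-aligned comments, assumed to arrive as
# (timestamp_seconds, text) pairs.
LAUGH_MARKERS = ("哈哈", "233", "hhh", "lol")  # assumed marker list

def weak_humor_labels(comments, duration_s, window_s=10.0, min_count=3):
    n_windows = int(duration_s // window_s) + 1
    counts = [0] * n_windows
    for t, text in comments:
        if any(m in text.lower() for m in LAUGH_MARKERS):
            counts[int(t // window_s)] += 1
    # A window is weakly labelled as humorous if enough laughing comments land in it.
    return [1 if c >= min_count else 0 for c in counts]

# Example: three laughing comments around 12-15 s mark that window as humorous.
labels = weak_humor_labels([(12.0, "哈哈哈"), (13.5, "233333"), (15.0, "lol")], 60.0)
```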
We address the problem of predicting the leading political ideology, i.e., left-center-right bias, for YouTube channels of news media. Previous work on the problem has focused exclusively on text and on analysis of the language used, topics discussed, sentiment, and the like. In contrast, here we study videos, which yields an interesting multimodal setup. Starting with gold annotations about the leading political ideology of major world news media from Media Bias/Fact Check, we searched on YouTube to find their corresponding channels, and we downloaded a recent sample of videos from each channel. We crawled more than 1,000 hours of YouTube video along with the corresponding subtitles and metadata, thus producing a new multimodal dataset. We further developed a multimodal deep-learning architecture for the task. Our analysis shows that the use of the acoustic signal helped to improve bias detection by more than 6% absolute over using text and metadata only. We release the dataset to the research community, hoping to help advance the field of multi-modal political bias detection.
Automatic detection of speaker states and traits is made more difficult by intergroup differences in how they are distributed and expressed in speech and language. In this study, we explore various deep learning architectures for incorporating demographic information into the classification task. We find that early and late fusion of demographic information both improve performance on the task of personality recognition, and a multitask learning model, which performs best, also significantly improves deception detection accuracy. Our findings establish a new state-of-the-art for personality recognition and deception detection on the CXD corpus, and suggest new best practices for mitigating intergroup differences to improve speaker state and trait recognition.
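A hedged PyTorch sketch of the two fusion strategies named above: early fusion concatenates demographic features with the acoustic features at the input, while late fusion concatenates them with the learned utterance representation just before the classifier. All dimensions and the single-task head are assumptions, not the CXD models.

```python
# Sketch of early vs. late fusion of demographic information; sizes assumed.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, acoustic_dim=88, demo_dim=4, hidden=64, mode="late"):
        super().__init__()
        self.mode = mode
        in_dim = acoustic_dim + (demo_dim if mode == "early" else 0)
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        out_in = hidden + (demo_dim if mode == "late" else 0)
        self.classifier = nn.Linear(out_in, 2)

    def forward(self, acoustic, demographic):
        # Early fusion: demographics enter before encoding.
        x = torch.cat([acoustic, demographic], -1) if self.mode == "early" else acoustic
        h = self.encoder(x)
        # Late fusion: demographics enter after encoding.
        if self.mode == "late":
            h = torch.cat([h, demographic], -1)
        return self.classifier(h)

early = FusionClassifier(mode="early")
late = FusionClassifier(mode="late")
logits = late(torch.randn(16, 88), torch.randn(16, 4))
```

A multitask variant, as in the best-performing model, would add a second output head (e.g. for deception) sharing the same encoder.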
In this paper, we present an in-depth study on the classification of regional accents in Mandarin speech. Experiments are carried out on Mandarin speech data systematically collected from 15 different geographical regions in China for broad coverage. We explore bidirectional Long Short-Term Memory (bLSTM) networks and i-vectors to model longer-term acoustic context. Starting from the classification of the collected data into the 15 regional accents, we derive a three-class grouping via non-metric multidimensional scaling (NMDS), for which 68.4% average recall can be obtained. Furthermore, we evaluate a state-of-the-art ASR system on the accented data and demonstrate that the character error rate (CER) strongly varies among these accent groups, even if i-vector speaker adaptation is used. Finally, we show that model selection based on the prediction of our bLSTM accent classifier can yield up to 7.6% CER reduction for accented speech.
To detect social signals such as laughter or filler events from audio data, a straightforward choice is to apply a Hidden Markov Model (HMM) in combination with a Deep Neural Network (DNN) that supplies the local class posterior estimates (HMM/DNN hybrid model). However, the posterior estimates of the DNN may be suboptimal due to a mismatch between the cost function used during training (e.g. frame-level cross-entropy) and the actual evaluation metric (e.g. segment-level F1 score). In this study, we show experimentally that by employing a simple posterior probability calibration technique on the DNN outputs, the performance of the HMM/DNN workflow can be significantly improved. Specifically, we apply a linear transformation on the activations of the output layer right before using the softmax function, and fine-tune the parameters of this transformation. Out of the calibration approaches tested, we got the best F1 scores when the posterior calibration process was adjusted so as to maximize the actual HMM-based evaluation metric.
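A minimal PyTorch sketch of this kind of calibration step: a class-wise affine transform applied to the frozen DNN’s output-layer activations before the softmax. The parameterization and training details are assumptions rather than the authors’ exact setup.

```python
# Sketch: affine calibration of output-layer activations before the softmax.
import torch
import torch.nn as nn

class PosteriorCalibration(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        # One scale and one bias per class, initialised to the identity map.
        self.scale = nn.Parameter(torch.ones(n_classes))
        self.bias = nn.Parameter(torch.zeros(n_classes))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Calibrated posteriors handed on to the HMM decoder.
        return torch.softmax(logits * self.scale + self.bias, dim=-1)

# In use, the base DNN stays fixed and only scale/bias are tuned, e.g. so that
# the downstream segment-level F1 on a development set improves.
calib = PosteriorCalibration(n_classes=3)
posteriors = calib(torch.randn(8, 3))  # 8 frames, 3 classes (laughter/filler/other)
```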
Studies of laughter synthesis are relatively few, and they are still at a preliminary stage. We explored the possibility of applying WaveNet to laughter synthesis. WaveNet is potentially more suitable for modelling laughter waveforms, which, unlike speech signals, lack a well-established theory of production. Conversational laughter from a spontaneous dialogue speech corpus was modelled with WaveNet. To obtain more stable laughter generation, conditioning WaveNet on the power contour was proposed. Experimental results showed that laughter synthesized by WaveNet was perceived as closer to natural laughter than HMM-based synthesized laughter.
Human verbal communication is a complex phenomenon involving dynamics that normally result in the alignment of participants on several modalities, and across various linguistic domains. We examined here whether such dynamics occur also for paralinguistic events, in particular, in the case of laughter. Using a conversational corpus containing dyadic interactions in three languages (French, German and Mandarin Chinese), we investigated three measures of alignment: convergence, synchrony and agreement. Support for convergence and synchrony was found in all three languages, although the level of support varied with the language, while the agreement in laughter type was found to be significant for the German data. The implications of these findings towards a better understanding of the role of laughter in human communication are discussed.
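As a hedged illustration of one such measure, synchrony could be operationalised as the correlation between the two speakers’ per-window laughter counts, as in the sketch below; the windowing and the choice of correlation are assumptions for illustration, not the paper’s definitions.

```python
# Sketch: synchrony as Spearman correlation of per-window laughter counts.
import numpy as np
from scipy.stats import spearmanr

def laughter_counts(onsets, duration_s, window_s=30.0):
    # Count laughter onsets (in seconds) per fixed-length window.
    n = int(duration_s // window_s) + 1
    counts = np.zeros(n)
    for t in onsets:
        counts[int(t // window_s)] += 1
    return counts

def synchrony(onsets_a, onsets_b, duration_s):
    a = laughter_counts(onsets_a, duration_s)
    b = laughter_counts(onsets_b, duration_s)
    return spearmanr(a, b).correlation

rho = synchrony([12.1, 48.0, 95.5], [14.0, 50.2, 180.3], duration_s=300.0)
```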
Although laughter research has gained quite some interest over the past few years, a shared description of how to annotate laughter and its sub-units is still missing. We present a first attempt at an annotation scheme that contributes to improving the homogeneity and transparency with which laughter is annotated. This includes the integration of respiratory noises and stretches of speech-laughs and, to a limited extent, smiled speech and short silent intervals. Inter-annotator agreement is assessed while applying the scheme to different corpora in which laughter is evoked through different methods and in varying settings. Annotating laughter becomes more complex when the situation in which laughter occurs becomes more spontaneous and social. There is substantial disagreement among the annotators with respect to temporal alignment (when a unit starts and when it ends) and unit classification, particularly the determination of the starts and ends of laughter episodes. In summary, this detailed laughter annotation study reflects the need for better investigation of the various components of laughter.
The effect of stress on the human body is substantial, potentially resulting in serious health implications. Furthermore, with modern stressors seemingly on the increase, there is an abundance of contributing factors which lead to a diagnosis of acute stress. However, observing biological stress reactions usually requires costly and time-consuming sequential fluid-based samples to determine the degree of biological stress. In contrast, a speech monitoring approach would allow for a non-invasive indication of stress. To evaluate the efficacy of the speech signal as a marker of stress, we explored, for the first time, the relationship between sequential cortisol samples and speech-based features. Utilising a novel corpus of 43 individuals undergoing a standardised Trier Social Stress Test (TSST), we extract a variety of feature sets and observe a correlation between speech and sequential cortisol measurements. For prediction of mean cortisol levels from speech over the entire TSST oral presentation, handcrafted COMPARE features achieve the best result, a root mean square error of 0.244 (on the standardised range [0; 1]), for the sample taken 20 minutes after the TSST. Correlation also increases at minute 20, with a Spearman’s correlation coefficient of 0.421 and a Cohen’s d of 0.883 between the baseline and minute-20 cortisol predictions.
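For reference, the evaluation quantities quoted above (root mean square error, Spearman’s correlation, Cohen’s d) can be computed as in the following sketch; the arrays are placeholder values, not TSST data.

```python
# Sketch of the reported evaluation quantities, on placeholder arrays.
import numpy as np
from scipy.stats import spearmanr

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def cohens_d(a, b):
    # Effect size with a pooled standard deviation (equal-n simplification).
    a, b = np.asarray(a), np.asarray(b)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return float((a.mean() - b.mean()) / pooled)

true_cortisol = np.array([0.31, 0.45, 0.52, 0.40])   # placeholder values
minute20_pred = np.array([0.35, 0.41, 0.57, 0.38])
baseline_pred = np.array([0.22, 0.28, 0.25, 0.30])

print(rmse(true_cortisol, minute20_pred))
print(spearmanr(true_cortisol, minute20_pred).correlation)
print(cohens_d(minute20_pred, baseline_pred))
```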
The ability to discern an individual’s level of sincerity varies from person to person and across cultures. Sincerity is typically a key indicator of personality traits such as trustworthiness, and portraying sincerity can be integral to an abundance of scenarios, e.g., when apologising. Speech signals are one important factor in discerning sincerity and, with more modern interactions occurring remotely, automatic approaches for the recognition of sincerity from speech are beneficial in both interpersonal and professional scenarios. In this study we present details of the Sincere Apology Corpus (Sina-C). Annotated by 22 individuals for their perception of sincerity, Sina-C is an English acted-speech corpus of 32 speakers apologising in multiple ways. To provide an updated baseline for the corpus, various machine learning experiments are conducted. We find that deep data representations extracted from the speech signals (utilising the Deep Spectrum toolkit) are best suited. Classification results on the binary (sincere / not sincere) task are at best 79.2% Unweighted Average Recall, and for regression on the degree of sincerity, a Root Mean Square Error of 0.395 on the standardised range [-1.51; 1.72] is obtained.
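The reported classification measure, Unweighted Average Recall (UAR), is the mean of the per-class recalls; a small sketch using scikit-learn’s macro-averaged recall is shown below, with placeholder labels rather than Sina-C data.

```python
# Sketch: UAR as macro-averaged recall over the two classes.
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = sincere, 0 = not sincere (placeholder labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}")
```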
In this paper, we test whether the perception of filled-pause (FP) frequency and public-speaking performance are mediated by the phonetic characteristics of FPs. In particular, total duration, vowel-formant pattern (if present), and nasal-segment proportion of FPs were correlated with perceptual data from 29 German listeners who rated excerpts of business presentations given by 68 German-speaking managers. Results show strong inter-speaker differences in how, and how often, FPs are realized. Moreover, differences in FP duration and nasal proportion are significantly correlated with estimated (i.e. subjective) FP frequency and perceived speaker performance. The shorter and more nasal a speaker’s FPs are, the more listeners underestimate the speaker’s actual FP frequency and the higher they rate the speaker’s public-speaking performance. The results are discussed in terms of their implications for FP saliency and rhetorical training.
Many features can be extracted from speech signals for different applications such as automatic speech recognition or speaker verification. For pathological speech processing, however, there is a need to extract features about the presence of the disease or the state of the patient that are comprehensible to clinical experts. Phonological posteriors are a group of features that are interpretable by clinicians and at the same time carry suitable information about the patient’s speech. This paper presents a tool to extract phonological posteriors directly from speech signals. The proposed method consists of a bank of parallel bidirectional recurrent neural networks that estimate the posterior probabilities of the occurrence of different phonological classes. The proposed models are able to detect the phonological classes with accuracies over 90%. In addition, the trained models are available to the research community interested in the topic.
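A minimal sketch (not the released tool) of such a bank of detectors: one bidirectional recurrent network per phonological class, each emitting frame-level posteriors for the presence of that class. Layer sizes, the GRU choice, and the feature dimension are assumptions.

```python
# Sketch: a bank of per-class bidirectional recurrent detectors.
import torch
import torch.nn as nn

class PhonologicalDetector(nn.Module):
    def __init__(self, n_features=80, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, x):            # x: (batch, frames, n_features)
        h, _ = self.rnn(x)
        # Frame-level posterior for the presence of this phonological class.
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, frames)

# The "bank" is simply one detector per class, run in parallel on the same input.
classes = ["nasal", "fricative", "voiced", "strident"]
bank = {c: PhonologicalDetector() for c in classes}
frames = torch.randn(1, 200, 80)
posteriors = {c: model(frames) for c, model in bank.items()}
```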
We investigate whether, and if so when, prosodic features in spoken dialogue aid in modeling the importance of words to the overall meaning of a dialogue turn. Starting from the assumption that acoustic-prosodic cues help identify important speech content, we investigate representation architectures that combine lexical and prosodic features and evaluate them for predicting word importance. We propose an attention-based feature fusion strategy and additionally show how the addition of strategic supervision of the attention weights results in especially competitive models. We evaluate our fusion strategy on spoken dialogues and demonstrate performance increases over state-of-the-art models. Specifically, our approach both achieves the lowest root mean square error on test data and generalizes better over out-of-vocabulary words.
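In the spirit of the fusion strategy described above, a hedged PyTorch sketch of attention over per-word lexical and prosodic views is given below; the dimensions and scoring function are assumptions, not the authors’ exact architecture, and the weights alpha are where additional supervision could be applied.

```python
# Sketch: attention-based fusion of per-word lexical and prosodic features.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, lex_dim=300, pros_dim=32, d=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(lex_dim, d), nn.Linear(pros_dim, d)])
        self.score = nn.Linear(d, 1)

    def forward(self, lexical, prosodic):
        # Project both views into a shared space: (batch, words, 2, d).
        views = torch.stack([self.proj[0](lexical), self.proj[1](prosodic)], dim=2)
        # Attention weights over the two views, computed per word.
        alpha = torch.softmax(self.score(torch.tanh(views)), dim=2)
        return (alpha * views).sum(dim=2)   # fused representation: (batch, words, d)

fusion = AttentionFusion()
fused = fusion(torch.randn(4, 20, 300), torch.randn(4, 20, 32))
```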
The attention mechanism plays a crucial role in sequential learning for many speech and language applications. However, it is challenging to develop stochastic attention in a sequence-to-sequence model consisting of two recurrent neural networks (RNNs) as the encoder and decoder. Posterior collapse occurs in variational inference and drives the estimated latent variables close to the standard Gaussian prior, so that information from the input sequence is disregarded during learning. This paper presents a new recurrent autoencoder for sentence representation in which a self-attention scheme is incorporated to activate the interaction between inference and generation during training. In particular, a stochastic RNN decoder is implemented to provide an additional latent variable that fulfills self-attention for sentence reconstruction. Posterior collapse is alleviated, and the latent information is sufficiently attended to in variational sequential learning. During the test phase, the estimated prior distribution of the decoder is sampled for stochastic attention and generation. Experiments on Penn Treebank and Yelp 2013 show desirable generation performance in terms of perplexity. The visualization of attention weights also illustrates the usefulness of self-attention. The evaluation on DUC 2007 demonstrates the merit of the variational recurrent autoencoder for document summarization.
This paper learns multi-modal embeddings from text, audio, and video views/modes of data in order to improve downstream sentiment classification. The experimental framework also allows investigation of the relative contributions of the individual views to the final multi-modal embedding. Individual features derived from the three views are combined into a multi-modal embedding using Deep Canonical Correlation Analysis (DCCA) in two ways: i) One-Step DCCA and ii) Two-Step DCCA. This paper learns text embeddings using BERT, the current state of the art in text encoders. We posit that this highly optimized encoder dominates the contributions of the other views, though each view does contribute to the final result. Classification tasks are carried out on two benchmark data sets and on a new Debate Emotion data set, and together these demonstrate that One-Step DCCA outperforms the current state of the art in learning multi-modal embeddings.
Spoken language understanding (SLU) is a crucial component of virtual personal assistants. It consists of two main tasks: intent detection and slot filling. State-of-the-art deep neural SLU models have demonstrated good performance on benchmark datasets. However, these models suffer from a significant performance drop in practice after deployment, due to the data distribution discrepancy between training data and real user utterances. In this paper, we first propose four research questions that help to understand what state-of-the-art deep neural SLU models actually learn. To answer them, we study vocabulary importance using a novel Embedding Sparse Structure Learning (SparseEmb) approach. It can be applied to various existing deep SLU models to efficiently prune useless words without any additional manual hyperparameter tuning. We evaluate SparseEmb on benchmark datasets using two existing SLU models and answer the proposed research questions. We then use SparseEmb to sanitize the training data based on the selected useless words, together with model re-validation during training. Using both benchmark data and our collected test data, we show that the sanitized training data significantly improves SLU model performance. Both SparseEmb and the training data sanitization approach can be applied to any deep learning-based SLU model.
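As a loose illustration of sparse structure learning on an embedding layer, the sketch below attaches a per-word gate with an L1 penalty, so that words whose gates shrink toward zero can be treated as unimportant; this is an assumption-laden stand-in, not the paper’s SparseEmb algorithm.

```python
# Sketch: per-word gates on an embedding layer with an L1 sparsity penalty.
import torch
import torch.nn as nn

class GatedEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.gate = nn.Parameter(torch.ones(vocab_size))  # one scalar gate per word

    def forward(self, token_ids):
        # Scale each word's embedding by its learned gate.
        return self.emb(token_ids) * self.gate[token_ids].unsqueeze(-1)

    def l1_penalty(self):
        # Encourages gates of uninformative words toward zero.
        return self.gate.abs().sum()

emb = GatedEmbedding(vocab_size=10000, dim=128)
x = emb(torch.tensor([[1, 5, 42]]))
loss = x.pow(2).mean() + 1e-4 * emb.l1_penalty()  # the task loss would replace the first term
```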
Ambitions in artificial intelligence involve machine understanding of human language. The state-of-the-art approach to Spoken Language Understanding is to use an Automatic Speech Recognizer (ASR) to generate transcripts, which are further processed with text-based tools. ASR yields error-prone transcripts, and these errors propagate further down the processing pipeline. Subjective tests show, on the other hand, that humans understand ASR closed captions quite well despite word and punctuation errors. Our goal is to assess and quantify the loss in the semantic space resulting from error propagation, and to analyze error propagation into speech summarization as a special use case. We show that word errors cause a slight shift in the semantic space, well below the average semantic distance between the sentences within a document. We also show that punctuation errors have a higher impact on summarization performance, which suggests that proper sentence-level tokenization is crucial for this task.
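A small sketch of the kind of measurement involved: the semantic shift between a reference sentence and its ASR transcript, expressed as the cosine distance between their sentence embeddings. The encoder here is a placeholder; any sentence-embedding model could be substituted.

```python
# Sketch: semantic shift as cosine distance between sentence embeddings.
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_shift(encode, reference: str, asr_hypothesis: str) -> float:
    # `encode` maps a sentence to a fixed-size vector (assumed interface).
    return cosine_distance(encode(reference), encode(asr_hypothesis))

# Toy hash-based encoder, only to make the sketch self-contained and runnable.
def toy_encode(sentence: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for word in sentence.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

print(semantic_shift(toy_encode, "the meeting starts at noon", "the meeting start at noon"))
```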
Attention-based bidirectional long short-term memory (BiLSTM) models have recently shown promising results in text classification tasks. However, when the amount of training data is restricted, or the distribution of the test data differs considerably from that of the training data, some potentially informative words may be hard to capture during training. In this work, we propose a new method to learn an attention mechanism for domain classification. Unlike past attention mechanisms, which are guided only by the domain tags of the training data, we explore using the latent topics in the data set to learn topic attention and employ it in the BiLSTM. Experiments on the SMP-ECDT benchmark corpus show that the proposed latent topic attention mechanism outperforms state-of-the-art soft and hard attention mechanisms in domain classification. Moreover, experimental results show that the proposed method can be trained with additional unlabeled data, further improving domain classification performance.
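A hedged sketch of what such topic attention might look like: BiLSTM hidden states attended with a query derived from a document’s latent-topic distribution (e.g. from LDA). The dimensions and the single linear query map are assumptions, not the paper’s model.

```python
# Sketch: attention over recurrent hidden states with a topic-derived query.
import torch
import torch.nn as nn

class TopicAttention(nn.Module):
    def __init__(self, hidden_dim, n_topics):
        super().__init__()
        self.query = nn.Linear(n_topics, hidden_dim)

    def forward(self, states, topic_dist):
        # states: (batch, time, hidden_dim); topic_dist: (batch, n_topics)
        q = self.query(topic_dist).unsqueeze(1)              # (batch, 1, hidden)
        scores = (states * q).sum(-1)                        # (batch, time)
        alpha = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (batch, time, 1)
        return (alpha * states).sum(dim=1)                   # sentence vector (batch, hidden)

att = TopicAttention(hidden_dim=256, n_topics=20)
sent_vec = att(torch.randn(2, 30, 256), torch.softmax(torch.randn(2, 20), dim=-1))
```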
In 2018, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). SRE18 was organized in a similar manner to SRE16, focusing on speaker detection over conversational telephony speech (CTS) collected outside North America. SRE18 also featured several new aspects, including two new data domains, namely voice over internet protocol (VoIP) and audio extracted from amateur online videos (AfV), as well as a new language (Tunisian Arabic). A total of 78 organizations (forming 48 teams) from academia and industry participated in SRE18 and submitted 129 valid system outputs under the fixed and open training conditions first introduced in SRE16. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in SRE18. The evaluation results suggest that 1) speaker recognition on AfV was more challenging than on telephony data, 2) speaker representations (aka embeddings) extracted using end-to-end neural network frameworks were most effective, 3) top-performing systems exhibited similar performance, and 4) the greatest performance improvements were largely due to data augmentation, the use of extended and more complex models for data representation, and effective use of the provided development sets.