Automatic speech recognition (ASR) has made great advances on adult speech; however, when models are tested on child speech, word error rates (WER) remain unsatisfactory. This is mainly due to the high variance in the acoustic features of child speech and the lack of clean, labeled corpora. We apply the factored time delay neural network (TDNN-F) to the child speech domain and find that it yields better performance. To enable our models to handle different noise conditions and extremely small corpora, we augment the original training data by adding noise and reverberation. Compared with conventional GMM-HMM and TDNN systems, TDNN-F performs better on two widely accessible corpora, CMU Kids and CSLU Kids, as well as on their combination. Our system achieves a 26% relative improvement in WER.
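The augmentation recipe itself is not spelled out above; as a minimal sketch (not the authors' exact pipeline), additive noise at a target SNR plus convolution with a room impulse response could look as follows. The file names and the 10 dB SNR are hypothetical.

```python
# Minimal sketch of noise + reverberation augmentation (not the authors' exact recipe).
# Assumes hypothetical files "clean.wav", "noise.wav", "rir.wav" at the same sample rate.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def add_noise(clean, noise, snr_db):
    """Mix noise into clean speech at a target SNR (in dB)."""
    noise = np.resize(noise, clean.shape)            # loop/trim noise to match length
    clean_pow = np.mean(clean ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + scale * noise

def add_reverb(clean, rir):
    """Convolve speech with a room impulse response and renormalize the level."""
    wet = fftconvolve(clean, rir)[: len(clean)]
    return wet * (np.max(np.abs(clean)) / (np.max(np.abs(wet)) + 1e-12))

clean, sr = sf.read("clean.wav")
noise, _ = sf.read("noise.wav")
rir, _ = sf.read("rir.wav")
sf.write("augmented.wav", add_noise(add_reverb(clean, rir), noise, snr_db=10), sr)
```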
Accurate automatic speech recognition (ASR) of kindergarten speech is particularly important, as this age group may benefit the most from voice-based educational tools. Due to the lack of young child speech data, kindergarten ASR systems are often trained on older child or adult speech. This study proposes a fundamental frequency (fo)-based normalization technique to reduce the spectral mismatch between kindergarten and older child speech. The technique is based on the tonotopic distances between formants and fo developed to model vowel perception, and the proposed procedure relies only on the computation of the median fo across an utterance. Tonotopic distances for vowel perception were reformulated as a linear relationship between formants and fo to provide an effective approach to frequency normalization. This reformulation was verified by examining the formants and fo of child vowel productions. A 208-word ASR experiment using older child speech for training and kindergarten speech for testing was performed to examine the effectiveness of the proposed technique against piecewise vocal tract length, F3-based, and subglottal resonance normalization techniques. Results suggest that the proposed technique either has performance advantages or requires the computation of fewer parameters.
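The exact linear mapping between formants and fo is not reproduced above; the sketch below only illustrates the general idea of warping the frequency axis of an utterance by a factor derived from its median fo. The reference fo value and the purely multiplicative warp are assumptions for illustration, not the paper's formula.

```python
# Illustrative sketch of fo-driven frequency normalization (not the paper's exact formula).
# REF_FO and the multiplicative warp are assumptions for illustration only.
import numpy as np
import librosa

REF_FO = 220.0  # hypothetical reference fundamental frequency in Hz

def fo_normalize_spectrogram(y, sr):
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
    median_fo = np.nanmedian(f0)                  # median fo across the utterance
    alpha = REF_FO / median_fo                    # warp factor toward the reference
    S = np.abs(librosa.stft(y))                   # magnitude spectrogram (freq x frames)
    freqs = librosa.fft_frequencies(sr=sr)
    warped = np.empty_like(S)
    for t in range(S.shape[1]):
        # a spectral feature at frequency F is moved to alpha * F
        warped[:, t] = np.interp(freqs / alpha, freqs, S[:, t])
    return warped, alpha
```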
This study explores building and improving an automatic speech recognition (ASR) system for children aged 6–9 years and diagnosed with autism spectrum disorder (ASD), language impairment (LI), or both. Working with only 1.5 hours of target data in which children perform the Clinical Evaluation of Language Fundamentals Recalling Sentences task, we apply deep neural network (DNN) weight transfer techniques to adapt a large DNN model trained on the LibriSpeech corpus of adult speech. To begin, we aim to find the best proportional training rates of the DNN layers. Our best configuration yields a 29.38% word error rate (WER). Using this configuration, we explore the effects of quantity and similarity of data augmentation in transfer learning. We augment our training with portions of the OGI Kids’ Corpus, adding 4.6 hours of typically developing speakers from kindergarten through 3rd grade. We find that 2nd grade data alone, which approximately matches the mean age of the target speakers, outperforms other grades and all of the sets combined. Doubling the data for 1st, 2nd, and 3rd grade, we again compare each grade as well as pairs of grades. We find that the combination of 1st and 2nd grade performs best, at a 26.21% WER.
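As an illustration of "proportional training rates of the DNN layers", a hedged PyTorch sketch of per-layer learning-rate groups is shown below; the stand-in architecture and the rate multipliers are hypothetical, not the configuration reported above.

```python
# Sketch of per-layer ("proportional") learning rates when fine-tuning a pretrained
# acoustic model; the architecture and rate values below are hypothetical.
import torch
import torch.nn as nn

pretrained = nn.Sequential(              # stand-in for a DNN trained on adult speech
    nn.Linear(440, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 4000),               # senone outputs
)

base_lr = 1e-3
optimizer = torch.optim.SGD(
    [
        {"params": pretrained[0].parameters(), "lr": base_lr * 0.1},  # earliest layer: smallest rate
        {"params": pretrained[2].parameters(), "lr": base_lr * 0.5},
        {"params": pretrained[4].parameters(), "lr": base_lr},        # output layer adapts fastest
    ],
    momentum=0.9,
)
```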
We investigate the automatic processing of child speech therapy sessions using ultrasound visual biofeedback, with a specific focus on complementing acoustic features with ultrasound images of the tongue for the tasks of speaker diarization and time-alignment of target words. For speaker diarization, we propose an ultrasound-based time-domain signal which we call estimated tongue activity. For word-alignment, we augment an acoustic model with low-dimensional representations of ultrasound images of the tongue, learned by a convolutional neural network. We conduct our experiments using the Ultrasuite repository of ultrasound and speech recordings for child speech therapy sessions. For both tasks, we observe that systems augmented with ultrasound data outperform corresponding systems using only the audio signal.
Use of speech technologies in the classroom is often limited by poor acoustic conditions and other factors that can affect recording quality. We describe MyTurnToRead, an e-book-based app designed to support an interleaved listening and reading experience, where the child takes turns reading aloud with a virtual partner. The child’s reading turns are recorded and processed by an automated speech analysis system in order to provide feedback or track improvement in reading skill. We describe the architecture of the speech processing back-end and evaluate system performance on data collected in several summer camps where children used the app on consumer-grade devices as part of the camp programming. We show that while the quality of the audio recordings varies greatly, our estimates of student oral reading fluency are very good: for example, the correlation between ASR-based and transcription-based estimates of reading fluency at the speaker level is r=0.93. These estimates are also highly correlated with an external measure of reading comprehension.
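For concreteness, a minimal sketch of the kind of speaker-level comparison reported above is shown below, assuming fluency is scored as words correct per minute (WCPM); the function and the example numbers are hypothetical.

```python
# Sketch of a speaker-level fluency comparison: words correct per minute (WCPM)
# from ASR output vs. from manual transcription, then Pearson r. Data are hypothetical.
from scipy.stats import pearsonr

def wcpm(num_words_correct, duration_seconds):
    return 60.0 * num_words_correct / duration_seconds

# (words correct, duration in s) per speaker, from ASR and from human transcription
asr_scores = [wcpm(95, 62.0), wcpm(130, 70.5), wcpm(78, 58.3)]
ref_scores = [wcpm(98, 62.0), wcpm(127, 70.5), wcpm(81, 58.3)]

r, p = pearsonr(asr_scores, ref_scores)
print(f"speaker-level correlation r={r:.2f} (p={p:.3f})")
```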
Vocal quality problems are common in children aged 4 to 12 and may affect their health as well as their social interactions and development. The sustained vowel exercise is widely used by speech and language pathologists for voice recovery and vocal re-education in children. Nonetheless, despite being an important voice exercise, it can be a monotonous and tedious activity for children. Here, we propose a computer therapy game that uses the sustained vowel exercise to motivate children to practice it more often. In addition, the game gives visual feedback on the child’s performance, which helps the child understand how to improve their voice production. The game uses a vowel classification model trained with a support vector machine on Mel frequency cepstral coefficients. A user test with 14 children showed that children achieve longer phonation times with the game than without it, and that the visual feedback helps and motivates them to improve their sustained vowel productions.
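A minimal sketch of the classification component described above (MFCC features fed to a support vector machine) might look as follows; the file names, labels, and SVM hyper-parameters are illustrative assumptions.

```python
# Minimal sketch of a vowel classifier: utterance-level MFCC statistics + an SVM.
# File lists, labels, and hyper-parameters are hypothetical.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # utterance-level stats

train_files = ["a_01.wav", "e_01.wav", "i_01.wav"]   # hypothetical sustained-vowel recordings
train_labels = ["a", "e", "i"]

X = np.stack([mfcc_features(f) for f in train_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, train_labels)
print(clf.predict(np.stack([mfcc_features("test_vowel.wav")])))
```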
Research on ambient assistive technology is concerned with the features humanoid agents should exhibit in order to gain user acceptance. However, different age groups may have different requirements. This paper focuses on preferences for an agent’s voice among elders, young adults, and adolescents. To this aim, 316 users were recruited and organized into seven groups of 45 or 46 subjects: 3 groups of elders (65+ years old), 2 of young adults (aged 22–35 years), and 2 of adolescents (aged 14–16 years). After watching video clips of mute and speaking agents, participants were administered the Virtual Agent Acceptance Questionnaire (VAAQ) to test their preferences in terms of willingness to interact, pragmatic and hedonic qualities, and attractiveness of the proposed speaking and mute agents. In addition, the elders were also tested after listening only to the agents’ voices. The results suggest that voice is crucial for gaining elders’ acceptance of virtual humanoid agents, in contrast to young adults and adolescents, who accept mute and speaking agents equally well.
We analyze the phonetic correlates of petitionary prayer in 22 Christian practitioners. Our aim is to examine whether praying is characterized by prosodic markers of dialogue speech and of expected efficacy. Three similar conditions are compared: 1) requests to God, 2) requests to a human recipient, and 3) requests to an imaginary person. We find that making requests to God is clearly distinguishable from making requests to both human and imaginary interlocutors. Requests to God are, unlike requests to an imaginary person, characterized by markers of dialogue speech (as opposed to monologue speech), including a higher f0 level, a larger f0 range, and a slower speaking rate. In addition, requests to God differ from those made to both human and imaginary persons in markers of expected efficacy on the part of the speaker. These markers are related to more careful speech production, including an almost complete lack of hesitations, more pauses, and a much longer speaking time.
This study explores whether people align to expressive speech spoken by a voice-activated artificially intelligent device (voice-AI), specifically Amazon’s Alexa. Participants shadowed words produced by the Alexa voice in two acoustically distinct conditions, “regular” and “expressive”, the latter containing more exaggerated pitch contours and longer word durations. Another group of participants rated the shadowed items in an AXB perceptual similarity task as an assessment of overall degree of vocal alignment. Results show greater vocal alignment toward expressive speech produced by the Alexa voice and, furthermore, systematic variation based on speaker gender. Overall, these findings have applications to the field of affective computing in understanding human responses to synthesized emotional expressiveness.
Being able to detect topics and speaker stances in conversations is a key requirement for developing spoken language understanding systems that are personalized and adaptive. In this work, we explore how topic-oriented speaker stance is expressed in conversational speech. To do this, we present a new set of topic and stance annotations of the CallHome corpus of spontaneous dialogues. Specifically, we focus on six stances — positivity, certainty, surprise, amusement, interest, and comfort — which are useful for characterizing important aspects of a conversation, such as whether a conversation is going well or not. Based on this, we investigate the use of neural network models for automatically detecting speaker stance from speech in multi-turn, multi-speaker contexts. In particular, we examine how performance changes depending on how input feature representations are constructed and how this is related to dialogue structure. Our experiments show that incorporating both lexical and acoustic features is beneficial for stance detection. However, we observe variation in whether using hierarchical models for encoding lexical and acoustic information improves performance, suggesting that some aspects of speaker stance are expressed more locally than others. Overall, our findings highlight the importance of modelling interaction dynamics and non-lexical content for stance detection.
Human perception and understanding rely on a number of different and complementary cues across modalities, and the various emotional states in human communication are reflected in this variety of cues. Recent developments in multi-modal emotion recognition use deep-learning techniques to achieve remarkable performance, with models based on features suitable for text, audio, and vision. This work focuses on cross-modal fusion techniques over deep learning models for emotion detection from spoken audio and corresponding transcripts. We investigate the use of a long short-term memory (LSTM) recurrent neural network (RNN) with pre-trained word embeddings for text-based emotion recognition and a convolutional neural network (CNN) with utterance-level descriptors for emotion recognition from speech. Various fusion strategies are applied to these models to yield an overall score for each emotional category. Intra-modality dynamics for each emotion are captured in the neural network designed for the specific modality, and fusion techniques are then employed to capture the inter-modality dynamics. Speaker- and session-independent experiments on the IEMOCAP multi-modal emotion detection dataset show the effectiveness of the proposed approaches. This method yields state-of-the-art results for utterance-level emotion recognition based on speech and text.
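One simple fusion strategy consistent with the description above is score-level (late) fusion of the two modality-specific networks. The sketch below uses an MLP over utterance-level descriptors in place of the paper's CNN, and the dimensions and fusion weight are assumptions.

```python
# Sketch of late score-level fusion for speech + text emotion recognition.
# Dimensions, the fusion weight, and the audio branch are simplifying assumptions.
import torch
import torch.nn as nn

NUM_EMOTIONS = 4

class TextBranch(nn.Module):
    def __init__(self, embed_dim=300, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, NUM_EMOTIONS)

    def forward(self, word_embeddings):            # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(word_embeddings)
        return self.out(h[-1])                     # per-emotion scores

class AudioBranch(nn.Module):
    def __init__(self, feat_dim=88, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, NUM_EMOTIONS))

    def forward(self, utterance_descriptors):      # (batch, feat_dim)
        return self.net(utterance_descriptors)

text_branch, audio_branch = TextBranch(), AudioBranch()
words = torch.randn(2, 20, 300)                    # pretrained word embeddings (dummy)
audio = torch.randn(2, 88)                         # utterance-level descriptors (dummy)
fused = 0.5 * text_branch(words) + 0.5 * audio_branch(audio)   # weighted score fusion
print(fused.softmax(dim=-1))
```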
This paper presents a novel 1-D sentiment classifier trained on the benchmark IMDB dataset. The classifier is a 1-D convolutional neural network with repeated convolution and max pooling layers. The main contribution of this work is the demonstration of a deconvolution technique for 1-D convolutional neural networks that is agnostic to specific architecture types. This deconvolution technique enables text classification to be explained, a feature that is important for NLP-based decision support systems, as well as being an invaluable diagnostic tool.
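A minimal sketch of such a classifier (repeated 1-D convolution and max-pooling blocks over word embeddings) is given below; the vocabulary size, sequence length, and layer sizes are illustrative, and the deconvolution-based explanation step is not included.

```python
# Sketch of a 1-D convolutional text classifier with repeated convolution and
# max-pooling blocks; sizes are illustrative choices, not the paper's configuration.
import torch
import torch.nn as nn

class Conv1DSentiment(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128, seq_len=400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.features = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(64 * (seq_len // 4), 1)   # binary sentiment logit

    def forward(self, token_ids):                   # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = Conv1DSentiment()
print(model(torch.randint(0, 20000, (2, 400))).shape)   # torch.Size([2, 1])
```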
Electrodermal activity (EDA) is a psychophysiological indicator that can be considered a somatic marker of subjects’ emotional and attentional reactions towards stimuli such as audiovisual content. EDA measurements are not biased by the cognitive process of giving an opinion or a score to characterize subjective perception, and group-level EDA recordings integrate the reaction of an audience, thus reducing signal noise. This paper contributes to the prediction of audience attention to video content, extending previous work on the use of EDA as ground truth for prediction algorithms. Videos are segmented into shorter clips according to the audience’s increasing or decreasing attention, and the videos’ audio waveforms are processed to extract meaningful aural embeddings from a VGGish model pretrained on the Audioset database. While previous similar work on attention-level prediction using only audio achieved 69.83% accuracy, we propose a Mixture of Experts approach to train a binary classifier that outperforms the main existing state-of-the-art approaches, predicting increasing and decreasing attention levels with 81.76% accuracy. These results confirm the usefulness of acoustic features with semantic significance and the benefit of training experts over partitions of the dataset in order to predict group-level attention from audio.
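One possible realization of the mixture-of-experts idea is sketched below under stated assumptions: the training embeddings are partitioned by clustering, one expert is trained per partition, and test examples are gated softly by cluster affinity. The data, gating scheme, and expert model are stand-ins, not the paper's configuration.

```python
# Illustrative mixture-of-experts binary classifier over audio embeddings
# (e.g., 128-d VGGish vectors); not the paper's exact setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))                    # stand-in for VGGish embeddings
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)   # dummy attention labels

n_experts = 3
gate = KMeans(n_clusters=n_experts, random_state=0, n_init=10).fit(X)
experts = [LogisticRegression(max_iter=1000).fit(X[gate.labels_ == k], y[gate.labels_ == k])
           for k in range(n_experts)]

def predict(X_test):
    # soft gating: weight each expert by the affinity of the sample to its cluster
    d = gate.transform(X_test)                     # distances to cluster centres
    w = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    probs = np.stack([e.predict_proba(X_test)[:, 1] for e in experts], axis=1)
    return (w * probs).sum(axis=1) > 0.5

print(predict(rng.normal(size=(5, 128))))
```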
Code-switching is the alternation between languages within speech or text. It is partially speaker-dependent and domain-related, so fully explaining the phenomenon with linguistic rules is challenging. Compared with most monolingual tasks, code-switching suffers from insufficient data. To mitigate this issue without expensive human annotation, we propose an unsupervised method for code-switching data augmentation. Using a generative adversarial network, we generate intra-sentential code-switching sentences from monolingual sentences. We apply the proposed method to two corpora, and the results show that the generated code-switching sentences improve the performance of code-switching language models.
We describe our efforts to compare two think-aloud protocols for data collection, in preparation for using them as a basis for automatically structuring and labeling a large database of high-dimensional human activity data into a valuable resource for research in cognitive robotics. The envisioned dataset, currently in development, will contain synchronously recorded multimodal data, including audio, video, and biosignals (eye-tracking, motion-tracking, muscle and brain activity), from about 100 participants performing everyday activities while describing their task using think-aloud protocols. This paper provides details of our pilot recordings in the well-established and scalable “table setting scenario,” describes the concurrent and retrospective think-aloud protocols used and the methods used to analyze them, and compares their potential impact on the data collected as well as on the automatic data segmentation and structuring process.
We introduce RadioTalk, a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information. In this paper we summarize why and how we prepared the corpus, give some descriptive statistics on stations, shows and speakers, and carry out a few high-level analyses.
Lectures are typically highly specialised in that they deal with multiple, domain-specific topics. This context is challenging for Automatic Speech Recognition (ASR) systems, since they are sensitive to topic variability. Language Model (LM) adaptation is a commonly used technique to address the mismatch between training and test data. In this paper, we are interested in a qualitative analysis that allows a more relevant comparison of the accuracy of LM adaptation. While word error rate is the most common metric used to evaluate ASR systems, we argue that this metric alone cannot provide sufficiently accurate information. Consequently, we explore other metrics based on individual word error rate, indexability, and the capability of building relevant requests for information retrieval from the ASR outputs. Experiments are carried out on the PASTEL corpus, a new French-language dataset composed of lecture recordings, manual chaptering, manual transcriptions, and slides. While an adapted LM reduces the overall word error rate by 15.62% relative, we show that this reduction reaches 44.2% when computed on relevant words only. These observations are confirmed by the high LM adaptation gains obtained with the indexability and information retrieval metrics.
Natural Language Understanding (NLU) models are typically trained in a supervised learning framework. In the case of intent classification, the predicted labels are predefined by the designed annotation schema, and labeling is a laborious task in which annotators manually inspect each utterance and assign the corresponding label. We propose an Active Annotation (AA) approach that combines an unsupervised learning method in the embedding space, a human-in-the-loop verification process, and linguistic insights to create lexicons with open categories that can be adapted over time. In particular, annotators define the y-label space on the fly during annotation through an iterative process, without the need for prior knowledge about the input data. We evaluate the proposed annotation paradigm in a real-use-case NLU scenario. Results show that our Active Annotation paradigm yields accurate, higher-quality training data, with an annotation speed an order of magnitude higher than the traditional human-only baseline annotation methodology.
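A minimal sketch of the unsupervised step, under the assumption that clustering in an embedding space groups utterances for cluster-level labeling, is shown below; TF-IDF vectors stand in for whatever embedding is actually used, and the utterances are hypothetical.

```python
# Sketch of the unsupervised step of active annotation: cluster utterances in an
# embedding space and let an annotator label clusters instead of single utterances.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [
    "play some jazz music", "put on my workout playlist",
    "what's the weather tomorrow", "will it rain this weekend",
    "set an alarm for 7 am", "wake me up at six",
]

X = TfidfVectorizer().fit_transform(utterances)          # stand-in embedding space
clusters = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

for k in range(3):
    members = [u for u, c in zip(utterances, clusters) if c == k]
    print(f"cluster {k}: {members}")
    # an annotator now inspects each cluster and assigns or refines an intent label on the fly
```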
Automatic sung speech recognition is a relatively understudied topic that has been held back by the lack of large, freely available datasets. This has recently changed thanks to the release of the DAMP Sing! dataset, an 1100-hour karaoke dataset originating from the social music-making company Smule. This paper presents work undertaken to define an easily replicable automatic speech recognition benchmark for these data. In particular, we describe how transcripts and alignments have been recovered from karaoke prompts and timings; how suitable training, development, and test sets have been defined with varying degrees of accent variability; and how language models have been developed using lyric data from the LyricWikia website. Initial recognition experiments have been performed using factored-layer TDNN acoustic models with lattice-free MMI training in Kaldi. The best WER is 19.60%, a new state of the art for this type of data. The paper concludes with a discussion of the many challenging problems that remain to be solved. Dataset definitions and Kaldi scripts have been made available so that the benchmark is easily replicable.
In this paper, we propose to detect mismatches between speech and transcriptions using deep neural networks. Although many speech-related applications assume that transcriptions match the audio, such errors are hard to avoid for one reason or another, and the use of mismatched data can degrade performance when training a model. In our work, instead of detecting errors by computing the distance between manual transcriptions and the text strings obtained from a speech recogniser, we view mismatch detection as a classification task and merge speech and transcription features using deep neural networks. To enhance detection ability, we use a cross-modal attention mechanism that learns the relevance between the features obtained from the two modalities. To evaluate the effectiveness of our approach, we test it on Factored WSJCAM0 by randomly introducing three kinds of mismatch: word deletion, insertion, and substitution. To test robustness, we train our models on a small number of samples and detect mismatches with different numbers of words removed, inserted, and substituted. In our experiments, mismatch detection performance is close to 80% for insertions and deletions, outperforming the baseline.
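A hedged sketch of the cross-modal attention idea, where transcription tokens attend over speech frames before classification, is given below in PyTorch; the feature dimensions, pooling, and classification head are assumptions rather than the paper's exact model.

```python
# Sketch of cross-modal attention between speech frames and transcription tokens for
# mismatch classification; dimensions and the head are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalMismatchDetector(nn.Module):
    def __init__(self, speech_dim=80, text_dim=256, d_model=256, n_heads=4):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)       # match vs. mismatch

    def forward(self, speech_feats, text_feats):
        # text tokens query the speech frames; attention learns their relevance
        q = self.text_proj(text_feats)                # (batch, n_tokens, d_model)
        kv = self.speech_proj(speech_feats)           # (batch, n_frames, d_model)
        attended, _ = self.attn(q, kv, kv)
        return self.classifier(attended.mean(dim=1))  # pooled over tokens

model = CrossModalMismatchDetector()
logits = model(torch.randn(2, 300, 80), torch.randn(2, 25, 256))
print(logits.shape)                                   # torch.Size([2, 2])
```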
In this paper, we describe the methodology for collecting and annotating a new database designed for conducting research and development on pronunciation assessment. While a significant amount of research has been done in the area of pronunciation assessment, to our knowledge, no database is available for public use for research in the field. Considering this need, we created EpaDB (English Pronunciation by Argentinians Database), which is composed of English phrases read by native Spanish speakers with different levels of English proficiency. The recordings are annotated with ratings of pronunciation quality at phrase-level and detailed phonetic alignments and transcriptions indicating which phones were actually pronounced by the speakers. We present inter-rater agreement, the effect of each phone on overall perceived non-nativeness, and the frequency of specific pronunciation errors.
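The abstract does not state which agreement statistic is used; one common choice for ordinal pronunciation ratings, quadratic-weighted Cohen's kappa between two raters, is sketched below with hypothetical scores.

```python
# One common way to quantify inter-rater agreement on ordinal pronunciation ratings:
# quadratic-weighted Cohen's kappa between two raters. The example ratings are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 5, 2, 4, 3, 1, 5]   # phrase-level pronunciation scores from rater A
rater_b = [4, 4, 5, 2, 3, 3, 2, 5]   # same phrases scored by rater B

kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")
```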
Understanding spoken language can be impeded by factors such as noisy environments, hearing impairments, or lack of proficiency. Subtitles can help in those cases. However, for fast speech or limited screen size, it may be advantageous to compress the subtitles to their most relevant content. We therefore address automatic sentence compression in this paper. We propose a neural network model based on an encoder-decoder approach with the possibility of integrating a desired compression ratio. Using this model, we conduct a user study to investigate the effects of compressed subtitles on user experience. Our results show that compressed subtitles can suffice for comprehension but may impose additional cognitive load.
Traditional video question answering models have been designed to retrieve videos to answer input questions. A drawback of this scenario is that users have to watch the entire video to find their desired answer. Recent work presented unsupervised neural models with attention mechanisms to find moments or segments from retrieved videos to provide accurate answers to input questions. Although these two tasks look similar, the latter is more challenging because the former task only needs to judge whether the question is answered in a video and returns the entire video, while the latter is expected to judge which moment within a video matches the question and accurately returns a segment of the video. Moreover, there is a lack of labeled data for training moment detection models. In this paper, we focus on integrating video retrieval and moment detection in a unified corpus. We further develop two models — a self-attention convolutional network and a memory network — for the tasks. Experimental results on our corpus show that the neural models can accurately detect and retrieve moments in supervised settings.
Parkinson’s disease is a neurological disorder that produces different motor impairments in patients. Longitudinal assessment of the neurological state of patients is important to improve their quality of life. We introduce Apkinson, a smartphone application that continuously evaluates the speech and movement deficits of Parkinson’s patients, who receive feedback about their current state after performing different exercises. The speech assessment considers the phonation, articulation, and prosody capabilities of the patients. Movement exercises, captured with the smartphone’s inertial sensors, evaluate symptoms in the upper and lower limbs.
We present an application that detects depression from speech, based on a speech feature extraction engine. The input to the application is a read speech sample, and the output is a predicted depression severity level (Beck Depression Inventory score). The application analyses the speech sample and evaluates it using support vector regression (SVR). The developed system could assist general medical staff when no specialist is present to aid diagnosis. If there is a suspicion that the speaker is suffering from depression, specialist medical assistance should be sought. The application supports five languages: English, French, German, Hungarian, and Italian.
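A minimal sketch of the regression step, assuming utterance-level feature vectors and SVR as described, is shown below; the feature values, BDI scores, and hyper-parameters are hypothetical.

```python
# Minimal sketch of predicting a depression-severity score from utterance-level speech
# features with support vector regression. All data and hyper-parameters are hypothetical.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 30))        # stand-in for extracted speech features
y_train = rng.uniform(0, 40, size=40)      # stand-in for Beck Depression Inventory scores

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=1.0))
model.fit(X_train, y_train)
print(model.predict(rng.normal(size=(1, 30))))   # predicted BDI score for a new sample
```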