Research on speech rhythm suggests that coordination between syllables and supra-syllabic prominence defines rhythmic differences between languages. This study investigates the role of language-specific phonological processes in the emergence of the language-specific coordinative patterns underlying speech rhythm, which result from sensorimotor processes during speech production. French and German speakers repeated disyllabic utterances simultaneously with pre-recorded productions of these utterances. We manipulated the location of prominence in the stimuli and asked speakers to reproduce the heard prominence pattern. Despite the surface similarity of the recorded productions across speakers of the two languages, the analysis of laryngeal activity revealed language-dependent coordination between syllables and prominence production, suggesting an influence of language-specific phonology on speech rhythm control even when producing unfamiliar prosodic patterns.
Electrodes for decoding speech from electromyography (EMG) are typically placed on the face, requiring adhesives that are inconvenient and skin-irritating if used regularly. We explore a different device form factor, where dry electrodes are placed around the neck instead. 11-word, multi-speaker voiced EMG classifiers trained on data recorded with this device achieve 92.7% accuracy. Ablation studies reveal the importance of having more than two electrodes on the neck, and phonological analyses reveal similar classification confusions between neck-only and neck-and-face form factors. Finally, speech-EMG correlation experiments demonstrate a linear relationship between many EMG spectrogram frequencies and self-supervised speech representation dimensions.
Brain-Computer Interfaces (BCIs) open avenues for communication among individuals unable to use voice or gestures. Silent speech interfaces are one such approach for BCIs that could offer a transformative means of connecting with the external world. Performance on imagined speech decoding, however, is rather low due to, among other factors, data scarcity and the lack of a clear start and end point of the imagined speech in the brain signal. We investigate two ways in which electroencephalography (EEG) signals from articulated speech might improve imagined speech decoding: predicting the end point of the imagined speech from articulated-speech EEG, and using articulated-speech EEG as extra training data for speaker-independent imagined vowel classification. Our results show that using EEG data from articulated speech did not improve classification of vowels in imagined speech, probably due to high variability in EEG signals amongst speakers.
Direct speech synthesis from neural activity can enable individuals to communicate without articulatory movement or vocalization. A number of recent speech brain-computer interface (BCI) studies have been conducted using invasive neuroimaging techniques, which require neurosurgery to implant electrodes in the brain. In this study, we investigated the feasibility of direct speech synthesis from non-invasive magnetoencephalography (MEG) signals acquired while participants performed overt speech production tasks. We used a transformer-based framework (Squeezeformer) to convert neural signals into Mel spectrograms, followed by a neural vocoder to generate speech. Our approach achieved an average correlation coefficient of 0.95 between the target and the generated Mel spectrograms, indicating high fidelity. To the best of our knowledge, this is the first demonstration of synthesizing intelligible speech directly from non-invasive brain signals.
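As a rough illustration of how such a fidelity score can be computed, the sketch below averages per-band Pearson correlations between target and generated Mel spectrograms; the abstract does not specify whether correlation is taken per band or per frame, and the Squeezeformer-plus-vocoder pipeline itself is not reproduced here.

```python
import numpy as np

def mel_correlation(target_mel: np.ndarray, generated_mel: np.ndarray) -> float:
    """Average Pearson correlation between target and generated Mel spectrograms.

    Both arrays are assumed to have shape (time, n_mels); the correlation is
    computed per Mel band and then averaged, one common way to report a
    spectrogram-level fidelity figure.
    """
    assert target_mel.shape == generated_mel.shape
    corrs = []
    for band in range(target_mel.shape[1]):
        t, g = target_mel[:, band], generated_mel[:, band]
        corrs.append(np.corrcoef(t, g)[0, 1])
    return float(np.mean(corrs))

# toy usage with random data standing in for model output
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 80))
generated = target + 0.1 * rng.normal(size=(200, 80))
print(f"mean per-band correlation: {mel_correlation(target, generated):.3f}")
```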
The diffusion-based Acoustic-to-Articulatory Inversion (AAI) approach has shown impressive results for converting audio into Ultrasound Tongue Imaging (UTI) data with clear tongue contours. However, Mean Square Error (MSE) based diffusion models focus on the pixel error between reference and generated UTI data, inherently omitting changes in tongue movements. This leads to a discrepancy in tongue trajectory between reference and generated UTI data. To address this issue, this paper presents an Optical Flow Guided tongue trajectory generation method for training the diffusion-based AAI model. The optical flow method calculates the displacement information of the tongue contours in consecutive frames, enabling the tongue trajectory similarity between reference and generated UTI data to be used as an additional constraint for Diffusion Model network optimization. Experimental results show that our proposed diffusion-based AAI system with the additional tongue trajectory constraint outperformed the baseline system across various evaluation metrics.
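A minimal sketch of the general idea, assuming OpenCV's Farnebäck dense optical flow as the displacement estimator and an arbitrary weighting between the pixel and flow terms; the paper's exact flow method, loss formulation, and weights are not given in the abstract.

```python
import cv2
import numpy as np

def farneback_flow(frames: np.ndarray) -> np.ndarray:
    """Dense optical flow between consecutive ultrasound frames.

    `frames` is a (T, H, W) float array in [0, 1]; returns (T-1, H, W, 2)
    displacement fields computed with OpenCV's Farneback method.
    """
    frames_u8 = (frames * 255).astype(np.uint8)
    flows = []
    for prev, nxt in zip(frames_u8[:-1], frames_u8[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.stack(flows)

def combined_loss(reference: np.ndarray, generated: np.ndarray) -> float:
    """Pixel MSE plus a trajectory term: the mean squared difference between
    reference and generated flow fields (illustrative weighting of 0.1)."""
    pixel_mse = float(np.mean((reference - generated) ** 2))
    flow_mse = float(np.mean((farneback_flow(reference)
                              - farneback_flow(generated)) ** 2))
    return pixel_mse + 0.1 * flow_mse
```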
Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at rishiraij.github.io/multimodal-mri-avatar/.
Articulatory speech synthesis is a challenging task that requires mapping time-varying articulatory trajectories to speech. In recent years, deep learning methods have been proposed for speech synthesis and have achieved significant progress towards human-like speech generation. However, articulatory speech synthesis is far from human-level performance. Thus, in this work, we further improve the results of articulatory speech synthesis to enhance synthesis quality. We consider a deep learning-based sequence-to-sequence baseline. We improve upon this network using a novel approach of label-aware contrastive learning with framewise phoneme alignment to learn better representations of the articulatory trajectories. With this approach, we obtain a relative improvement in Word Error Rate (WER) of 5.8% over the baseline. We also conduct mean opinion score (MOS) tests and compute other objective metrics to further evaluate our proposed models.
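One way to realize label-aware contrastive learning over frame-level representations is a supervised-contrastive loss in which frames sharing a phoneme label (from a forced alignment) act as positives; the sketch below follows the generic SupCon formulation and is not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive_loss(frame_emb: torch.Tensor,
                                 phoneme_ids: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over frames: frames with the same phoneme
    label are positives, all other frames in the batch are negatives.

    frame_emb: (N, D) frame-level representations; phoneme_ids: (N,) labels
    from a framewise phoneme alignment.
    """
    emb = F.normalize(frame_emb, dim=-1)
    sim = emb @ emb.t() / temperature                      # (N, N) similarities
    n = emb.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=emb.device)
    pos_mask = (phoneme_ids.unsqueeze(0) == phoneme_ids.unsqueeze(1)) & ~self_mask

    # log-softmax over all other frames, then average over positives per anchor
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -pos_log_prob / pos_counts
    return loss[pos_mask.any(dim=1)].mean()                # skip anchors with no positive

# toy usage: 16 frames with 256-dim embeddings and 5 phoneme classes
loss = label_aware_contrastive_loss(torch.randn(16, 256), torch.randint(0, 5, (16,)))
print(loss.item())
```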
Auditory Attention Decoding (AAD) is a technique that determines the focus of a listener's attention in complex auditory scenes from cortical neural responses. Existing research largely examines two-talker scenarios, which is insufficient for real-world complexity. This study introduced a new AAD database for a four-talker scenario, in which speech from four distinct talkers was presented simultaneously and spatially separated while listeners' EEG was recorded. Temporal response function (TRF) analysis showed that TRFs for attended speech are stronger than those for each unattended speech stream. AAD methods based on stimulus reconstruction (SR) and cortical spatial lateralization were employed and compared. Results indicated a decoding accuracy of 77.5% with 60 s decision windows (chance level: 25%) using SR. Auditory spatial attention detection (ASAD) methods also achieved high accuracy (94.7% with DenseNet-3D in 1 s windows), demonstrating their generalization performance.
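A stimulus-reconstruction decoder of the kind referred to here is typically a backward model from time-lagged EEG to the attended speech envelope; the sketch below uses ridge regression and picks the talker whose envelope correlates best with the reconstruction. The lag span, regularization, and envelope extraction are assumptions, not the study's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lag_eeg(eeg: np.ndarray, max_lag: int = 32) -> np.ndarray:
    """Stack time-lagged copies of the EEG (T, C) into a (T, C * max_lag) design matrix."""
    T, C = eeg.shape
    lagged = np.zeros((T, C * max_lag))
    for lag in range(max_lag):
        lagged[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return lagged

def decode_attention(train_eeg, train_env, test_eeg, candidate_envs, alpha=1e3):
    """Stimulus-reconstruction AAD sketch: fit a backward model from lagged EEG
    to the attended envelope, then choose the candidate talker whose envelope
    correlates best with the reconstruction. candidate_envs: list of (T,) arrays."""
    model = Ridge(alpha=alpha).fit(lag_eeg(train_eeg), train_env)
    recon = model.predict(lag_eeg(test_eeg))
    corrs = [np.corrcoef(recon, env)[0, 1] for env in candidate_envs]
    return int(np.argmax(corrs)), corrs
```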
Recent studies have demonstrated the feasibility of localizing an attended sound source from electroencephalography (EEG) signals in a cocktail party scenario. This is referred to as EEG-enabled Auditory Spatial Attention Detection (ASAD). Despite the promise, there is a lack of ASAD datasets: most existing ones are recorded with only two speaking locations. To bridge this gap, we introduce a new Auditory Spatial Attention (ASA) dataset featuring multiple speaking locations of sound sources. The new dataset is designed to challenge and refine deep neural network solutions in real-world applications. Furthermore, we build a channel attention convolutional neural network (CA-CNN) as a reference model for ASAD, which serves as a competitive benchmark for future studies.
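The abstract does not detail the CA-CNN architecture; as an illustration of the channel-attention idea, here is a squeeze-and-excitation style block that reweights EEG channels before convolutional feature extraction (a sketch, not the reference model itself).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over EEG channels."""
    def __init__(self, n_channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(n_channels // reduction, n_channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); weight each channel by a learned gate
        weights = self.fc(x.mean(dim=-1))          # (batch, channels)
        return x * weights.unsqueeze(-1)

# usage: reweight a batch of 64-channel, 128-sample EEG windows
eeg = torch.randn(8, 64, 128)
attended = ChannelAttention(64)(eeg)
print(attended.shape)   # torch.Size([8, 64, 128])
```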
Recent work has shown that the locus of selective auditory attention in multi-speaker settings can be decoded from single-trial electroencephalography (EEG). This study represents the first effort to investigate the decoding of selective auditory attention with an ensemble model. Specifically, we combine predictions based solely on brain data from two stacked deep learning models, the SpatioTemporal Attention Network (STAnet) and the SpatioTemporal Graph Convolutional Network (ST-GCN), through an average soft-voting layer. This ensemble approach demonstrates improved generalizability within short 1-second decision windows, incorporating subtle distinctions in the spatial features extracted from the EEG by the two networks. This results in effective trial-independent prediction of spatial auditory attention, outperforming baseline models by a substantial margin of 10% across two publicly available auditory attention datasets.
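Average soft voting itself is straightforward: each model's softmax posterior is averaged and the argmax is taken as the ensemble decision. A minimal sketch (the variable names stand in for the two models' outputs):

```python
import torch

def soft_vote(stanet_logits: torch.Tensor, stgcn_logits: torch.Tensor) -> torch.Tensor:
    """Average soft voting over two decoders' class probabilities.

    Both inputs are (batch, n_classes) logits; the two softmax posteriors are
    averaged and the argmax returned as the ensemble decision.
    """
    probs = (stanet_logits.softmax(dim=-1) + stgcn_logits.softmax(dim=-1)) / 2
    return probs.argmax(dim=-1)

# toy usage with random logits for a batch of 4 windows and 2 spatial classes
print(soft_vote(torch.randn(4, 2), torch.randn(4, 2)))
```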
This study investigates the phonaesthetics and perceptual dynamics of Swiss German dialects, focusing on how particular sound features influence subjective assessments and, in doing so, contribute to dialect stereotypes. By examining 24 linguistic features of Bern and Zurich German, including nine vowels and 15 consonants in single-word utterances, we aim to fill a research gap that has so far been overlooked despite suggestions of its importance. In an online perception study, we gathered evaluations from three distinct groups of raters (N = 46) from Bern, Zurich, and Hessen, Germany, across six categories ranging from aesthetic dimensions to stereotypical dialect attributions. The findings reveal that rater origin shapes how much weight is placed on the different evaluation categories and that certain linguistic features are closely linked with specific perceptions (e.g., stupid or arrogant), which may foster negative biases against dialect speakers.
Intra-speaker variability is present even when a talker utters the same words in the same social and linguistic context. Studies have revealed that such intra-speaker trial-to-trial variability is connected to speech perception and is actively regulated during speech production. However, which parameters of the variability are under active regulation remains largely unclear. This study contributes to the discussion by examining the distributional properties of intra-speaker variability. Following up on a study showing that formants of hundreds of repetitions of the same word, measured at different points along the trajectory, are all normally distributed, we ask whether those normal distributions correspond, i.e. whether a particular repetition holds a stable position in the distributions across measurement points. Our analysis of 300 repetitions of /i, oʊ/ showed that strong correspondence typically spans one to two measurement points, and that the strength of correspondence is phoneme-dependent and position-sensitive.
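One way to operationalize such correspondence is to correlate, across repetitions, the formant values measured at different points along the trajectory; the sketch below uses Spearman rank correlation, which is an assumption, since the abstract does not name the statistic used.

```python
import numpy as np
from scipy.stats import spearmanr

def correspondence(formants: np.ndarray) -> np.ndarray:
    """Rank correspondence of repetitions between measurement points.

    `formants` has shape (n_repetitions, n_points): one formant value per
    repetition at each measurement point along the trajectory. Returns the
    matrix of Spearman correlations between measurement points; a high value
    means a repetition keeps a similar position in the distribution at both
    points.
    """
    n_points = formants.shape[1]
    corr = np.eye(n_points)
    for i in range(n_points):
        for j in range(i + 1, n_points):
            rho, _ = spearmanr(formants[:, i], formants[:, j])
            corr[i, j] = corr[j, i] = rho
    return corr

# toy usage: 300 repetitions measured at 5 points, with decaying correspondence
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
traj = base + np.cumsum(rng.normal(scale=0.5, size=(300, 5)), axis=1)
print(np.round(correspondence(traj), 2))
```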
This paper introduces a novel method for quantifying vowel overlap. There is a tension in previous work between using multivariate measures, such as those derived from empirical distributions, and the ability to control for unbalanced data and extraneous factors, as is possible when using fitted model parameters. The method presented here resolves this tension by jointly modelling all acoustic dimensions of interest and by simulating distributions from the model to compute a measure of vowel overlap. An additional benefit of this method is that computation of uncertainty becomes straightforward. We evaluate this method on corpus speech data targeting the PIN-PEN merger in four dialects of English and find that using modelled distributions to calculate Bhattacharyya affinity substantially improves results compared to empirical distributions, while the difference between multivariate and univariate modelling is subtle.
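As a rough illustration of the final step, the sketch below fits a multivariate normal to each vowel's simulated cloud and estimates the Bhattacharyya coefficient (the integral of sqrt(p*q)) by Monte Carlo importance sampling; the joint model and the simulation of distributions from it are not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bhattacharyya_affinity(sim_a: np.ndarray, sim_b: np.ndarray,
                           n_mc: int = 20000, seed: int = 0) -> float:
    """Monte Carlo Bhattacharyya affinity between two simulated vowel clouds.

    sim_a, sim_b: (n, d) samples simulated from a fitted model (e.g. F1/F2
    draws for two vowel categories). A multivariate normal is fitted to each
    cloud, and the coefficient is estimated as E_p[sqrt(q(x) / p(x))] with
    samples x drawn from p.
    """
    p = multivariate_normal(sim_a.mean(axis=0), np.cov(sim_a, rowvar=False))
    q = multivariate_normal(sim_b.mean(axis=0), np.cov(sim_b, rowvar=False))
    rng = np.random.default_rng(seed)
    x = p.rvs(size=n_mc, random_state=rng)
    return float(np.mean(np.sqrt(q.pdf(x) / p.pdf(x))))

# toy usage with two overlapping F1/F2 clouds
rng = np.random.default_rng(1)
vowel_i = rng.multivariate_normal([300, 2300], [[1500, 0], [0, 8000]], size=500)
vowel_e = rng.multivariate_normal([450, 2100], [[1500, 0], [0, 8000]], size=500)
print(bhattacharyya_affinity(vowel_i, vowel_e))
```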
This study investigates the characteristics of backchannels showing entrainment to the interlocutor's speech. The prosodic features of attentive-listening dialogues are analyzed to describe how the prosody of Japanese backchannels is affected by the preceding interlocutor's utterance. We adopt support vector regression (SVR) to model the relationships between the prosodic features of backchannels and those of the preceding utterances. We found an interrelationship between the different types of features; in particular, the F0 of backchannels is highly correlated with the power of the preceding utterance. The regression analyses show that a combination of prosodic features of the preceding utterances achieves good prediction of both the F0 and power of backchannels. The findings of this study can be applied to the automatic generation of backchannels for spoken dialogue systems to show empathy and facilitate the user's speech.
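A minimal SVR sketch with toy data, assuming a feature vector of preceding-utterance prosody (e.g. mean F0, F0 range, power, speech rate) and the backchannel's F0 as the target; the actual features and hyperparameters used in the study are not specified in the abstract.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in data: 4 prosodic features of the preceding utterance and the
# following backchannel's F0 (here driven mainly by the third feature, "power").
rng = np.random.default_rng(1)
X_preceding = rng.normal(size=(200, 4))
y_backchannel_f0 = 0.6 * X_preceding[:, 2] + 0.2 * rng.normal(size=200)

# RBF-kernel SVR as a generic stand-in for the regression used in the study
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_preceding, y_backchannel_f0)
print(model.predict(X_preceding[:5]))
```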
Speech rate has been shown to vary across social categories such as gender, age, and dialect, while also being conditioned by properties of speech planning. The utterance-length effect, whereby speech rate is faster and less variable in longer utterances, has also been shown to reduce the role of social factors once it is accounted for, leaving unclear how social factors and speech planning jointly condition speech rate. Through modelling of speech rate across 13 English speech corpora, it is found that utterance length has the largest effect on speech rate, though this effect itself varies little across corpora and speakers. While age and gender also modulate speech rate, their effects are much smaller in magnitude. These findings suggest that utterance-length effects may be conditioned by articulatory and perceptual constraints, and that social influences on speech rate should be interpreted in the broader context of how speech rate variation is structured.
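A toy mixed-effects sketch of the kind of model such a comparison implies: speech rate regressed on log utterance length, age, and gender with by-speaker random intercepts. The variable names, simulated data, and model structure are illustrative only; the study's actual models (13 corpora, additional predictors and random effects) are richer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in data: one row per utterance with speech rate (syll/s),
# log utterance length (in syllables), speaker id, age, and gender.
rng = np.random.default_rng(2)
n = 600
speakers = rng.integers(0, 30, size=n)
length = rng.integers(2, 40, size=n)
age = 20 + 2 * speakers
gender = np.where(speakers % 2 == 0, "f", "m")
rate = 4.5 + 0.4 * np.log(length) - 0.01 * age + rng.normal(0, 0.3, size=n)
data = pd.DataFrame({"speech_rate": rate, "log_len": np.log(length),
                     "age": age, "gender": gender, "speaker": speakers})

# Fixed effects for length, age, and gender; by-speaker random intercepts
fit = smf.mixedlm("speech_rate ~ log_len + age + gender",
                  data=data, groups=data["speaker"]).fit()
print(fit.summary())
```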
Little research has been conducted within the field of Forensic Speech Science to gauge a listener's ability to recognise or identify speakers when presented with samples of singing. Eight friends and two foil speakers were recorded speaking and singing to investigate the effects of speaker familiarity and singing in speaker identification tasks. The stimuli were used to create a listening test completed by members of the close social network, members of the wider social network, and general lay listeners. The study aimed to explore the impact of familiarity on an individual's ability to recognise speakers when presented with spoken and sung stimuli. The results revealed that listeners within the close social network were the most successful in the listening test. Overall, listeners performed best when both samples were spoken; however, those in the close social network were less affected by the use of sung samples and scored higher than those outside the close social network.
Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, challenges remain, especially in noisy environments and with specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition through contextual biasing of Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during transcription. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which fine-tunes the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and a reduction in overall word error rate. Specifically, in unseen-language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.
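The KG-Whisper models themselves are not a public API, but the underlying mechanism of steering Whisper's decoder with a textual prompt can be sketched with the Hugging Face transformers prompting interface (assumed available in recent versions); the keywords, checkpoint, and placeholder audio below are illustrative, and the learned keyword-spotting and prefix components of the paper are not reproduced.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Generic Whisper prompting sketch: jargon keywords are encoded as prompt
# tokens that condition the decoder during generation.
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")

audio = torch.zeros(16000)  # placeholder: 1 s of silence at 16 kHz
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

prompt_ids = processor.get_prompt_ids("angioplasty, stent, catheter",
                                      return_tensors="pt")
generated = model.generate(inputs.input_features, prompt_ids=prompt_ids)
print(processor.batch_decode(generated, skip_special_tokens=True))
```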
In recent years, advancements in automatic speech recognition (ASR) systems have led to their widespread use in applications such as call center bots and virtual assistants. However, these systems encounter challenges with adverse speech conditions, lack of contextual information, and rare-word recognition. In this paper, we propose a novel architecture that tackles these limitations by integrating Large Language Models (LLMs) and prompt mechanisms to enhance ASR accuracy. By using a pre-trained text encoder with a text adapter for task-specific adaptation and an efficient LLM-based re-prediction mechanism, our method shows remarkable results in various real-world scenarios. Our proposed system achieves an average relative word error rate improvement of 27% for conventional tasks, 30% for utterance-level contextual tasks, and 33% for word-level biasing tasks compared to a baseline ASR system on multiple public datasets.
Despite advancements of end-to-end (E2E) models in speech recognition, named entity recognition (NER) is still challenging but critical for semantic understanding. Previous studies mainly focus on various rule-based or attention-based contextual biasing algorithms. However, their performance might be sensitive to the biasing weight or degraded by excessive attention to the named entity list, along with a risk of false triggering. Inspired by the success of the class-based language model (LM) in NER in conventional hybrid systems and the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose C-FNT, a novel E2E model that incorporates class-based LMs into FNT. In C-FNT, the LM score of named entities can be associated with the name class instead of its surface form. The experimental results show that our proposed C-FNT significantly reduces error in named entities without hurting performance in general word recognition.
Deep biasing and shallow fusion methods have been demonstrated to improve the performance of end-to-end ASR effectively. However, accurate recognition often becomes challenging when specific words within the contextual phrases occur too infrequently in the training corpus or are out-of-vocabulary. To address this issue, we introduce a confidence-based homophone detector and a syllable bias model to correct context phrases that may have been recognized incorrectly. The detector exploits the confidence-distribution peaks produced by homophone substitutions in ASR decoding outputs and uses their coefficient of variation for discrimination, avoiding a loss of general performance. Experiments on the biased-word subset of Aishell-1 show that our proposed method obtains a 31.2% relative CER improvement over the baseline and a relative decrease of 52.0% for context phrases. When cascaded with the deep biasing and shallow fusion methods, the improvements become 13.7% and 33.5% respectively.
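A rough sketch of the coefficient-of-variation idea: flag decoding positions whose local confidence pattern is unusually dispersed, as a homophone substitution might produce. The window size and threshold are illustrative, not the paper's tuned values.

```python
import numpy as np

def flag_low_confidence_spans(token_confidences, threshold=0.3):
    """Flag positions whose 3-token confidence window has a high coefficient
    of variation (std / mean), hinting at a possible homophone substitution."""
    conf = np.asarray(token_confidences, dtype=float)
    flags = []
    for i in range(len(conf) - 2):
        window = conf[i:i + 3]
        cv = window.std() / max(window.mean(), 1e-8)
        if cv > threshold:
            flags.append(i)
    return flags

# windows containing the low-confidence token (0.40) are flagged
print(flag_low_confidence_spans([0.95, 0.92, 0.40, 0.93, 0.90]))
```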
Existing research suggests that automatic speech recognition (ASR) models can benefit from additional contexts (e.g., contact lists, user-specified vocabulary): rare words and named entities can be better recognized with contexts. In this work, we propose two simple yet effective techniques to improve context-aware ASR models. First, we inject contexts into the encoders at an early stage instead of merely at their last layers. Second, to force the model to leverage the contexts during training, we perturb the reference transcription with alternative spellings so that the model learns to rely on the contexts to make correct predictions. On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative to no biasing and shallow fusion respectively, setting a new state of the art. On SPGISpeech and a real-world dataset, ConEC, our techniques also yield good improvements over the baselines.
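The second technique (reference perturbation) can be sketched as a simple training-time augmentation that swaps words in the reference for alternative spellings; the spelling table and swap probability below are placeholders, and in practice the alternatives would come from pronunciation- or edit-distance-based candidate generation.

```python
import random

# Illustrative alternative-spelling table (placeholder entries).
ALT_SPELLINGS = {
    "kaity": ["katie", "katy"],
    "philip": ["phillip"],
    "gray": ["grey"],
}

def perturb_reference(transcript: str, prob: float = 0.5, seed: int = 0) -> str:
    """Randomly swap words in the reference for alternative spellings so that
    the model must rely on the provided context list to recover the original
    surface form during training."""
    rng = random.Random(seed)
    words = []
    for word in transcript.lower().split():
        alts = ALT_SPELLINGS.get(word)
        words.append(rng.choice(alts) if alts and rng.random() < prob else word)
    return " ".join(words)

print(perturb_reference("call kaity and philip", prob=1.0))
```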
Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.
This paper explores the challenge of recognising relevant but previously unheard named entities in spoken input. This scenario pertains to real-world applications where establishing an automatic speech recognition (ASR) model trained on new entity phrases may not be efficient. We propose a technique that involves fine-tuning a Whisper model with a list of entity phrases as prompts. We establish a task-specific dataset where stratification of different entity phrases supports evaluation of three different scenarios in which entities might be encountered. We focus our analysis on a seen-but-unheard scenario, reflecting a situation where only textual representations of novel entity phrases are available for a commercial banking assistant bot. We show that a model tuned to anticipate prompts reflecting novel named entities makes substantial improvements in entity recall over non-tuned baseline models, and meaningful improvements in performance over models fine-tuned without a prompt.
Adapting End-to-End ASR models to out-of-domain datasets with text data is challenging. The factorized neural Transducer (FNT) aims to address this issue by introducing a separate vocabulary decoder to predict the vocabulary. Nonetheless, this approach has limitations in fusing acoustic and language information seamlessly. Moreover, a degradation in word error rate (WER) on general test sets was also observed, leading to doubts about its overall performance. In response to this challenge, we present the improved factorized neural Transducer (IFNT), a model structure designed to comprehensively integrate acoustic and language information while enabling effective text adaptation. We assess the performance of our proposed method on English and Mandarin datasets. The results indicate that IFNT not only surpasses the neural Transducer and FNT in baseline performance in both scenarios but also exhibits superior adaptation ability compared to FNT. On source domains, IFNT demonstrated statistically significant accuracy improvements, achieving a relative enhancement of 1.2% to 2.8% in baseline accuracy compared to the neural Transducer. On out-of-domain datasets, IFNT shows relative WER (CER) improvements of up to 30.2% over the standard neural Transducer with shallow fusion, and relative WER (CER) reductions ranging from 1.1% to 2.8% on test sets compared to the FNT model.
Recent research on speech models that are jointly pre-trained with text has unveiled their promising potential to enhance speech representations by encoding both speech and text within a shared space. However, these models often struggle with interference between the speech and text modalities, which makes cross-modality alignment hard to achieve. Furthermore, evaluation of these models has previously focused on neutral speech scenarios; their effectiveness on domain-shifted speech, notably emotional speech, has remained largely unexplored. In this study, a modality translation model is proposed that aligns the speech and text modalities in a shared space for speech-to-text translation, and this shared representation is harnessed to address the challenge of emotional speech recognition. Experimental results show that the proposed method achieves about a 3% absolute improvement in word error rate compared with speech models.