Accents can have a detrimental impact on interpersonal evaluations. However, the influence of specific language errors remains less understood. The present study tests how accent strength (constructed as a graded factor obtained through ratings) impacts evaluations of speakers' personality traits (warmth, competence) in the German-Polish context. Moreover, this study combines accentedness with two typical L2 error types: phonological (vowel substitutions) and grammatical (gender agreement errors). Results indicate that L2 accent strength had an unfavorable effect on speakers' perceived competence but positively influenced warmth. Perceived competence was reduced by both error types (phonological and grammatical), while warmth was decreased solely by grammatical errors. The error effects diminished with increasing L2 accent strength. Finally, Polish participants were less sensitive to errors, and particularly resistant to phonological substitutions, when rating speakers' personality.
As handling code-switching becomes an increasingly important topic in speech technology, driven by the expansion of low-resource and multilingual methodologies, it is vital to recognize the diversity of code-switching as a phenomenon. We propose a framework that leverages linguistic findings as makeshift ground truths to assess the quality and sufficiency of existing metrics designed to capture datasets' differing code-switching styles. We also introduce a new metric, T-index, which leverages machine translation systems to capture properties of code-switched words in relation to the participating language pair. Through analysis of diverse Hindi-English and Mandarin-English datasets, we systematically explore how well these metrics align with linguistic intuition regarding code-switching richness levels in conversational versus technical domains.
Receptive multilingualism is a form of communication in which speakers can comprehend an utterance in a foreign language (Lx) using their native language (L1) when L1 and Lx share similarities in, e.g., vocabulary and pronunciation. The success of receptive multilingualism can be tested by examining accuracy and reaction time of auditory word recognition (AWR) of target words in lexical decision tasks. AWR in such tasks can be affected by adverse listening conditions due to environmental noise and by the presence of a preceding prime word. This study explores whether AWR of L1 words in Lx-L1 pairs (Lx = Dutch; L1 = German or English) is affected by different degrees of phonological and semantic similarity, and whether such an influence differs as a function of listening condition. We observed less accurate and slower responses in the absence of semantic similarity, but a null effect on accuracy in the absence of phonological overlap. The interaction with listening condition is language-dependent.
We introduce an extended 2D (2.5D) wave solver that blends the computational efficiency of low-dimensional models with the accuracy of 3D approaches, tailored for simulating tube geometries similar to vocal tracts. Unlike 1D and 2D models limited to radial symmetry, our lightweight 2.5D finite-difference time-domain solver handles irregular geometries constrained only by mid-sagittal symmetry. We validated our model against state-of-the-art 2D and 3D solvers for three different vocal tract geometries, each having a unique cross-sectional shape. Results show that the frequency response of the 2.5D simulations closely aligns with that of 3D simulations up to 12 kHz, with a Pearson correlation coefficient greater than 0.8 for all geometries. The proposed model also reproduces the effects of higher-order modes associated with non-cylindrical vocal tracts, surpassing the limitations of advanced 1D and 2D solvers. Moreover, it achieved a speed-up factor close to an order of magnitude compared to the 2D and 3D models.
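For readers unfamiliar with the method family, the following is a minimal 2D acoustic finite-difference time-domain sketch showing the staggered-grid pressure/velocity update that such solvers build on; the grid size, medium constants, rigid-wall boundary, and source placement are illustrative assumptions, not the authors' 2.5D formulation.

```python
import numpy as np

# Minimal 2D acoustic FDTD sketch (illustrative only, not the paper's 2.5D scheme).
c, rho = 343.0, 1.2                 # speed of sound (m/s), air density (kg/m^3)
dx = 1e-3                           # 1 mm grid spacing
dt = 0.99 * dx / (c * np.sqrt(2))   # just below the 2D Courant stability limit
nx, ny, steps = 200, 60, 500

p = np.zeros((nx, ny))              # pressure at cell centers
vx = np.zeros((nx - 1, ny))         # x-velocity on vertical cell faces
vy = np.zeros((nx, ny - 1))         # y-velocity on horizontal cell faces

for n in range(steps):
    # velocity update from the pressure gradient
    vx -= dt / (rho * dx) * (p[1:, :] - p[:-1, :])
    vy -= dt / (rho * dx) * (p[:, 1:] - p[:, :-1])
    # pressure update from the velocity divergence; rigid walls = zero normal velocity
    vx_full = np.pad(vx, ((1, 1), (0, 0)))
    vy_full = np.pad(vy, ((0, 0), (1, 1)))
    div = (vx_full[1:, :] - vx_full[:-1, :] + vy_full[:, 1:] - vy_full[:, :-1]) / dx
    p -= rho * c ** 2 * dt * div
    # soft sinusoidal source (1 kHz) injected at one end of the "tube"
    p[0, ny // 2] += np.sin(2 * np.pi * 1000.0 * n * dt)

print(p.max(), p.min())             # inspect the simulated pressure field
```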
Existing keyword spotting (KWS) systems primarily rely on predefined keyword phrases. However, the ability to recognize customized keywords is crucial for tailoring interactions with intelligent devices. In this paper, we present a novel Query-by-Example (QbyE) KWS system that employs spectral-temporal graph attentive pooling and multi-task learning. This framework aims to learn speaker-invariant and linguistically informative embeddings for QbyE KWS tasks. Within this framework, we investigate three distinct network architectures for encoder modeling: LiCoNet, Conformer, and ECAPA_TDNN. Experimental results on a substantial internal dataset of 629 speakers demonstrate the effectiveness of the proposed QbyE framework in maximizing the potential of simpler models such as LiCoNet. In particular, LiCoNet, which is 13x more efficient, achieves performance comparable to the computationally intensive Conformer model (1.98% vs. 1.63% FRR at 0.3 FAs/Hr).
In recent years, an increasing focus on user convenience has led to growing interest in text-based keyword enrollment systems for keyword spotting (KWS). Since the system uses text input during the enrollment phase and audio input during actual usage, we call this task audio-text based KWS. To enable this task, both acoustic and text encoders are typically trained using deep metric learning loss functions, such as triplet- and proxy-based losses. This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings. Unlike previous studies that only compare acoustic and text embeddings on a point-to-point basis, our approach focuses on the relational structures within the embedding space by introducing the concept of Relational Proxy Loss (RPL). By incorporating RPL, we demonstrate improved performance on the Wall Street Journal (WSJ) corpus.
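Since the abstract does not give the exact form of RPL, the PyTorch sketch below illustrates one generic way to use relational structure: penalizing mismatch between the pairwise-similarity matrices of the acoustic and text embedding sets, combined with a conventional point-to-point term. The function names and loss weight are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def relational_structure_loss(acoustic_emb, text_emb):
    """Generic relational loss sketch (not the paper's exact RPL): penalize
    differences between the pairwise-similarity structures of the two embedding
    sets, so relative distances within each modality agree.

    acoustic_emb: (N, D) embeddings of N spoken keywords
    text_emb:     (N, D) embeddings of the corresponding keyword texts
    """
    a = F.normalize(acoustic_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim_a = a @ a.T                      # (N, N) intra-acoustic similarity structure
    sim_t = t @ t.T                      # (N, N) intra-text similarity structure
    off_diag = ~torch.eye(sim_a.size(0), dtype=torch.bool, device=sim_a.device)
    return F.smooth_l1_loss(sim_a[off_diag], sim_t[off_diag])

def total_loss(acoustic_emb, text_emb, lam=0.3):
    # conventional point-to-point matching plus the relational term (weight assumed)
    point = 1.0 - F.cosine_similarity(acoustic_emb, text_emb, dim=-1).mean()
    return point + lam * relational_structure_loss(acoustic_emb, text_emb)
```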
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment. For every input frame, the proposed method finds the optimal alignment ending at that frame using connectionist temporal classification (CTC) and aggregates the frame-level acoustic embeddings (AE) into a higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword text. We then calculate the similarity between the aggregated AE and the TE. To the best of our knowledge, this is the first attempt to dynamically align the audio and the keyword text on the fly to obtain a joint audio-text embedding for KWS. Despite operating in a streaming fashion, our approach achieves performance competitive with non-streaming methods on the LibriPhrase dataset, with a mere 155K model parameters and a decoding algorithm of time complexity O(U), where U is the length of the target keyword at inference time.
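To illustrate the frame-synchronous align-and-aggregate idea, the sketch below uses a blank-free monotonic alignment over a fixed analysis window as a simplification of CTC; the window length, per-character mean pooling, and cosine scoring are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def keyword_score(frame_logp, frame_emb, keyword_ids, text_emb, window=60):
    """Simplified sketch: at frame t, force-align the keyword characters to the
    last `window` frames, mean-pool frame embeddings per character, average the
    character-level embeddings into a phrase-level acoustic embedding, and score
    it against the keyword's text embedding.

    frame_logp:  (T, V) per-frame log-posteriors over characters
    frame_emb:   (T, D) frame-level acoustic embeddings
    keyword_ids: list of U character ids of the keyword text
    text_emb:    (D,) text embedding of the keyword
    """
    T, U = frame_logp.shape[0], len(keyword_ids)
    assert window >= U, "the window must be long enough to hold the keyword"
    scores = np.full(T, -1.0)
    for t in range(window - 1, T):
        lp = frame_logp[t - window + 1: t + 1]           # (W, V) window posteriors
        em = frame_emb[t - window + 1: t + 1]            # (W, D) window embeddings
        W = lp.shape[0]
        dp = np.full((W, U), -np.inf)
        back = np.zeros((W, U), dtype=int)               # 1 = advanced to the next char
        dp[0, 0] = lp[0, keyword_ids[0]]
        for i in range(1, W):
            for u in range(U):
                stay = dp[i - 1, u]
                adv = dp[i - 1, u - 1] if u > 0 else -np.inf
                dp[i, u] = lp[i, keyword_ids[u]] + max(stay, adv)
                back[i, u] = int(adv > stay)
        # Backtrack: which keyword character each window frame belongs to.
        assign, u = np.zeros(W, dtype=int), U - 1
        for i in range(W - 1, -1, -1):
            assign[i] = u
            u -= back[i, u]
        char_emb = np.stack([em[assign == c].mean(axis=0) for c in range(U)])
        phrase_emb = char_emb.mean(axis=0)               # phrase-level acoustic embedding
        scores[t] = float(phrase_emb @ text_emb /
                          (np.linalg.norm(phrase_emb) * np.linalg.norm(text_emb) + 1e-8))
    return scores
```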
Ensuring the robustness of keyword spotting (KWS) systems in noisy environments is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem, where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address this, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of SS front-ends and shows that incorporating keyword-specific clues into these models can greatly enhance their effectiveness. TPDT-SS shows remarkable success in addressing permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend. Additionally, fine-tuning our system on unseen mixed speech yields further performance improvements.
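A minimal sketch of clue-guided permutation handling (not the exact TPDT-SS objective): the permutation between separated outputs and references is chosen with extra weight on the target-keyword source, so the KWS backend consistently receives the keyword speaker on a fixed output channel. The MSE reconstruction loss and the weighting scheme are stand-in assumptions.

```python
import itertools
import torch
import torch.nn.functional as F

def clue_guided_permutation_loss(est_sources, ref_sources, target_ref):
    """Sketch of permutation handling for keyword-oriented separation.

    est_sources: (S, T) separated waveforms
    ref_sources: (S, T) reference waveforms
    target_ref:  index into ref_sources of the keyword utterance (the "text clue")
    """
    S = est_sources.size(0)
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(S)):
        # reconstruction loss of this output-to-reference assignment (MSE as a stand-in)
        per_src = torch.stack([F.mse_loss(est_sources[i], ref_sources[p])
                               for i, p in enumerate(perm)])
        # weight the keyword source more heavily so the clue dominates the choice (assumed)
        weights = torch.ones(S)
        weights[list(perm).index(target_ref)] = 2.0
        loss = (per_src * weights).mean()
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```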
We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR) leveraging (i) a self-supervised pre-trained model and (ii) a set of universal speech attributes (manner and place of articulation). Specifically, Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer that produces attribute sequences. A non-trainable pronunciation model then maps sequences of attributes to spoken keywords in a multilingual setting. Experiments on the Multilingual Spoken Words Corpus show performance comparable to character- and phoneme-based SKR in seen languages. Adding domain adversarial training (DAT) improves the proposed framework, outperforming both character- and phoneme-based SKR approaches with 13.73% and 17.22% relative word error rate (WER) reductions in seen languages, and achieving 32.14% and 19.92% relative WER reductions for unseen languages in zero-shot settings.
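The sketch below outlines the described pipeline under assumed details: frame features from a pre-trained encoder, a linear attribute head, and a non-trainable pronunciation table that maps decoded attribute sequences to keywords by edit distance. The attribute inventory and lexicon entries are purely illustrative.

```python
import torch
import torch.nn as nn

# Illustrative attribute inventory and lexicon (not the authors' actual sets).
ATTRIBUTES = ["sil", "stop", "fricative", "nasal", "vowel_front", "vowel_back", "bilabial", "alveolar"]
attribute_head = nn.Linear(768, len(ATTRIBUTES))   # on top of e.g. Wav2Vec2.0 frame features

LEXICON = {                                         # non-trainable pronunciation model
    "hello": ["fricative", "vowel_front", "alveolar", "vowel_back"],
    "stop":  ["fricative", "stop", "vowel_back", "stop"],
}

def edit_distance(a, b):
    """Plain Levenshtein distance between two attribute sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def recognize(frame_features):
    """frame_features: (T, 768) encoder outputs for one utterance."""
    ids = attribute_head(frame_features).argmax(dim=-1).tolist()
    seq, prev = [], None
    for i in ids:                                   # collapse repeats, drop silence
        label = ATTRIBUTES[i]
        if label != prev and label != "sil":
            seq.append(label)
        prev = label
    # pick the keyword whose canonical attribute sequence is closest
    return min(LEXICON, key=lambda kw: edit_distance(seq, LEXICON[kw]))

print(recognize(torch.randn(120, 768)))
```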
Contextual biasing has been demonstrated to be effective in improving Whisper's recall for named entities and domain-specific words. A recent work, CB-Whisper, takes an additional step and integrates a classifier for open-vocabulary keyword spotting (OV-KWS) that retrieves keywords from an external database to form a restricted biasing list. However, its heavy dependence on text-to-speech (TTS) synthesis to generate keyword audio makes the system vulnerable to TTS errors on graphemes with non-trivial phonetic transcriptions. This work proposes an extension to CB-Whisper that leverages user feedback to extend the keyword database with audio extracted from natural speech. We experiment with different learning strategies for the OV-KWS classifier to assess its domain generalization capabilities for TTS-generated or natural-speech keyword audio and for unseen languages.
Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on the standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, on most evaluation benchmarks, while showing an improved inference speed of up to 25%. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with low license restrictions. We will publicly release the code, pre-trained models, and training logs.
The multi-task learning (MTL) approach leverages pre-trained speech and machine translation models and has significantly advanced speech-to-text translation. However, it introduces a considerable number of parameters, leading to increased training costs. Most parameter-efficient fine-tuning (PEFT) methods train only additional modules to reduce the number of trainable parameters. Nevertheless, the number of trainable parameters introduced by PEFT remains non-negligible in multilingual speech translation settings. In this paper, we first propose a parameter-sharing adapter, which reduces parameters by 7/8 compared to regular adapters with only an approximately 0.7% performance decrease. To balance model parameter count and performance, we present a neural architecture search (NAS) based model. Experimental results reveal that the adapter's performance is closest to that of fine-tuning, while LoRA exhibits the poorest performance.
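As an illustration of the parameter-sharing idea, the sketch below reuses a single bottleneck adapter across all encoder layers instead of giving each layer its own copy; the sharing layout, dimensions, and the use of standard PyTorch Transformer layers are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class SharedAdapterEncoder(nn.Module):
    """Parameter-sharing sketch: all encoder layers reuse ONE adapter instance
    instead of owning their own copy (one possible sharing scheme)."""
    def __init__(self, encoder_layers: nn.ModuleList, d_model=512):
        super().__init__()
        self.layers = encoder_layers             # frozen pre-trained layers
        self.shared_adapter = Adapter(d_model)   # single set of adapter weights

    def forward(self, x):
        for layer in self.layers:
            x = self.shared_adapter(layer(x))    # same adapter applied after every layer
        return x

# With 8 layers, per-layer adapters would hold 8x these weights; sharing keeps 1/8.
layers = nn.ModuleList([nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(8)])
enc = SharedAdapterEncoder(layers)
print(enc(torch.randn(2, 50, 512)).shape)        # (2, 50, 512)
```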
In Transformer-based Speech-to-Text (S2T) translation, an encoder-decoder model is trained end-to-end to take as input an untranscribed acoustic signal in the source language and directly generate a text translation in the target language. S2T translation models can also be trained in multilingual setups where a single front-end speech encoder is shared across multiple languages. A lingering question, however, is whether the encoder represents spoken utterances in a language-neutral space. In this paper, we present an interpretability study of encoder representations in a multilingual speech translation Transformer via various probing tasks. Our main findings show that while encoder representations are not entirely language-neutral, there exists a semantic subspace that is shared across different languages. Furthermore, we discuss our findings and their implications for cross-lingual learning in spoken language understanding tasks.
Multilingual speech translation tasks typically employ retraining, regularization, or resampling methods to add new languages. Retraining the model significantly increases training time and cost. Moreover, using existing regularization or resampling methods to balance performance between new and original languages might lead to catastrophic forgetting, degrading the translation performance of the existing languages. To mitigate these issues, we store the knowledge of new languages in additional models and introduce them as pluggable modules into existing multilingual speech translation models. This approach neither significantly increases training costs nor affects the translation performance of existing models. The experimental results demonstrate that our method improves the translation performance of new languages without affecting existing translation tasks. Our code is available at https://github.com/myaxxxxx/transfer-st.
We adapt the well-known beam-search algorithm for machine translation to operate in a cascaded real-time speech translation system. This proved more complex than initially anticipated, due to four key challenges: (1) real-time processing of intermediate and final transcriptions with incomplete words from ASR, (2) emitting intermediate and final translations with minimal user-perceived latency, (3) handling beam-search hypotheses that have unequal length and different model state, and (4) handling sentence boundaries. Previous work in the field of simultaneous machine translation only implemented greedy decoding. We present a beam-search realization that handles all of the above, providing guidance through this minefield of challenges. Our approach increases the BLEU score by 1 point compared to greedy search, reduces CPU time by up to 40%, and reduces the character flicker rate by more than 20% compared to a baseline heuristic that simply retranslates the input repeatedly.
Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language. These models do not need source language identification, which improves user experience. In some cases, the input language can be given or estimated. Our goal is to use this additional language information while preserving the quality of the other languages. We accomplish this by introducing a simple and effective linear input network. The linear input network is initialized as an identity matrix, which ensures that the model can perform as well as, or better than, the original model. Experimental results show that the proposed method can successfully enhance the specified language, while keeping the language-agnostic ability of the many-to-one ST models.
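A minimal sketch of such an identity-initialized linear input network in PyTorch (the feature dimension and per-language module layout are assumptions): because the weights start as the identity and the bias at zero, the layer initially leaves the encoder input unchanged, so the model can only match or improve on the original.

```python
import torch
import torch.nn as nn

class IdentityInitLinear(nn.Module):
    """Language-specific linear input network, initialized to the identity so the
    pretrained many-to-one ST model behaves exactly as before until the layer is
    fine-tuned (sketch; the feature dimension is an assumption)."""

    def __init__(self, feat_dim: int = 80):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)
        with torch.no_grad():
            self.proj.weight.copy_(torch.eye(feat_dim))   # identity weights
            self.proj.bias.zero_()                        # zero bias

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feat_dim) acoustic features of one source language
        return self.proj(features)

# Usage: apply the layer for the given or estimated language; skip it otherwise.
layers = nn.ModuleDict({"de": IdentityInitLinear(80), "fr": IdentityInitLinear(80)})
x = torch.randn(4, 200, 80)
x = layers["de"](x)   # initially returns x unchanged
```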
Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic features and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.
Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yoruba, and propose a Yoruba-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
Our work introduces the Zero-Shot Speech Translation (ZeroST) framework, leveraging the synergistic potential of pre-trained multilingual speech and text foundation models. Inspired by recent advances in multimodal foundation models, ZeroST utilizes a Query Transformer (Q-Former) to seamlessly connect a speech foundation model, such as Whisper or Massively Multilingual Speech (MMS), with a text translation model like No-Language-Left-Behind (NLLB). Our proposed learning framework enables the model to perform speech-to-text translation in a zero-shot manner, bypassing the need for explicit supervision from expensive-to-collect speech-text translation pairs during training. Our extensive experiments, notably on the Europarl-ST benchmark, demonstrate that ZeroST achieves results comparable to those of a strong cascaded translation system and significantly outperforms baseline models. This promising approach paves the way for future research in zero-shot speech translation.
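A minimal connector sketch under stated assumptions (query count, dimensions, and a standard Transformer decoder standing in for a full Q-Former): learnable queries cross-attend to frozen speech-encoder states and are projected into the text translation model's embedding space.

```python
import torch
import torch.nn as nn

class SpeechToTextConnector(nn.Module):
    """Q-Former-style connector sketch: a fixed set of learnable queries attends to
    the frozen speech encoder's outputs, and the resulting query states are
    projected into the text translation model's embedding space. Dimensions and
    the use of nn.TransformerDecoder as a stand-in are assumptions."""

    def __init__(self, speech_dim=1024, text_dim=1024, num_queries=64, depth=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, speech_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=speech_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=depth)
        self.proj = nn.Linear(speech_dim, text_dim)

    def forward(self, speech_states):
        # speech_states: (batch, frames, speech_dim) from e.g. Whisper or MMS (frozen)
        q = self.queries.unsqueeze(0).expand(speech_states.size(0), -1, -1)
        q = self.qformer(tgt=q, memory=speech_states)   # queries attend to speech frames
        return self.proj(q)                             # (batch, num_queries, text_dim)

# These projected states would serve as soft inputs to the text translation model.
connector = SpeechToTextConnector()
soft_prompt = connector(torch.randn(2, 300, 1024))
print(soft_prompt.shape)   # (2, 64, 1024)
```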
Customer Satisfaction (CS) in call centers influences customer loyalty and the company's reputation. Traditionally, CS evaluations were conducted manually or with classical machine learning algorithms; however, advancements in deep learning have led to automated systems that evaluate CS using speech and text analyses. Previous studies have shown the text-based approach to be more accurate, but it relies on an external ASR system for transcription. This study introduces a cross-modal knowledge transfer technique, distilling knowledge from the BERT model into speech encoders such as Wav2Vec2, WavLM, and Whisper. By enriching these encoders with BERT's linguistic information, we improve speech analysis performance and eliminate the need for an ASR system. In evaluations on a dataset of customer opinions, our methods achieve over 92% accuracy in identifying CS categories, providing a faster and more cost-effective solution than traditional text-based approaches.
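The sketch below shows one way such cross-modal distillation could be wired up (the exact objective is not specified in the abstract): an utterance-level speech embedding is trained to match BERT's sentence embedding via a cosine term while also predicting the CS category. Dimensions and the loss weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledSpeechClassifier(nn.Module):
    """Sketch of cross-modal distillation: a speech encoder (e.g., Wav2Vec2/WavLM or
    the Whisper encoder, producing frame features) learns to mimic BERT's utterance
    embedding while predicting CS categories. Dimensions and weights are assumed."""

    def __init__(self, speech_dim=768, text_dim=768, num_classes=3, alpha=0.5):
        super().__init__()
        self.proj = nn.Linear(speech_dim, text_dim)      # map speech space -> BERT space
        self.classifier = nn.Linear(text_dim, num_classes)
        self.alpha = alpha                               # weight of the distillation term

    def forward(self, speech_frames, bert_cls=None, labels=None):
        # speech_frames: (batch, time, speech_dim) from the speech encoder
        utt = self.proj(speech_frames.mean(dim=1))       # mean-pool to an utterance embedding
        logits = self.classifier(utt)
        loss = None
        if labels is not None and bert_cls is not None:
            ce = F.cross_entropy(logits, labels)
            distill = 1.0 - F.cosine_similarity(utt, bert_cls, dim=-1).mean()
            loss = ce + self.alpha * distill             # task loss + teacher matching
        return logits, loss
```

At inference time only the speech branch is needed, which is what removes the dependence on an external ASR step in this setup.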
Speakers regulate vocal intensity on many occasions, for example, to be heard over a long distance or to express vocal emotions. Humans can regulate vocal intensity over a wide sound pressure level (SPL) range, and speech can therefore be categorized into different vocal intensity categories. Recent machine learning experiments have studied the classification of vocal intensity category from speech signals that were recorded without SPL information and are represented on arbitrary amplitude scales. By fine-tuning four pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, HuBERT, audio speech transformers), this paper studies the classification of speech into four intensity categories (soft, normal, loud, very loud) when speech is presented on such an arbitrary amplitude scale. The fine-tuned model embeddings showed absolute accuracy improvements of 5% and 10-12% over baselines for the target intensity category label and the SPL-based intensity category label, respectively.
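A minimal fine-tuning sketch with the Hugging Face transformers API, assuming a wav2vec2-style checkpoint and a four-way classification head; the checkpoint choice and the dummy batch are placeholders, not the paper's setup.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

LABELS = ["soft", "normal", "loud", "very_loud"]

# One possible checkpoint; any wav2vec2/HuBERT-style encoder is used the same way.
ckpt = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = AutoModelForAudioClassification.from_pretrained(
    ckpt,
    num_labels=len(LABELS),
    label2id={l: i for i, l in enumerate(LABELS)},
    id2label=dict(enumerate(LABELS)),
)

# One training step on a dummy batch of waveforms on an arbitrary amplitude scale.
waveforms = [torch.randn(16000).numpy(), torch.randn(24000).numpy()]
batch = extractor(waveforms, sampling_rate=16000, return_tensors="pt", padding=True)
labels = torch.tensor([1, 3])   # "normal", "very_loud"

outputs = model(**batch, labels=labels)
outputs.loss.backward()          # fine-tune encoder + classification head
print(outputs.logits.shape)      # (2, 4)
```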
Post-traumatic Stress Disorder (PTSD) is a mental condition that develops as a result of catastrophic events. Triggers may include experiences such as military combat, natural disasters, or sexual abuse, which have a profound influence on mental wellbeing. Due to the severity of this condition, early detection and professional treatment are crucial. For this reason, previous research has explored prediction models for recognising PTSD at an early stage. However, when these models are transferred from research to real-world applications, they face heterogeneous environments (e.g., different recording settings, various dialects or languages). To analyse this effect, we develop a speech-based PTSD recognition model and subsequently analyse its cross-corpus and cross-linguistic performance. Our experiments indicate that cross-cultural factors influence PTSD recognition, leading to a best area under the ROC curve (AUC) of 70.1% in the cross-corpus evaluation.
Among the many multilingual speakers of the world, code-switching (CSW) is a common linguistic phenomenon. Prior sociolinguistic work has shown that factors such as expressing group identity and solidarity, performing affective function, and reflecting shared experiences are related to CSW prevalence in multilingual speech. We build on prior studies by asking: is the expression of empathy a motivation for CSW in speech? To begin to answer this question, we examine several multilingual speech corpora representing diverse language families and apply recent modeling advances in the study of empathetic monolingual speech. We find a generally stronger positive relationship of spoken CSW with the lexical correlates of empathy than with acoustic-prosodic ones, which holds across three language pairs. Our work is a first step toward establishing a motivation for CSW that has thus far mainly been studied qualitatively.
Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. They also often encounter difficulties in adapting to new data for a specific task without catastrophic forgetting of previously trained tasks. In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. In addition to model compression, we expect that forgetting of previously trained tasks can be mitigated by updating only a task-specific subnetwork. We conduct experiments on top of the state-of-the-art multi-task SLU model "UniverSLU", trained for several tasks such as emotion recognition (ER), intent classification (IC), and automatic speech recognition (ASR). We show that the pruned models successfully adapt to additional ASR or IC data with minimal performance degradation on previously trained tasks.
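As a sketch of the subnetwork idea (not the authors' exact pruning recipe), the snippet below applies unstructured magnitude pruning to a stand-in model and then fine-tunes it; PyTorch's pruning re-parameterization keeps pruned weights at zero, so only the surviving subnetwork effectively receives updates.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a multi-task SLU model (UniverSLU itself is far larger).
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Prune 70% of the smallest-magnitude weights in every Linear layer (ratio assumed).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.7)

# 2) Adapt to new task data: the re-parameterization (weight = weight_orig * weight_mask)
#    zeroes both the pruned weights and their gradients, so only the subnetwork updates.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x, y = torch.randn(8, 80), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# 3) Optionally make the mask permanent before deployment.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```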
Test data is said to be out-of-distribution (OOD) when it unexpectedly differs from the training data, a common challenge in real-world use cases of machine learning. Although OOD generalisation has gained interest in recent years, few works have focused on OOD generalisation in spoken language understanding (SLU) tasks. To facilitate research on this topic, we introduce a modified version of the popular SLU dataset SLURP, featuring data splits for testing OOD generalisation in the SLU task. We call our modified dataset SLURP For OOD generalisation, or SLURPFOOD. Utilising our OOD data splits, we find that end-to-end SLU models have limited capacity for generalisation. Furthermore, by employing model interpretability techniques, we shed light on the factors contributing to the generalisation difficulties of the models. To improve generalisation, we experiment with two techniques, which improve the results on some, but not all, of the splits, emphasising the need for new techniques.