This paper introduces a dynamical systems framework for understanding stuttering, conceptualizing it as a qualitative shift in speech articulation driven by a single control parameter. Using a forced Duffing oscillator model, we demonstrate how variations in the excitation frequency can account for transitions between fluent and stuttered speech states. The model generates specific predictions about articulatory behaviors during stuttering, which we test using real-time MRI data of stuttered speech. Analysis of articulatory movements provides empirical support for the model’s predictions, suggesting that stuttering can be understood as a dynamical disease—an intact system operating outside its typical parameter range. This framework offers new insights into the nature of stuttering and potential approaches to intervention.
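As a rough illustration of the modeling idea in the abstract above (not the authors' equations or parameter values), a forced Duffing oscillator can be simulated while sweeping the excitation frequency, the hypothesized control parameter, to look for qualitative changes in the steady-state response:

    # Minimal sketch of a forced Duffing oscillator, x'' + d*x' + a*x + b*x^3 = g*cos(w*t).
    # Parameter values are illustrative only.
    import numpy as np
    from scipy.integrate import solve_ivp

    def duffing(t, y, d, a, b, g, w):
        x, v = y
        return [v, -d * v - a * x - b * x**3 + g * np.cos(w * t)]

    # Sweep the excitation frequency and record the steady-state oscillation amplitude.
    for w in np.linspace(0.8, 1.6, 9):
        sol = solve_ivp(duffing, (0.0, 400.0), [0.1, 0.0],
                        args=(0.3, -1.0, 1.0, 0.37, w), max_step=0.05)
        x_steady = sol.y[0][sol.t > 300.0]          # discard the transient
        print(f"w = {w:.2f}  amplitude ~ {x_steady.max() - x_steady.min():.3f}")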
Ultrasound has recently been suggested as an alternative to laryngoscopy for checking vocal fold movement after neck surgery. We propose to use M-mode ultrasound (MUS) to study vocal fold vibration in the left and right hemilarynges. MUS is acquired along a 1D line in each hemilarynx for about 5 seconds during vowel phonation. Post-processing estimates spatio-temporal maps of the fundamental and second harmonic frequencies. To validate our method, 108 recordings (MUS and voice) from 12 healthy subjects were acquired. We compared the fundamental frequency obtained by MUS with that from voice analysis. The median fundamental frequency was estimated by MUS with high accuracy (y=0.997x+0.293, r=0.999). In the current setup, the measurable frequency range is limited to 250 Hz. In future work, this range will be extended to 1 kHz to avoid aliasing at high pitch, and MUS will be tested on patients with vocal pathologies.
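A minimal sketch of the kind of post-processing described, assuming a synthetic M-mode image and a 500 Hz line rate (which is why estimates are capped at the 250 Hz Nyquist limit); the authors' actual pipeline is not reproduced here:

    # Estimate the dominant (fundamental) frequency at each depth of a synthetic
    # M-mode image via an FFT along the time axis. All data here are simulated.
    import numpy as np

    line_rate = 500.0                       # M-mode lines per second (Nyquist = 250 Hz)
    t = np.arange(0, 5.0, 1.0 / line_rate)  # ~5 s of phonation
    depths = 64                             # samples along the 1D scan line
    f_true = 180.0                          # simulated vibration frequency in Hz
    mmode = np.sin(2 * np.pi * f_true * t)[None, :] * np.hanning(depths)[:, None]
    mmode += 0.1 * np.random.randn(depths, t.size)

    spec = np.abs(np.fft.rfft(mmode - mmode.mean(axis=1, keepdims=True), axis=1))
    freqs = np.fft.rfftfreq(t.size, d=1.0 / line_rate)
    f0_per_depth = freqs[np.argmax(spec, axis=1)]   # one F0 estimate per depth
    print(f"median estimated F0: {np.median(f0_per_depth):.1f} Hz")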
We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
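A hypothetical PyTorch sketch of the described architecture (convolutional layers, a Transformer block, and separate predictors per articulatory feature); the channel counts, number of electrodes, and number of features are assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class EMG2Articulation(nn.Module):
        def __init__(self, n_emg=8, n_feats=12, d_model=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_emg, d_model, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(),
            )
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=2)
            # one small regression head per articulatory feature
            self.heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(n_feats))

        def forward(self, emg):                       # emg: (batch, n_emg, time)
            h = self.conv(emg).transpose(1, 2)        # (batch, time, d_model)
            h = self.transformer(h)
            return torch.cat([head(h) for head in self.heads], dim=-1)

    model = EMG2Articulation()
    print(model(torch.randn(2, 8, 100)).shape)        # torch.Size([2, 100, 12])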
Speech is produced through the coordination of vocal tract constricting organs: lips, tongue, velum, and glottis. Previous work developed Speech Inversion (SI) systems that recover acoustic-to-articulatory mappings for lip and tongue constrictions, called oral tract variables (TVs), and later enhanced them by including source information (periodic and aperiodic energies, and F0 frequency) as proxies for glottal control. Comparison of nasometric measures with high-speed nasopharyngoscopy showed that nasalance can serve as ground truth, and that an SI system trained with it reliably recovers velum movement patterns for American English speakers. Here, two SI training approaches are compared: baseline models that estimate oral TVs and nasalance independently, and a synergistic model that combines oral TVs and source features with nasalance. The synergistic model shows relative improvements of 5% in oral TV estimation and 9% in nasalance estimation over the baseline models.
Prosody conveys rich emotional and semantic information in the speech signal, as well as individual idiosyncrasies. We propose ProMode, a stand-alone model that maps text to prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both partially masked, and produces a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and the unmasked textual content. Trained on the GigaSpeech dataset, our method is compared with state-of-the-art style encoders. For F0 and energy prediction, we show consistent improvements for our model at different levels of granularity. We also integrate the predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model's potential in tasks where prosody modeling is important.
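A hedged sketch of the masked-prediction setup described above: both input streams are partially masked and the reconstruction loss is taken only over the masked acoustic frames. The encoder and decoder are replaced by stand-ins, and the shapes and 30% mask ratio are assumptions:

    import torch

    batch, frames, acoustic_dim, text_dim = 4, 200, 2, 128     # acoustics: (F0, energy) per frame
    acoustics = torch.randn(batch, frames, acoustic_dim)
    text_feats = torch.randn(batch, frames, text_dim)           # time-aligned textual features

    mask = torch.rand(batch, frames) < 0.3                      # frames whose acoustics are hidden
    masked_acoustics = acoustics.masked_fill(mask.unsqueeze(-1), 0.0)
    # encoder(masked_acoustics, partially masked text_feats) -> fixed-length prosodic embedding
    # decoder(embedding, unmasked text_feats)                 -> acoustic predictions for all frames
    predicted = torch.randn(batch, frames, acoustic_dim, requires_grad=True)  # stand-in decoder output

    loss = ((predicted - acoustics) ** 2)[mask].mean()           # supervise masked frames only
    loss.backward()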
Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
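The general mechanism, post-hoc editing of internal activations in a frozen model at inference time, can be sketched with a PyTorch forward hook; this is only an illustration of the idea, not the paper's counterfactual editing procedure:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))  # stand-in for a TTS module
    direction = torch.randn(32)                   # hypothetical steering direction
    alpha = 1.5                                   # edit strength

    def edit_activations(module, inputs, output):
        # counterfactual question: "what would the output be if this activation were shifted?"
        return output + alpha * direction

    handle = model[0].register_forward_hook(edit_activations)
    with torch.no_grad():
        edited = model(torch.randn(1, 16))        # inference with edited hidden states, no retraining
    handle.remove()                               # restore original behavior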
While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task in text-to-speech synthesis. This is particularly true for systems that model prosody explicitly through parameters such as pitch, energy, and duration, which is commonly done for the sake of interpretability and controllability. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. We compare these methods to a traditional deterministic baseline, as well as to real human realizations. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the variability inherent in human speech. Further, they open up additional controllability options by allowing the sampling temperature to be tuned.
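For reference, a minimal conditional flow matching / rectified flow training step on explicit prosody parameters might look as follows; the tiny MLP, dimensions, and conditioning are placeholders rather than the systems evaluated in the paper:

    import torch
    import torch.nn as nn

    dim, cond_dim = 3, 64                     # (pitch, energy, duration) + text conditioning
    net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))

    x1 = torch.randn(32, dim)                 # ground-truth prosody targets
    cond = torch.randn(32, cond_dim)          # conditioning (e.g., text encoder states)
    x0 = torch.randn(32, dim)                 # noise sample
    t = torch.rand(32, 1)

    x_t = (1 - t) * x0 + t * x1               # straight path between noise and data
    target_velocity = x1 - x0                 # rectified-flow regression target
    pred_velocity = net(torch.cat([x_t, cond, t], dim=-1))
    loss = ((pred_velocity - target_velocity) ** 2).mean()
    loss.backward()
    # At inference, integrating dx/dt = net(x, cond, t) from t=0 to 1 samples prosody;
    # scaling the initial noise acts like a sampling temperature.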
Prosody prediction is crucial for pitch-accent languages like Japanese in text-to-speech (TTS) synthesis. Traditional methods rely on accent labels, which are often incomplete and do not generalize well. BERT-based models, such as fo-BERT, enable fundamental frequency prediction without accent labels but have been limited to single-speaker TTS. We propose GST-BERT-TTS, a novel method for multi-speaker TTS that integrates speaker-specific style embeddings from global style tokens (GST) into the token embeddings of BERT. The proposed method enables speaker-aware fundamental frequency (fo) prediction in an accent-label-free setting. Additionally, we extend fo-BERT to predict not only log fo but also energy and duration, improving speech expressiveness. Experiments on a Japanese multi-speaker TTS corpus demonstrate that GST-BERT-TTS improves prosody prediction accuracy and synthesis quality compared with fo-BERT.
Computer-Assisted Language Learning systems provide speech with exaggeration at mispronounced word locations as feedback to L2 learners. Traditionally, recordings by expert speakers are used for this task, which limits scalability, even though advances in text-to-speech (TTS) can generate natural, native-like speech. To address this, this work proposes two novel controllable strategies for scalable speech exaggeration. The first is direct speech exaggeration, which incorporates the proposed label-conditioned tokenization into GlowTTS. The second cascades a state-of-the-art TTS with a WORLD vocoder and the proposed energy and duration modifications. A subset of the Tatoeba corpus, which we annotated with prominent words, is used for experimentation. Automatic and manual assessments reveal that the exaggerated speech produced by both the direct strategy and the cascaded strategy with duration modification is closer in quality to the prominent words in the native speaker's speech.
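A rough sketch of the signal-level step in the cascaded strategy, assuming the WORLD vocoder via the pyworld package: analyze speech, then boost spectral energy and stretch duration over an assumed word span before resynthesis. The frame indices, scaling factors, and synthetic input are illustrative only:

    import numpy as np
    import pyworld as pw

    fs = 16000
    x = np.sin(2 * np.pi * 150 * np.arange(fs) / fs).astype(np.float64)  # stand-in for real speech

    f0, sp, ap = pw.wav2world(x, fs)            # WORLD analysis: pitch, spectrum, aperiodicity
    start, end = 40, 80                         # frames covering the word to exaggerate (assumed)

    sp[start:end] *= 1.5                        # energy boost on the target word
    stretch = np.repeat(np.arange(start, end), 2)        # double the duration of those frames
    idx = np.concatenate([np.arange(start), stretch, np.arange(end, len(f0))])

    y = pw.synthesize(f0[idx].copy(), sp[idx].copy(), ap[idx].copy(), fs)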
Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but rely heavily on extensive human annotations of audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLMs to generate synthetic phrase break annotations, comparing them with traditional annotations and assessing their effectiveness across multiple languages to address the challenges of both manual annotation and speech-related variability. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.
As generative models gain attention, it is crucial to adapt them efficiently even with limited high-quality data and computational resources. In this work, we investigate parameter-efficient fine-tuning (PEFT) for low-resource text-to-speech, transferring pre-trained knowledge to a new language using only a single-speaker dataset and a single NVIDIA TITAN RTX GPU. We propose three types of adapters: a Conditioning Adapter that enhances text embeddings, a Prompt Adapter that refines input representations, and a DiT LoRA Adapter that enables efficient adaptation of the speech generation module. We further explore the optimal configuration of adapters for single-speaker and multi-speaker scenarios. Consequently, under resource constraints, we achieve effective adaptation to a new language using only 1.72% of the total parameters. Audio samples, source code, and checkpoints will be made available.
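As an illustration of the LoRA-style adapters mentioned above, a low-rank update around a frozen linear layer can be sketched as follows; the rank, scaling, and layer sizes are assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # pre-trained weights stay frozen
            self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

    layer = LoRALinear(nn.Linear(512, 512))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")    # only the low-rank matrices update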
Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.
Recently, language model (LM)-based speech synthesis models have shown remarkable naturalness and powerful zero-shot capabilities. In this paradigm, discrete speech tokens play a critical role. Prior work has proposed using automatic speech recognition (ASR) tasks to enhance the semantic information in the tokens and their alignment with text. However, the byte-pair encoding (BPE) tokenizer commonly used in ASR leads to significant differences in the text token sets of different languages, making it difficult to exploit information shared across languages. This paper proposes using the International Phonetic Alphabet (IPA) as the ASR training target to learn language-independent speech tokens. In addition, we propose a timbre converter for speaker disentanglement in the speech synthesis model. Our approach effectively improves speaker similarity and expressiveness in both multilingual and cross-lingual zero-shot speech synthesis.
Existing Punjabi text-to-speech (TTS) solutions focus on the Gurumukhi script, requiring transliteration from Shahmukhi. This leads to letter substitutions and omissions, resulting in pronunciation errors. In this study, a speech corpus, phonetic lexicon, and text analysis module for Punjabi Shahmukhi were developed. Two model architectures, Tacotron 1 and Tacotron 2 with WaveGlow, were used to build TTS models. In addition to Punjabi, Urdu TTS models were also developed. These models were benchmarked against the Urdu and Punjabi Gurumukhi TTS models provided by Meta's Massively Multilingual Speech (MMS), a high-profile multilingual speech project. Objective and subjective evaluations indicate that the Tacotron-based Urdu and Punjabi models outperform MMS in intelligibility, naturalness, and phonetic accuracy, enhancing TTS quality for these languages.
The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system augmented with a global speaker-attribute manipulation block based on normalizing flows is capable of correctly manipulating the non-persistent, localized quality of creaky voice, thus avoiding the need for a typically unreliable frame-wise creak predictor. Subjective listening tests confirm successful creak manipulation at a slightly reduced MOS score compared to the original recording.
English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech (TTS) system capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA onto a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and show that listener preferences change with different insertions.
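One plausible way to estimate such a psychometric mapping from EA to PEA, assuming binary accentedness judgments, a logistic psychometric function, and synthetic response data (the paper's actual procedure may differ):

    import numpy as np
    from scipy.optimize import curve_fit

    def psychometric(ea, midpoint, slope):
        return 1.0 / (1.0 + np.exp(-slope * (ea - midpoint)))

    ea_levels = np.linspace(0, 1, 11)
    # synthetic proportion of "sounds English-accented" responses per EA level
    p_english = psychometric(ea_levels, 0.45, 9.0) + 0.03 * np.random.randn(11)

    (midpoint, slope), _ = curve_fit(psychometric, ea_levels, p_english, p0=[0.5, 5.0])
    pea = psychometric(ea_levels, midpoint, slope)      # perceptual scale used in listening tests
    print(f"fitted midpoint {midpoint:.2f}, slope {slope:.1f}")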
Recent advancements in Spoken Language Understanding (SLU) have been driven by pre-trained speech processing models. However, deploying these models on resource-constrained devices remains challenging due to their large parameter sizes. This paper presents PruneSLU, a new method for compressing pre-trained SLU models while maintaining performance. Our approach combines vocabulary pruning and structural layer-wise pruning to reduce model size while preserving essential knowledge. After pruning, the model undergoes knowledge refinement using integration distillation and contrastive learning. Experiments on the STOP and SLURP datasets demonstrate that PruneSLU compresses a 39M model to 15M while retaining 98% of its original performance, outperforming previous compression techniques.
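A generic temperature-scaled distillation term, of the kind a pruned student is often trained with, is sketched below; the paper's integration distillation and contrastive objectives are not reproduced here:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # soften both distributions, then match the student to the teacher with KL divergence
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

    student_logits = torch.randn(8, 100, requires_grad=True)   # e.g., the pruned 15M model
    teacher_logits = torch.randn(8, 100)                        # e.g., the original 39M model
    distillation_loss(student_logits, teacher_logits).backward()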
Dialog State Tracking (DST) is an important part of Task-Oriented Dialog (TOD) systems, as it must navigate the complex flow of human conversation to accomplish a task. Most TOD systems are trained on written-style text data, and their performance plunges when deployed in spoken scenarios due to natural disfluencies and speech recognition errors. Labeled spoken-style TOD data is limited because of high data collection costs and privacy concerns. As Large Language Models (LLMs) emerge as a tool for synthetic text data generation, we explored their capability to generate spoken-style text-based TOD data. Through meticulously crafted LLM prompts, our generated labeled spoken-style TOD data improves Joint Goal Accuracy (JGA) for dedicated DST models by 3.39% absolute and 11.6% relative. In this work, we showcase our divide-and-conquer data generation strategies and DST training to improve the performance of task-specific dialog models.
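For reference, Joint Goal Accuracy (JGA) is commonly computed as the fraction of turns whose predicted dialogue state matches the gold state exactly; a minimal sketch with made-up slot values:

    def joint_goal_accuracy(predicted_states, gold_states):
        # a turn counts as correct only if every slot-value pair matches exactly
        correct = sum(p == g for p, g in zip(predicted_states, gold_states))
        return correct / len(gold_states)

    gold = [{"restaurant-food": "thai", "restaurant-area": "centre"},
            {"hotel-stars": "4"}]
    pred = [{"restaurant-food": "thai", "restaurant-area": "centre"},
            {"hotel-stars": "3"}]
    print(joint_goal_accuracy(pred, gold))   # 0.5: the second turn misses one slot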
In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We ablate different aspects of such systems, including full versus LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, and fuzzy-matching-based output post-processing, which greatly improves the performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state-of-the-art performance on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA.
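A hedged sketch of the kind of connector module described: frames from a frozen speech encoder (1024-dimensional for WavLM-large) are stacked and projected into the LLM embedding space (assumed 2048-dimensional for OLMo-1B). The downsampling factor and hidden sizes are assumptions:

    import torch
    import torch.nn as nn

    class Connector(nn.Module):
        def __init__(self, speech_dim=1024, llm_dim=2048, downsample=4):
            super().__init__()
            self.downsample = downsample
            self.proj = nn.Sequential(
                nn.Linear(speech_dim * downsample, llm_dim), nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, speech_states):                      # (batch, frames, speech_dim)
            b, t, d = speech_states.shape
            t = t - t % self.downsample                         # stack groups of consecutive frames
            stacked = speech_states[:, :t].reshape(b, t // self.downsample, d * self.downsample)
            return self.proj(stacked)                           # pseudo-tokens in the LLM space

    connector = Connector()
    print(connector(torch.randn(1, 100, 1024)).shape)           # torch.Size([1, 25, 2048])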
Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize the information in user speech that is relevant to response generation. Examining selective listening in humans, which refers to the ability to focus on and listen to the important parts of a conversation, will enable us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions made for generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.
High-quality speech conversational datasets are essential for developing and evaluating Speech-LLMs. However, collecting real-world recordings presents significant challenges including high costs, privacy concerns, and inconsistent quality, while existing synthetic approaches often lack authenticity due to limited acoustic variety and insufficient paralinguistic information. We present SpeechDialogueFactory, a framework that addresses these limitations through a three-stage pipeline: generating comprehensive metadata, creating detailed scripts, and producing utterances enriched with paralinguistic features. Our framework retrieves speaker voices from a voice bank and leverages paralinguistic tags for expressive TTS. We also introduce an automated evaluation protocol that shows strong correlation with human assessments. Experimental results demonstrate that our synthesized dialogues achieve quality comparable to human recordings while offering greater flexibility and control.
Emotion Recognition in Conversation (ERC) is essential for dialogue systems in human-computer interaction. Most existing studies primarily focus on modeling contextual information from historical interactions but often overlook the effective integration of speaker and content information. To address these challenges, we propose the "Three Ws" concept (Who, When, and What, representing speaker, context, and content information) to comprehensively capture emotional cues from historical interactions. Building on this concept, we introduce a novel model for ERC and incorporate a speaker similarity loss to enhance speaker information. Experimental results show that our model outperforms baselines, with each component making a significant contribution, especially the context information, and the speaker similarity loss further improves ERC performance. Notably, the "Three Ws" concept demonstrates robustness across both single-modal and multimodal scenarios.
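One plausible form of a speaker similarity auxiliary loss (the paper's exact formulation is not reproduced here): pull utterance representations of the same speaker together and push different-speaker pairs apart via cosine similarity:

    import torch
    import torch.nn.functional as F

    reps = F.normalize(torch.randn(6, 128, requires_grad=True), dim=-1)  # utterance embeddings
    speakers = torch.tensor([0, 0, 1, 1, 2, 2])                          # speaker label per utterance

    sim = reps @ reps.t()                                                 # pairwise cosine similarity
    same = (speakers[:, None] == speakers[None, :]).float()
    off_diag = 1.0 - torch.eye(len(speakers))

    # pull same-speaker pairs together, push different-speaker pairs apart
    loss = ((1.0 - sim) * same * off_diag + sim.clamp(min=0) * (1.0 - same)).mean()
    loss.backward()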
Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential "right to be forgotten" requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.
Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of the SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise.
Disfluencies are a characteristic of speech. We focus on the impact of a specific class of disfluency, whole-word speech substitution errors (WSSE), on LLM-based conversational recommender system performance. We develop Syn-WSSE, a psycholinguistically grounded framework for synthetically creating genre-based WSSE at varying ratios to study their impact on conversational recommender system performance. We find that LLMs are impacted differently: llama and mixtral show improved performance in the presence of these errors, while gemini, gpt-4o, and gpt-4o-mini show degraded performance. We hypothesize that this difference in model resiliency stems from differences in the pre- and post-training methods and data, and that the improved performance is due to the introduced genre diversity. Our findings indicate the importance of choosing the LLM for these systems carefully and, more broadly, that disfluencies must be explicitly designed for, as they can have unforeseen impacts.
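A toy illustration of injecting whole-word substitution errors at a controlled ratio, in the spirit of the framework described above; the actual Syn-WSSE procedure is psycholinguistically grounded and genre-based, whereas this stand-in substitutes random words from a hypothetical genre vocabulary:

    import random

    def inject_wsse(utterance, substitutes, ratio=0.2, seed=0):
        # replace roughly `ratio` of whole words with genre-related substitutes
        rng = random.Random(seed)
        words = utterance.split()
        out = [rng.choice(substitutes) if rng.random() < ratio else w for w in words]
        return " ".join(out)

    print(inject_wsse("can you recommend a good science fiction movie tonight",
                      substitutes=["thriller", "romance", "documentary"]))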