INTERSPEECH.2025 - Speech Synthesis

Total: 141

#1 Towards a dynamical model of transitions between fluent and stuttered speech

Authors: Yijing Lu, Khalil Iskarous, Louis Goldstein

This paper introduces a dynamical systems framework for understanding stuttering, conceptualizing it as a qualitative shift in speech articulation driven by a single control parameter. Using a forced Duffing oscillator model, we demonstrate how variations in the excitation frequency can account for transitions between fluent and stuttered speech states. The model generates specific predictions about articulatory behaviors during stuttering, which we test using real-time MRI data of stuttered speech. Analysis of articulatory movements provides empirical support for the model’s predictions, suggesting that stuttering can be understood as a dynamical disease—an intact system operating outside its typical parameter range. This framework offers new insights into the nature of stuttering and potential approaches to intervention.
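
As a hedged illustration of the kind of model described here, the sketch below integrates a forced Duffing oscillator and sweeps the excitation frequency as the single control parameter; the parameter values are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' exact model): integrate a forced Duffing
# oscillator x'' + delta*x' + alpha*x + beta*x**3 = gamma*cos(omega*t) and
# sweep the excitation frequency omega, treated as the control parameter.
import numpy as np
from scipy.integrate import solve_ivp

def duffing(t, y, delta, alpha, beta, gamma, omega):
    x, v = y
    return [v, -delta * v - alpha * x - beta * x**3 + gamma * np.cos(omega * t)]

delta, alpha, beta, gamma = 0.3, -1.0, 1.0, 0.5   # illustrative values only
t_eval = np.linspace(0, 200, 20000)

for omega in (0.8, 1.0, 1.2, 1.4):                # sweep the control parameter
    sol = solve_ivp(duffing, (0, 200), [0.1, 0.0], t_eval=t_eval,
                    args=(delta, alpha, beta, gamma, omega), rtol=1e-8)
    x = sol.y[0][len(t_eval) // 2:]               # discard the transient
    print(f"omega={omega:.1f}  amplitude range={x.max() - x.min():.3f}")
```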

Subject: INTERSPEECH.2025 - Speech Synthesis


#2 Study of vocal fold vibration using M-mode ultrasound: a proof of concept

Authors: Juliette Dindart, Agnès Rouxel, Crystal Lin, Trung Kien Bui, Muriel Lefort, Claire Pillot-Loiseau, Christophe Trésallet, Frédérique Frouin

Ultrasound has recently been suggested as an alternative to laryngoscopy for checking vocal fold movement after neck surgery. We propose to use M-mode ultrasound (MUS) to study vocal fold vibration in left and right hemilarynges. MUS is acquired along a 1D line in each hemilarynx for about 5 seconds during vowel phonation. Post-processing estimates spatio-temporal maps of fundamental and second harmonic frequencies. To validate our method, 108 recordings (MUS and voice) from 12 healthy subjects were acquired. We compared the fundamental frequency obtained by MUS with that from voice analysis. The median fundamental frequency was estimated by MUS with high accuracy (y=0.997x+0.293, r=0.999). In the current trial, estimated frequencies are limited to 250 Hz. In future work, the measurable frequency range will be extended to 1 kHz to avoid aliasing at high pitch, and MUS will be tested on patients with vocal pathologies.
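
For readers unfamiliar with the post-processing step, the sketch below shows one plausible way to turn an M-mode matrix into a per-depth fundamental-frequency estimate; the synthetic data, line rate, and peak-picking rule are assumptions, not the authors' pipeline.

```python
# Illustrative sketch only: estimate a fundamental-frequency profile from an
# M-mode matrix (depth x time), assuming a known line rate `fs` in Hz. A
# sliding window over time would add the temporal dimension of the map.
import numpy as np

def f0_profile(m_mode, fs, fmin=60.0, fmax=250.0):
    depths, n = m_mode.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    f0 = np.full(depths, np.nan)
    for d in range(depths):
        spec = np.abs(np.fft.rfft(m_mode[d] - m_mode[d].mean()))
        f0[d] = freqs[band][np.argmax(spec[band])]   # strongest peak in band
    return f0   # one F0 estimate per depth along the scan line

# Example with synthetic data: 128 depths, 5 s at a 1 kHz line rate
fs = 1000.0
t = np.arange(int(5 * fs)) / fs
synthetic = np.sin(2 * np.pi * 180 * t)[None, :] * np.ones((128, 1))
print(np.nanmedian(f0_profile(synthetic, fs)))   # ~180 Hz
```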

Subject: INTERSPEECH.2025 - Speech Synthesis


#3 Articulatory Feature Prediction from Surface EMG during Speech Production

Authors: Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica Gonzalez-Machorro, Yoonjeong Lee, Björn W. Schuller, Louis Goldstein, Shrikanth Narayanan

We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
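
A hedged sketch of the described architecture (convolutional front-end, Transformer block, separate per-feature predictors) is given below; all layer sizes and the number of articulatory features are assumptions.

```python
# Sketch of a conv + Transformer EMG-to-articulation model with one small
# regression head per articulatory feature. Hyperparameters are illustrative,
# not the authors' configuration.
import torch
import torch.nn as nn

class EMG2Articulation(nn.Module):
    def __init__(self, n_channels=8, d_model=256, n_features=9):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # separate predictor per articulatory feature
        self.heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(n_features))

    def forward(self, emg):                 # emg: (batch, channels, samples)
        h = self.conv(emg).transpose(1, 2)  # -> (batch, frames, d_model)
        h = self.encoder(h)
        return torch.cat([head(h) for head in self.heads], dim=-1)

model = EMG2Articulation()
print(model(torch.randn(2, 8, 1000)).shape)   # torch.Size([2, 250, 9])
```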

Subject: INTERSPEECH.2025 - Speech Synthesis


#4 Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality

Authors: Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Espy-Wilson

Speech is produced through the coordination of vocal tract constricting organs: lips, tongue, velum, and glottis. Previous work developed Speech Inversion (SI) systems to recover acoustic-to-articulatory mappings for lip and tongue constrictions, called oral tract variables (TVs), which were later enhanced by including source information (periodic and aperiodic energies, and F0) as proxies for glottal control. Comparison of nasometric measures with high-speed nasopharyngoscopy showed that nasalance can serve as ground truth, and that an SI system trained with it reliably recovers velum movement patterns for American English speakers. Here, two SI training approaches are compared: baseline models that estimate oral TVs and nasalance independently, and a synergistic model that combines oral TVs and source features with nasalance. The synergistic model shows relative improvements of 5% in oral TV estimation and 9% in nasalance estimation compared to the baseline models.

Subject: INTERSPEECH.2025 - Speech Synthesis


#5 ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs

Authors: Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan

Prosody conveys rich emotional and semantic information in the speech signal, as well as individual idiosyncrasies. We propose a stand-alone model that maps text to prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both of which are partially masked, and produces a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and the unmasked textual content. Trained on the GigaSpeech dataset, we compare our method with state-of-the-art style encoders. For F0 and energy predictions, we show consistent improvements for our model at different levels of granularity. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model’s potential in tasks where prosody modeling is important.

Subject: INTERSPEECH.2025 - Speech Synthesis


#6 Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

Authors: Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi

Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
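
The sketch below shows only the general mechanics of post-hoc activation editing in a frozen model via forward hooks; the paper's counterfactual editing rule itself is not reproduced, and `direction` and `strength` are placeholders.

```python
# Sketch of editing internal representations of a pre-trained model at
# inference time using a PyTorch forward hook; no retraining is involved.
import torch
import torch.nn as nn

def make_editor(direction: torch.Tensor, strength: float):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        # shift hidden states along a chosen direction at inference time
        return output + strength * direction.to(output.dtype)
    return hook

# toy stand-in for one block of a pre-trained TTS model
block = nn.Linear(16, 16)
handle = block.register_forward_hook(make_editor(torch.randn(16), strength=2.0))

with torch.no_grad():
    edited = block(torch.randn(1, 16))   # output now includes the edit
print(edited.shape)
handle.remove()                          # restore the original behavior
```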

Subject: INTERSPEECH.2025 - Speech Synthesis


#7 Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis

Authors: Paul Mayer, Florian Lux, Alejandro Pérez-González-de-Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu

While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task in text-to-speech synthesis. This is particularly true for systems that model prosody explicitly through parameters such as pitch, energy, and duration, which is commonly done for the sake of interpretability and controllability. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. We compare these methods to a traditional deterministic baseline, as well as to real human realizations. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the variability inherent in human speech. Further, they open up additional controllability options by allowing the sampling temperature to be tuned.
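
As a reference point for one family of methods compared here, the sketch below writes out a minimal conditional flow-matching objective in its rectified-flow form; the actual networks, conditioning, and parameterization used in the paper are not reproduced.

```python
# Minimal conditional flow-matching loss (rectified-flow form) for a prosody
# vector field v(x_t, t, cond); dimensions and conditioning are placeholders.
import torch

def cfm_loss(v_net, x1, cond):
    """x1: target prosody features (batch, dim); cond: text conditioning."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                   # straight-line interpolation
    target_velocity = x1 - x0                      # ideal transport direction
    pred = v_net(xt, t, cond)
    return torch.nn.functional.mse_loss(pred, target_velocity)

# toy usage with a stand-in network
v_net = lambda xt, t, c: torch.zeros_like(xt)
print(cfm_loss(v_net, torch.randn(4, 2), cond=None))
```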

Subject: INTERSPEECH.2025 - Speech Synthesis


#8 GST-BERT-TTS: Prosody Prediction Without Accentual Labels For Multi-Speaker TTS Using BERT With Global Style Tokens

Authors: Tadashi Ogura, Takuma Okamoto, Yamato Ohtani, Erica Cooper, Tomoki Toda, Hisashi Kawai

Prosody prediction is crucial for pitch-accent languages like Japanese in text-to-speech (TTS) synthesis. Traditional methods rely on accent labels, which are often incomplete and do not generalize well. BERT-based models, such as fo-BERT, enable fundamental frequency prediction without accent labels but have been limited to single-speaker TTS. We propose GST-BERT-TTS, a novel method for multi-speaker TTS that integrates speaker-specific style embeddings from global style tokens (GST) into the token embeddings in BERT. The proposed method can realize speaker-aware fundamental frequency (fo) prediction in an accent label-free setting. Additionally, we extend fo-BERT to predict not only log fo but also energy and duration, improving speech expressiveness. Experiments using a Japanese multi-speaker TTS corpus demonstrate that GST-BERT-TTS improves the prosody prediction accuracy and synthesis quality compared with fo-BERT.

Subject: INTERSPEECH.2025 - Speech Synthesis


#9 ExagTTS: An Approach Towards Controllable Word Stress Incorporated TTS for Exaggerated Synthesized Speech Aiding Second Language Learners

Authors: Anindita Mondal, Monica Surtani, Anil Kumar Vuppala, Parameswari Krishnamurthy, Chiranjeevi Yarra

Computer-Assisted Language Learning systems provide speech exaggerated at mispronounced word locations as feedback to L2 learners. Traditionally, recordings from expert speakers are used for this task, which limits scalability, even though advances in text-to-speech (TTS) can now generate native-like, natural-sounding speech. To address this, this work proposes two novel controllable strategies for scalable speech exaggeration. One strategy is direct speech exaggeration, which incorporates the proposed label-conditioned tokenization in GlowTTS. The other strategy cascades a state-of-the-art TTS system with a WORLD vocoder using the proposed energy and duration modifications. A subset of the Tatoeba corpus, which we annotated with prominent words, is used for experimentation. Automatic and manual assessment reveals that the quality of the exaggerated speech from both the direct strategy and the cascaded strategy with duration modification is closer to that of the prominent words in native speakers' speech.

Subject: INTERSPEECH.2025 - Speech Synthesis


#10 Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Authors: Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim

Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but rely heavily on vast human annotations of audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLMs to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing the generated annotations with traditional ones and assessing their effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.
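
A minimal sketch of the synthetic-annotation idea is shown below; the prompt wording, the "|" break marker, and the `call_llm` stand-in are illustrative assumptions rather than the paper's setup.

```python
# Sketch of generating synthetic phrase-break annotations with an LLM.
# `call_llm` is a hypothetical stand-in for whatever chat-completion client
# is used; it takes a prompt string and returns the model's text reply.
PROMPT = (
    "Insert '|' where a natural phrase break would occur when this sentence "
    "is read aloud. Return only the annotated sentence.\n\nSentence: {text}"
)

def annotate_phrase_breaks(text: str, call_llm) -> list[int]:
    reply = call_llm(PROMPT.format(text=text))
    breaks, word_idx = [], 0
    for token in reply.split():
        if token == "|":
            breaks.append(word_idx)     # break after the preceding word
        else:
            word_idx += 1
    return breaks

# toy usage with a fake LLM response
fake_llm = lambda prompt: "In the morning | we left for the station |"
print(annotate_phrase_breaks("In the morning we left for the station", fake_llm))
# -> [3, 8]
```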

Subject: INTERSPEECH.2025 - Speech Synthesis


#11 Parameter-Efficient Fine-Tuning for Low-Resource Text-to-Speech via Cross-Lingual Continual Learning

Authors: Ki-Joong Kwon, Jun-Ho So, Sang-Hoon Lee

As generative models gain attention, it is crucial to adapt these models efficiently even with limited high-quality data and computational resources. In this work, we investigate parameter-efficient fine-tuning (PEFT) for low-resource text-to-speech to transfer pre-trained knowledge to a new language, leveraging only a single-speaker dataset and a single NVIDIA TITAN RTX GPU. We propose three types of adapters: the Conditioning Adapter, the Prompt Adapter, and the DiT LoRA Adapter, where the Conditioning Adapter enhances text embeddings, the Prompt Adapter refines input representations, and the DiT LoRA Adapter enables efficient adaptation of speech generation. We further explore the optimal adapter configuration for single-speaker and multi-speaker scenarios. Consequently, under resource constraints, we successfully achieve effective adaptation to a new language using only 1.72% of the total parameters. Audio samples, source code, and checkpoints will be made available.
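
For orientation, the sketch below shows a generic LoRA-style adapter around a frozen linear layer and reports the resulting trainable-parameter fraction; the rank, scaling, and placement are assumptions, not the proposed DiT LoRA Adapter configuration.

```python
# Minimal LoRA sketch: wrap a frozen linear layer with a low-rank update so
# that only the A/B matrices are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # a few percent of the layer
```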

Subject: INTERSPEECH.2025 - Speech Synthesis


#12 Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data

Authors: Qibing Bai, Sho Inoue, Shuai Wang, Zhongjie Jiang, Yannan Wang, Haizhou Li

Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.

Subject: INTERSPEECH.2025 - Speech Synthesis


#13 LIST: Language-Independent Speech Token for Multilingual Speech Synthesis with Language Models

Authors: Chang Liu, Zhen-Hua Ling, Yu Gu

Recently, language model (LM)-based speech synthesis models have shown remarkable naturalness and powerful zero-shot capabilities. In this paradigm, discrete speech tokens play a critical role. Prior work has proposed using automatic speech recognition (ASR) tasks to enhance the tokens' semantic information and their alignment with text. However, the byte-pair encoding (BPE) tokenizer commonly used in ASR leads to significant differences in the text token sets of different languages, making it difficult to exploit language-shared information. This paper proposes using the International Phonetic Alphabet (IPA) as the training target for ASR to learn language-independent speech tokens. In addition, we propose a timbre converter for speaker disentanglement in the speech synthesis model. Our proposed approach effectively improves speaker similarity and expressiveness in both multilingual and cross-lingual zero-shot speech synthesis.

Subject: INTERSPEECH.2025 - Speech Synthesis


#14 Developing High-Quality TTS for Punjabi and Urdu: Benchmarking against MMS Models

Authors: Fatima Naseem, Maham Sajid, Farah Adeeba, Sahar Rauf, Asad Mustafa, Sarmad Hussain

Existing Punjabi text-to-speech (TTS) solutions focus on the Gurumukhi script, requiring transliteration from Shahmukhi. This leads to letter substitutions and omissions, resulting in pronunciation errors. In this study, a speech corpus, phonetic lexicon, and text analysis module for Punjabi Shahmukhi were developed. Two model architectures, Tacotron 1 and Tacotron 2 with WaveGlow, were used to build TTS models. In addition to Punjabi, Urdu TTS models were also developed. These models were benchmarked against the Urdu and Punjabi Gurumukhi TTS models provided by Meta’s Massively Multilingual Speech (MMS), a high-profile multilingual speech project. Objective and subjective evaluations indicate that the Tacotron-based Urdu and Punjabi models outperform MMS in intelligibility, naturalness, and phonetic accuracy, enhancing TTS quality for these languages.

Subject: INTERSPEECH.2025 - Speech Synthesis


#15 Synthesizing Speech with Selected Perceptual Voice Qualities – A Case Study with Creaky Voice

Authors: Frederik Rautenberg, Fritz Seebauer, Jana Wiechmann, Michael Kuhlmann, Petra Wagner, Reinhold Haeb-Umbach

The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system augmented with a global speaker attribute manipulation block based on normalizing flows is capable of correctly manipulating the non-persistent, localized quality of creaky voice, thus avoiding the need for a typically unreliable frame-wise creak predictor. Subjective listening tests confirm successful creak manipulation at a slightly reduced MOS score compared to the original recordings.

Subject: INTERSPEECH.2025 - Speech Synthesis


#16 Intrasentential English in Swedish TTS: perceived English-accentedness

Authors: Christina Tånnander, David House, Jonas Beskow, Jens Edlund

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech (TTS) system capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA onto a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and of the PEA scale, and show that listener preferences change with different insertions.
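
The sketch below illustrates one way such a psychometric mapping could be estimated, by fitting a logistic function to perception-test scores; the data and functional form are assumptions, not the authors' procedure.

```python
# Illustrative sketch: fit a logistic psychometric function mapping the
# training control value EA to perceived English-accentedness (PEA).
import numpy as np
from scipy.optimize import curve_fit

def logistic(ea, k, ea0):
    return 1.0 / (1.0 + np.exp(-k * (ea - ea0)))

# hypothetical perception-test results: mean "sounds English" proportion per EA
ea_levels = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
pea_scores = np.array([0.05, 0.15, 0.45, 0.70, 0.90, 0.97])

(k, ea0), _ = curve_fit(logistic, ea_levels, pea_scores, p0=[5.0, 0.5])
print(f"slope={k:.2f}, midpoint EA={ea0:.2f}")   # EA value giving PEA = 0.5
```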

Subject: INTERSPEECH.2025 - Speech Synthesis


#17 PruneSLU: Efficient On-device Spoken Language Understanding through Vocabulary and Structural Pruning

Authors: Truong Do, Minh-Phuong Nguyen, Le-Minh Nguyen

Recent advancements in Spoken Language Understanding (SLU) have been driven by pre-trained speech processing models. However, deploying these models on resource-constrained devices remains challenging due to their large parameter sizes. This paper presents PruneSLU, a new method for compressing pre-trained SLU models while maintaining performance. Our approach combines vocabulary pruning and structural layer-wise pruning to reduce model size while preserving essential knowledge. After pruning, the model undergoes knowledge refinement using integration distillation and contrastive learning. Experiments on the STOP and SLURP datasets demonstrate that PruneSLU compresses a 39M model to 15M while retaining 98% of its original performance, outperforming previous compression techniques.
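
As an illustration of the vocabulary-pruning component only (not the full PruneSLU recipe), the sketch below keeps the token ids observed in a target corpus and rebuilds a smaller embedding table with an old-to-new id mapping.

```python
# Sketch of vocabulary pruning: retain only token ids seen in the target SLU
# corpus, copy their embedding rows, and remap ids. Sizes are illustrative.
import torch
import torch.nn as nn

def prune_vocabulary(embedding: nn.Embedding, kept_ids: list[int]):
    kept = sorted(set(kept_ids))
    new_emb = nn.Embedding(len(kept), embedding.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(embedding.weight[kept])
    old_to_new = {old: new for new, old in enumerate(kept)}
    return new_emb, old_to_new

old = nn.Embedding(30000, 256)                       # original vocabulary
new, mapping = prune_vocabulary(old, kept_ids=[5, 17, 17, 1024, 29999])
print(new.weight.shape, mapping)                     # torch.Size([4, 256]) ...
```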

Subject: INTERSPEECH.2025 - Speech Synthesis


#18 Leveraging LLMs for Written to Spoken Style Data Transformation to Enhance Spoken Dialog State Tracking

Authors: Haris Gulzar, Monikka Roslianna Busto, Akiko Masaki, Takeharu Eda, Ryo Masumura

Dialog State Tracking (DST) is an important part of Task-Oriented Dialog (TOD) systems, as it must navigate the complex flow of human conversation to accomplish a task. Most TOD systems are trained on written-style text data, and their performance plunges when deployed in spoken scenarios due to natural disfluencies and speech recognition errors. Labeled spoken-style TOD data is limited because of the high data collection cost and privacy concerns. As Large Language Models (LLMs) emerge as a tool for synthetic text data generation, we explored their capability to generate spoken-style text-based TOD data. By meticulously crafting LLM prompts, our generated labeled spoken-style TOD data improved the absolute Joint Goal Accuracy (JGA) by 3.39% and the relative JGA by 11.6% for dedicated DST models. In this work, we showcase our divide-and-conquer-based data generation strategies and DST training to improve the performance of task-specific dialog models.

Subject: INTERSPEECH.2025 - Speech Synthesis


#19 Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs

Authors: Šimon Sedláček, Bolaji Yusuf, Ján Švec, Pradyoth Hegde, Santosh Kesiraju, Oldřich Plchot, Jan Černocký

In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
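
A hedged sketch of a small connector of this kind is shown below: neighbouring encoder frames are stacked to shorten the sequence and then projected into the LLM embedding space; the dimensions and stacking factor are assumptions, not the paper's configuration.

```python
# Sketch of a connector bridging a speech encoder and an LLM: frame stacking
# for length reduction followed by a small projection MLP.
import torch
import torch.nn as nn

class Connector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=2048, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames):                       # (batch, T, enc_dim)
        b, t, d = frames.shape
        t = t - t % self.stack                       # drop the ragged tail
        stacked = frames[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(stacked)                    # (batch, T/stack, llm_dim)

connector = Connector()
print(connector(torch.randn(2, 101, 1024)).shape)    # torch.Size([2, 25, 2048])
```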

Subject: INTERSPEECH.2025 - Speech Synthesis


#20 What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Authors: Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel García Contreras, Koichiro Yoshino

Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize the information in user speech that is relevant to response generation. Examining selective listening in humans, which refers to the ability to focus on and listen to the important parts of a conversation, enables us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening during human response generation by comparing human transcriptions made for generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.

Subject: INTERSPEECH.2025 - Speech Synthesis


#21 SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation

Authors: Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

High-quality speech conversational datasets are essential for developing and evaluating Speech-LLMs. However, collecting real-world recordings presents significant challenges including high costs, privacy concerns, and inconsistent quality, while existing synthetic approaches often lack authenticity due to limited acoustic variety and insufficient paralinguistic information. We present SpeechDialogueFactory, a framework that addresses these limitations through a three-stage pipeline: generating comprehensive metadata, creating detailed scripts, and producing utterances enriched with paralinguistic features. Our framework retrieves speaker voices from a voice bank and leverages paralinguistic tags for expressive TTS. We also introduce an automated evaluation protocol that shows strong correlation with human assessments. Experimental results demonstrate that our synthesized dialogues achieve quality comparable to human recordings while offering greater flexibility and control.

Subject: INTERSPEECH.2025 - Speech Synthesis


#22 Who, When, and What: Leveraging the "Three Ws" Concept for Emotion Recognition in Conversation

Authors: Xiaohan Shi, Xingfeng Li, Tomoki Toda

Emotion Recognition in Conversation (ERC) is essential for dialogue systems in human-computer interaction. Most existing studies primarily focus on modeling contextual information from historical interactions but often overlook the effective integration of speaker and content information. To address these challenges, we propose the "Three Ws" concept (Who, When, and What, representing speaker, context, and content information) to comprehensively capture emotional cues from historical interactions. Building on this concept, we further introduce a novel model for ERC. Additionally, we incorporate a speaker similarity loss to enhance speaker information. Experimental results show that our model outperforms baselines, with each component making significant contributions, especially context information. Additionally, the speaker similarity loss further improves ERC performance. Notably, the "Three Ws" concept demonstrates robustness across both single-modal and multimodal scenarios.
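
The sketch below gives one plausible form of a speaker similarity loss, pulling same-speaker representations together via cosine similarity; the paper's exact formulation is not reproduced.

```python
# Illustrative speaker similarity loss: encourage utterance representations
# from the same speaker within a batch to be close in cosine similarity.
import torch
import torch.nn.functional as F

def speaker_similarity_loss(reprs, speaker_ids):
    """reprs: (batch, dim); speaker_ids: (batch,) integer labels."""
    sims = F.cosine_similarity(reprs.unsqueeze(1), reprs.unsqueeze(0), dim=-1)
    same = (speaker_ids.unsqueeze(1) == speaker_ids.unsqueeze(0)).float()
    same.fill_diagonal_(0.0)                         # ignore self-similarity
    if same.sum() == 0:
        return reprs.new_zeros(())
    return (1.0 - sims[same.bool()]).mean()          # push same-speaker pairs toward 1

loss = speaker_similarity_loss(torch.randn(4, 128), torch.tensor([0, 0, 1, 1]))
print(loss)
```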

Subject: INTERSPEECH.2025 - Speech Synthesis


#23 "Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding

Authors: Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis

Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential "right to be forgotten" requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.

Subject: INTERSPEECH.2025 - Speech Synthesis


#24 Evaluating Large Language Models in Data Generation for Low-Resource Scenarios: A Case Study on Question Answering

Authors: Ebru Arisoy, Merve Unlu Menevse, Yusufcan Manav, Arzucan Ozgur

Large Language Models (LLMs) are powerful tools for generating synthetic data, offering a promising solution to data scarcity in low-resource scenarios. This study evaluates the effectiveness of LLMs in generating question-answer pairs to enhance the performance of question answering (QA) models trained with limited annotated data. While synthetic data generation has been widely explored for text-based QA, its impact on spoken QA remains underexplored. We specifically investigate the role of LLM-generated data in improving spoken QA models, showing performance gains across both text-based and spoken QA tasks. Experimental results on subsets of the SQuAD, Spoken SQuAD, and a Turkish spoken QA dataset demonstrate significant relative F1 score improvements of 7.8%, 7.0%, and 2.7%, respectively, over models trained solely on restricted human-annotated data. Furthermore, our findings highlight the robustness of LLM-generated data in spoken QA settings, even in the presence of noise.

Subject: INTERSPEECH.2025 - Speech Synthesis


#25 I want a horror – comedy – movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance

Authors: Maria Teleki, Lingfeng Shi, Chengkai Liu, James Caverlee

Disfluencies are a characteristic of speech. We focus on the impact of a specific class of disfluency, whole-word speech substitution errors (WSSE), on LLM-based conversational recommender system performance. We develop Syn-WSSE, a psycholinguistically grounded framework for synthetically creating genre-based WSSE at varying ratios to study their impact on conversational recommender system performance. We find that LLMs are impacted differently: llama and mixtral show improved performance in the presence of these errors, while gemini, gpt-4o, and gpt-4o-mini show degraded performance. We hypothesize that this difference in model resiliency is due to differences in pre- and post-training methods and data, and that the improved performance is due to the introduced genre diversity. Our findings indicate the importance of a careful choice of LLM for these systems and, more broadly, that disfluencies must be carefully accounted for in system design, as they can have unforeseen impacts.
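
A toy illustration of ratio-controlled whole-word substitution injection is sketched below; the genre list, the slip-then-repair output format, and the sampling rule are assumptions, and the actual Syn-WSSE framework is psycholinguistically grounded in ways this sketch is not.

```python
# Toy sketch: inject whole-word genre substitutions at a target ratio, placing
# the slipped genre before the intended one, as in "horror comedy movie".
import random

GENRES = ["horror", "comedy", "drama", "thriller", "romance"]

def inject_wsse(utterance: str, ratio: float, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for word in utterance.split():
        if word in GENRES and rng.random() < ratio:
            slip = rng.choice([g for g in GENRES if g != word])
            out.extend([slip, word])       # slip followed by the intended word
        else:
            out.append(word)
    return " ".join(out)

print(inject_wsse("I want a comedy movie tonight", ratio=1.0))
# e.g. "I want a horror comedy movie tonight"
```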

Subject: INTERSPEECH.2025 - Speech Synthesis