INTERSPEECH.2025

| Total: 1183

#1 From Talking and Listening Devices to Intelligent Communicative Machines

Author: Roger Moore

Abstract: Having been 'in the business' of speech technology for over 50 years, I've had the pleasure of witnessing (and being involved first-hand in) many of the astounding developments that have led to the incredible solutions we have today. Indeed, my involvement in the field of spoken language has been somewhat of a love affair, and it's been a huge honour and privilege to have been working with so many excellent researchers on "the most sophisticated behaviour of the most complex organism in the known universe"! Although I've always been heavily committed to the establishment of machine learning approaches to spoken language processing - including publishing one of the first papers on the application of artificial neural networks to automatic speech recognition - my approach has always been one of attempting to uncover the underlying mechanisms of 'intelligent' (speech-based) interaction, on the basis that living systems are remarkably data-efficient in their learning. This talk will both look back (rather a long way) and look forward, asking: how did we get here, and where are we going? I hope that some of my insights may inspire others to follow a similar path.

Biography: Prof. Moore has over 50 years’ experience in Speech Technology R&D and, although an engineer by training, much of his research has been based on insights from human speech perception and production. He studied Computer & Communications Engineering at the University of Essex and was awarded the B.A. (Hons.) degree in 1973. He subsequently received the M.Sc.(Res.) and Ph.D. degrees from the same university in 1975 and 1977 respectively, both theses being on the topic of automatic speech recognition. After a period of post-doctoral research in the Phonetics Department at University College London, Prof. Moore was recruited in 1980 to establish a speech recognition research team at the Royal Signals and Radar Establishment (RSRE) in Malvern. As Head of the UK Government's Speech Research Unit from 1985 to 1999, he was responsible for the development of the Aurix range of speech technology products and the subsequent formation of 20/20 Speech Ltd. Since 2004 he has been Professor of Spoken Language Processing at the University of Sheffield, and he also holds Visiting Chairs at the Bristol Robotics Laboratory and University College London Psychology & Language Sciences. Since joining Sheffield, his research has focused on understanding the fundamental principles of speech-based interaction, and in 2017 he initiated the first in the series of international workshops on ‘Vocal Interactivity in-and-between Humans, Animals and Robots’ (VIHAR). As President of both the European Speech Communication Association (ESCA) and the Permanent Council of the International Conference on Spoken Language Processing (PC-ICSLP) from 1997, Prof. Moore pioneered their integration to form the International Speech Communication Association (ISCA). He was subsequently General Chair for INTERSPEECH-2009 and an ISCA Distinguished Lecturer during 2014-15.
He has received several awards, including the UK Institute of Acoustics Tyndall Medal for “distinguished work in the field of speech research and technology”, the NATO RTO Scientific Achievement Award for “repeated contribution in scientific and technological cooperation”, the LREC Antonio Zampolli Prize for "Outstanding Contributions to the Advancement of Language Resources & Language Technology Evaluation within Human Language Technologies", and the ISCA Special Service Medal for "Service in the establishment, leadership and international growth of ISCA". Prof. Moore is the current Editor-in-Chief of Computer Speech & Language, and an Associate Editor for Speech Communication, Languages, the Journal of Future Robot Life, and Frontiers in Robotics and AI (Computational Intelligence in Robotics).

Subject: INTERSPEECH.2025 - Keynote


#2 From Speech Science to Language Transparence

Author: Alexander Waibel

Abstract: Breaking down language barriers has been a dream of centuries. Though the problem long seemed unsolvable, we are lucky to live in the one generation that is making global communication a common reality. Such a global transformation was not thought to be possible, and it has only become possible through revolutionary advances in AI, language and speech processing. Indeed, the challenges of processing spoken language have required, caused, guided and motivated the most impactful advances in AI. During a time of knowledge-based speech and language processing, I became convinced that only data-driven machine learning could reasonably be expected to handle the complexities, uncertainty, and variability of communication, and that only latent learned representations would be able to abstract and fuse new and complementary knowledge. It turned out to work beyond our wildest expectations. Starting with small shift-invariant time-delay neural networks (TDNNs) for phonemes, we would eventually scale neural systems to massive speech, language and interpretation systems. From small-vocabulary recognition, we could advance to simultaneous interpretation, summarization, interactive dialog, multimodal systems and now automatic lip-synchronous dubbing. Despite the shift to data-driven machine learning, however, speech science was necessary to inspire the models, and observing human communication continues to motivate our ongoing work in AI. In the first part of my talk, I will revisit some of our earliest prototypes and demonstrators, and their transition into start-up companies and products in the real world. I will highlight the research advances that took us from poorly performing early attempts to human parity on popular performance benchmarks, and the lessons learned. In the second part I will discuss current research and a roadmap for the future: the dream of a language-barrier-free world between all the peoples on the planet has not yet been reached. What is the missing science, and how can we approach the remaining challenges? What do we learn from human speech interaction, and what would future machine learning models have to look like to better emulate and engage in human interaction? What are the opportunities and lessons learned for students, scientists, and entrepreneurs? The talk will include demos and examples of SOTA speech translation and dubbing systems.

Biography: Alexander Waibel is Professor of Computer Science at Carnegie Mellon University (USA) and at the Karlsruhe Institute of Technology (Germany). He is the director of the International Center for Advanced Communication Technologies. Waibel is known for his work on AI, Machine Learning, Multimodal Interfaces and Speech Translation Systems. He proposed early Neural Network-based Speech and Language systems, including, in 1987, the TDNN, the first shift-invariant (“Convolutional”) Neural Network, as well as early Neural Speech and Language systems. Based on advances in ML, he and his team developed early (’93-’98) multimodal interfaces including the first emotion recognizer, face tracker, lipreader, error repair system, a meeting browser, and support for smart rooms and human-robot collaboration. Waibel pioneered many cross-lingual communication systems that now overcome language barriers via speech and image interpretation: the first consecutive (1992) and simultaneous (2005) speech translation systems, a road-sign translator, heads-up-display translation goggles, and face/lip and EMG translators.
Waibel founded and co-founded more than 10 companies and various non-profit services to transition results from academic work to practical deployment. These include “Jibbigo LLC” (2009), the first speech translator on a phone (acquired by Facebook in 2013); “M*Modal” medical transcription and reporting (acquired by Medquist and 3M); “Kites” interpreting services for subtitling and video conferencing (acquired by Zoom in 2021); “Lecture Translator”, the first automatic simultaneous translation service (2012), deployed at universities and the European Parliament; and STS services for medical missions and disaster relief. Waibel has published ~1,000 articles, books, and patents. He is a member of the National Academy of Sciences of Germany, a Life Fellow of the IEEE, a Fellow of ISCA, a Fellow of the Explorers Club, and a Research Fellow at Zoom. Waibel has received many awards, including the IEEE Flanagan award, the ICMI sustained achievement award, the Meta prize, the A. Zampolli award, and the Alcatel-SEL award. He received his BS from MIT, and his MS and PhD degrees from CMU.

Subject: INTERSPEECH.2025 - Keynote


#3 Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications

Author: Carol Espy-Wilson

Abstract: Much of my research has involved studying how small changes in the spatiotemporal coordination of speech articulators affect variability in the acoustic characteristics of the speech signal. This interest in speech variability ultimately led me to develop a speech inversion (SI) system that recovers articulatory movements of the lips, tongue tip, and tongue body from the speech signal. Recently, we were able to extend the SI system to provide information about the velopharyngeal port opening (nasality), and we will soon investigate a methodology to uncover information about the tongue root and the size of the glottal opening. Our SI system has proven to be speaker independent and generalizes well across acoustic databases. In this talk, I will explain how we developed the SI system and the ways in which we have used it to date: for clinical purposes in mental health and speech disorder assessment, in scientific analysis of cross-linguistic speech patterns, and for improving automatic speech recognition.

Biography: Carol Espy-Wilson is a full professor in the Electrical and Computer Engineering Department and the Institute for Systems Research at the University of Maryland, College Park. She received her BS in electrical engineering from Stanford University and her MS, EE and PhD degrees in electrical engineering from the Massachusetts Institute of Technology. Dr. Espy-Wilson is a Fellow of the Acoustical Society of America (ASA), the International Speech Communication Association (ISCA) and the IEEE. She was recently elected Vice President-Elect of the ASA and to the ISCA Advisory Board. She is currently serving on the Editorial Board of Computer Speech and Language. She has been Chair of the Speech Communication Technical Committee of the ASA, an elected member of the Speech and Language Technical Committee of the IEEE, and an Associate Editor of the Journal of the Acoustical Society of America. At the National Institutes of Health, she has served on the Advisory Councils of the National Institute on Deafness and Other Communication Disorders and the National Institute of Biomedical Imaging and Bioengineering, and on the Medical Rehabilitation Advisory Board of the National Institute of Child Health and Human Development, and she has been a member of the Language and Communication Study Section. Carol directs the Speech Communication Lab, where her team combines digital signal processing, speech science, linguistics and machine learning to conduct research in speech communication. Current research projects include speech inversion, mental health assessment based on speech, video and text, speech recognition for elementary school classrooms, entrainment based on articulatory and facial gestures in unstructured conversations between neurotypical and neurodiverse participants, and speech enhancement. Her laboratory has received federal funding (NSF, NIH and DoD) and industry grants, and she has 13 patents.

Subject: INTERSPEECH.2025 - Keynote


#4 Using and comprehending language in face-to-face conversation

Author: Judith Holler

Abstract: Face-to-face conversational interaction is at the very heart of human sociality and the natural ecological niche in which language has evolved and is acquired. Yet we still know rather little about how utterances are produced and comprehended in this environment. In this talk, I will focus on how hand gestures, facial and head movements are organised to convey semantic and pragmatic meaning in conversation, as well as on how the presence and timing of these signals impact utterance comprehension and responding. Specifically, I will present studies based on complementary approaches, which feed into and inform one another. These include qualitative and quantitative multimodal corpus studies showing that visual signals indeed often occur early, and experimental comprehension studies, based on and inspired by the corpus results, that implement controlled manipulations to test for causal effects of visual bodily signals on comprehension processes and mechanisms. These experiments include behavioural and EEG studies, most of them using multimodally animated virtual characters. Together, the findings provide evidence for the hypothesis that visual bodily signals form an integral part of semantic and pragmatic meaning communication in conversational interaction, and that they facilitate language processing, especially due to their timing and the predictive potential they gain through their temporal orchestration.

Biography: Judith Holler is Associate Professor at the Donders Institute for Brain, Cognition & Behaviour, Radboud University, where she leads the research group Communication in Social Interaction, and a senior investigator at the Max Planck Institute for Psycholinguistics. Her research program investigates human language in the very environment in which it has evolved, is acquired, and is used most: face-to-face interaction. Within this context, Judith focuses on the semantics and pragmatics of human communication from a multimodal perspective, considering spoken language within the rich visual infrastructure that embeds it, such as manual gestures, head movements, facial signals, and gaze. She uses a combination of methods from different fields to investigate human multimodal communication, including quantitative conversational corpus analyses, in-situ eyetracking, and behavioural and neurocognitive experimentation using multimodal language stimuli involving virtual animations. Her research has been supported by a range of prestigious research grants from funders including the European Research Council (EU), the Dutch Science Foundation (NWO), Marie Curie Fellowships (EU), the Economic & Social Research Council (UK), Parkinson's UK, the Leverhulme Trust (UK), the British Academy (UK), the Volkswagen Stiftung (Germany) and the German Science Foundation (DFG, Mercator Fellowships).

Subject: INTERSPEECH.2025 - Keynote


#5 Speech transcription from South Tyrolean Dialect to Standard German with Whisper

Authors: Luca Ducceschi, Greta H. Franzini

This study presents the first fine-tuned Whisper model for the automatic translation of South Tyrolean dialectal speech into Standard German text. To address an unmet need for subtitling and translation, we introduce a small corpus of manually annotated and synthetic speech data compiled for this task. Through fine-tuning and hyperparameter optimisation, our model achieves a BLEU score of 86.18, significantly outperforming the baseline. Our findings highlight Whisper's effectiveness in handling dialectal speech, contributing to low-resource language research. The model is already being used in a heritage collaboration for large-scale translation of audiovisual archival material and is also being considered for application in news broadcasting and tourism promotion. Future directions include expanding the training data and extending hyperparameter optimisation to improve the model's performance and generalisation across South Tyrolean dialectal variations.
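
The reported BLEU of 86.18 is a corpus-level score. As a minimal, hedged illustration of how such a score is computed (not the authors' evaluation pipeline), sacrebleu can be applied to already-decoded hypotheses and Standard German references; the sentences below are placeholders:

```python
# Minimal sketch: corpus-level BLEU for dialect-to-Standard-German output,
# assuming hypotheses and references are already plain text (illustrative only).
import sacrebleu

hypotheses = [
    "Wir gehen morgen auf die Alm.",
    "Das Wetter ist heute schön.",
]
# sacrebleu expects one reference stream per reference set:
# references[k][i] is the k-th reference for the i-th hypothesis.
references = [[
    "Wir gehen morgen auf die Alm.",
    "Das Wetter ist heute sehr schön.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```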

Subject: INTERSPEECH.2025 - Language and Multimodal


#6 Length Aware Speech Translation for Video Dubbing

Authors: Aswin Shanmugam Subramanian, Harveen Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li

In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths - short, normal, and long - using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.
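
As a hedged sketch of the length-control idea only (the tag names, per-phoneme duration heuristic, and candidate selection below are assumptions, not the LSST/LABS implementation), a predefined length tag can be chosen by comparing the source utterance duration against rough duration estimates for the tagged candidates:

```python
# Illustrative sketch of tag-conditioned length control (not the paper's code).
# Assumption: a fixed per-phoneme duration is a rough proxy for spoken length.

LENGTH_TAGS = ["<short>", "<normal>", "<long>"]  # hypothetical tag vocabulary
SECONDS_PER_PHONEME = 0.08                       # rough, language-dependent guess

def estimated_duration(phonemes: list[str]) -> float:
    """Crude spoken-duration estimate from a phoneme sequence."""
    return len(phonemes) * SECONDS_PER_PHONEME

def pick_translation(source_duration: float, candidates: dict[str, list[str]]) -> str:
    """Pick the tag whose candidate best matches the source duration.

    `candidates` maps each length tag to the phoneme sequence produced when the
    model was conditioned on that tag (e.g. within one length-aware decoding pass).
    """
    return min(
        candidates,
        key=lambda tag: abs(estimated_duration(candidates[tag]) - source_duration),
    )

# Example: a 1.2 s source utterance and three tagged candidates (toy phonemes).
cands = {
    "<short>": ["o", "l", "a"],
    "<normal>": ["b", "w", "e", "n", "o", "s", "d", "i", "a", "s"],
    "<long>": ["m", "u", "i", "b", "w", "e", "n", "o", "s", "d", "i", "a", "s", "a", "t", "o", "d", "o", "s"],
}
print(pick_translation(1.2, cands))  # prints the closest-duration tag
```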

Subject: INTERSPEECH.2025 - Language and Multimodal


#7 ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space

Authors: Vishal Kumar, Vinayak Abrol

We present ArticulateX, the first non-autoregressive direct speech-to-speech translation (S2ST) model that operates through an articulatory latent space, offering an efficient alternative to existing cascaded models. It consists of a direct speech-to-articulator encoder, a latent articulator-to-mel-spectrogram mapper, and a vocoder for high-fidelity speech synthesis. By leveraging articulatory representations, which are inherently language-agnostic, our model effectively captures speech dynamics, preserving speaker identity, prosody and expressiveness across languages. Unlike prior autoregressive models, ArticulateX eliminates the need for intermediate text, discrete units and/or complex self-supervised objectives, enabling faster inference, stable training, and improved translation quality. We demonstrate the efficacy of the proposed model on fr-en and de-en speech-to-speech translation using the CVSS dataset, achieving BLEU scores better than or comparable to those of existing models.

Subject: INTERSPEECH.2025 - Language and Multimodal


#8 CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation

Authors: Jiale Ou, Hongying Zan

End-to-end speech translation (E2E ST) aims to directly convert speech in a source language into text in a target language, and its performance is constrained by the inherent modality gap. Existing methods attempt to align speech and text representations to perform cross-modal mixup at the token level, which overlooks the impact of redundant speech information. In this paper, we propose cross-modal mixup with speech purification for speech translation (CMSP-ST) to address this issue. Specifically, we remove the non-content features from speech through orthogonal projection and extract the purified speech features for cross-modal mixup. Additionally, we employ adversarial training under the Soft Alignment (S-Align) to relax the alignment granularity and improve robustness. Experimental results on the MuST-C En-De, CoVoST-2 Fr-En, and CoVoST-2 De-En benchmarks demonstrate that CMSP-ST effectively improves the speech translation performance of existing cross-modal mixup methods.
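
As a toy sketch of the purification step (assuming, for illustration only, that the non-content information can be summarised by a single direction estimated elsewhere, which simplifies the paper's method), removing that component from frame-level features is a standard orthogonal projection:

```python
# Toy sketch of removing a "non-content" direction from speech features by
# orthogonal projection (a simplification; not the CMSP-ST implementation).
import numpy as np

def purify(features: np.ndarray, non_content_dir: np.ndarray) -> np.ndarray:
    """Project each feature vector onto the complement of non_content_dir.

    features:        (T, D) frame-level speech representations
    non_content_dir: (D,)   assumed non-content direction (e.g. speaker/noise)
    """
    u = non_content_dir / np.linalg.norm(non_content_dir)
    return features - np.outer(features @ u, u)  # x - (x·u)u for every frame

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 16))   # placeholder frame features
u = rng.normal(size=16)         # placeholder non-content direction
x_pure = purify(x, u)
print(np.abs(x_pure @ (u / np.linalg.norm(u))).max())  # ~0: component removed
```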

Subject: INTERSPEECH.2025 - Language and Multimodal


#9 End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model

Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

We present an end-to-end speech translation (ST) model that uses a large language model (LLM) to guide the translation process. Recent advances in LLMs have shown strong contextual understanding and robustness to noisy text, making them beneficial for mitigating automatic speech recognition (ASR) errors. Building on these strengths, we develop an LLM-driven ST model within an encoder-decoder framework, with the encoder handling an auxiliary ASR task and the decoder incorporating an LLM at its front end. Here, the encoder generates an ASR hypothesis that cues the LLM to perform machine translation. The LLM output is then fed into the decoder to yield the final translation. This two-pass design capitalizes on the LLM's robust and accurate translation capabilities, while enabling end-to-end optimization tailored to specific ST tasks. Experimental results on various ST tasks reveal significant performance gains with our LLM integration, and extensive analyses further validate our approach.

Subject: INTERSPEECH.2025 - Language and Multimodal


#10 Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data

Authors: Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen

Speech-to-speech translation (S2ST) is a key technology for seamless cross-lingual communication. Traditional cascaded systems, which involve speech recognition, text translation, and speech synthesis, are prone to error propagation and latency. In this work, we present SLAM-TR, an end-to-end speech translation model that directly maps input speech to output speech, eliminating the need for intermediate text representations. By fine-tuning from the large language model Qwen2-0.5B, SLAM-TR achieves superior performance over the cascaded baseline and state-of-the-art open-source models with minimal training time. SLAM-TR also demonstrates strong generalization, achieving an ASR-BLEU score of 8.20 on the FLEURS benchmark, outperforming both cascaded and open-source systems. Finally, to address the challenge of limited natural speech translation data, we propose SynStard-1000, a 1,000-hour synthetic speech translation dataset.

Subject: INTERSPEECH.2025 - Language and Multimodal


#11 Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

Authors: Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando

We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.

Subject: INTERSPEECH.2025 - Language and Multimodal


#12 Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe

Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue: LLMs are trained on text-only data, which makes it challenging to adapt them to the speech modality with limited speech-to-speech data. To address this training difficulty, we propose scheduled interleaved speech-text training. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves translation performance, especially for languages with limited training data.
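
A hedged illustration of the scheduling idea (the linear decay and word-level mixing below are assumptions, not the paper's exact recipe): the fraction of words represented by aligned text tokens rather than speech units can simply decay with the training step:

```python
# Illustrative sketch of scheduled speech-text interleaving (not the paper's code).
# Assumption: a linear decay of the text-interleaving ratio over training steps.
import random

def text_ratio(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Linearly decay the fraction of words represented as text tokens."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def interleave(speech_units: list[list[int]], text_tokens: list[list[int]],
               ratio: float, seed: int = 0) -> list[int]:
    """For each word, emit its text tokens with probability `ratio`,
    otherwise its discrete speech units (word-level alignment assumed)."""
    rng = random.Random(seed)
    seq: list[int] = []
    for units, toks in zip(speech_units, text_tokens):
        seq.extend(toks if rng.random() < ratio else units)
    return seq

# Early in training most words are text; late in training most are speech units.
print(text_ratio(step=0, total_steps=10_000))       # 1.0
print(text_ratio(step=5_000, total_steps=10_000))   # 0.5
print(text_ratio(step=10_000, total_steps=10_000))  # 0.0
```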

Subject: INTERSPEECH.2025 - Language and Multimodal


#13 End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data

Authors: Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju, Anil Kumar Vuppala

The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
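
As a minimal, hedged sketch of similarity-based bitext mining (the LaBSE encoder, placeholder sentences, and threshold are illustrative assumptions, not the paper's setup), candidate pairs can be scored by cosine similarity of multilingual sentence embeddings:

```python
# Hedged sketch of bitext mining by cosine similarity of multilingual sentence
# embeddings (LaBSE is an illustrative choice, not necessarily the paper's encoder).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

bn_sentences = ["আমি আজ বাজারে যাব।", "সে বই পড়ছে।"]          # source-side transcripts (placeholders)
hi_sentences = ["वह किताब पढ़ रहा है।", "मैं आज बाज़ार जाऊँगा।"]   # target-side candidates (placeholders)

emb_bn = model.encode(bn_sentences, normalize_embeddings=True)
emb_hi = model.encode(hi_sentences, normalize_embeddings=True)

sim = emb_bn @ emb_hi.T                      # cosine similarity matrix
best = sim.argmax(axis=1)                    # best Hindi match per Bengali sentence
THRESHOLD = 0.7                              # assumed quality cut-off
pairs = [(bn_sentences[i], hi_sentences[j], float(sim[i, j]))
         for i, j in enumerate(best) if sim[i, j] >= THRESHOLD]
print(pairs)
```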

Subject: INTERSPEECH.2025 - Language and Multimodal


#14 Self-Improvement for Audio Large Language Model using Unlabeled Speech

Authors: Shaowen Wang, Xinyuan Chen, Yao Xu

Recent audio LLMs have emerged rapidly, demonstrating strong generalization across various speech tasks. However, given the inherent complexity of speech signals, these models inevitably suffer from performance degradation in specific target domains. To address this, we focus on enhancing audio LLMs in target domains without any labeled data. We propose a self-improvement method, SI-SDA, which leverages the information embedded in large-model decoding to evaluate the quality of generated pseudo labels and then performs domain adaptation based on reinforcement learning optimization. Experimental results show that our method consistently and significantly improves audio LLM performance, outperforming existing baselines in WER and BLEU across multiple public datasets for automatic speech recognition (ASR), spoken question answering (SQA), and speech-to-text translation (S2TT). Furthermore, our approach exhibits high data efficiency, underscoring its potential for real-world deployment.
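
As a toy sketch of one ingredient only, not the SI-SDA method itself: decoding information such as the average token log-probability can be used to rank pseudo labels and keep the most confident ones for adaptation (the keep ratio and data below are placeholders):

```python
# Toy sketch: rank pseudo labels by average token log-probability and keep
# the most confident ones for adaptation (illustrative; not the SI-SDA method).
from dataclasses import dataclass

@dataclass
class PseudoLabel:
    utterance_id: str
    text: str
    token_logprobs: list[float]   # per-token log-probabilities from decoding

    @property
    def confidence(self) -> float:
        return sum(self.token_logprobs) / max(len(self.token_logprobs), 1)

def select_confident(pseudo_labels: list[PseudoLabel], keep_ratio: float = 0.5) -> list[PseudoLabel]:
    """Keep the top `keep_ratio` fraction of pseudo labels by confidence."""
    ranked = sorted(pseudo_labels, key=lambda p: p.confidence, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

labels = [
    PseudoLabel("utt1", "turn on the light", [-0.1, -0.2, -0.1, -0.3]),
    PseudoLabel("utt2", "turn of the lite", [-1.2, -0.9, -1.5, -1.1]),
]
print([p.utterance_id for p in select_confident(labels)])  # -> ['utt1']
```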

Subject: INTERSPEECH.2025 - Language and Multimodal


#15 Evaluation of Three Automatic Alignment Tools for the Processing of Non-native French

Authors: Qian Zhou, Mathilde Hutin

The production of non-native speech is known to display "cross-language phonetic interference", which makes such speech difficult to align and label automatically. Automatic phonetic alignment refers to an automated process whereby software synchronizes speech with its transcription, usually at the phone and word levels. This method has proven useful and reliable for native speech, yet this reliability usually does not extend to non-native speech. This paper tests three major automatic aligners (WebMAUS, MFA and SPPAS) on non-native French uttered by two native speakers of Chinese, comparing their output with two manual segmentations. The paper's goal is to offer non-computational linguists a preliminary investigation on which to rely when choosing a tool for their studies in non-native phonetics or language didactics. Results show that SPPAS performs best for labeling, while WebMAUS performs best overall for both word- and phone-level segmentation and MFA performs worst.
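
A generic way to quantify aligner quality against a manual reference (a sketch of common practice; the paper's exact metrics may differ) is the absolute displacement of predicted phone boundaries and the proportion falling within a tolerance such as 20 ms:

```python
# Generic sketch of boundary-displacement evaluation for a forced aligner
# against manual segmentation (illustrative; not the paper's exact protocol).
import numpy as np

def boundary_metrics(auto: np.ndarray, manual: np.ndarray, tol: float = 0.020):
    """auto, manual: boundary times in seconds for the same phone sequence."""
    if auto.shape != manual.shape:
        raise ValueError("boundary counts must match (same segmentation inventory)")
    disp = np.abs(auto - manual)
    return {
        "mean_abs_displacement_ms": 1000 * float(disp.mean()),
        f"within_{int(tol * 1000)}ms": float((disp <= tol).mean()),
    }

auto = np.array([0.112, 0.245, 0.391, 0.530])    # aligner boundaries (placeholder)
manual = np.array([0.105, 0.250, 0.400, 0.515])  # manual boundaries (placeholder)
print(boundary_metrics(auto, manual))
```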

Subject: INTERSPEECH.2025 - Language and Multimodal


#16 CrossPhon: An Auto Phone Mapping Tool to Streamline Cross-language Modeling for Phone Alignment of Low-resource Languages

Authors: Hongchen Wu, Yixin Gu

Phone alignment matches spoken sounds with text, streamlining speech dataset creation and analysis. However, most trained aligners focus on Indo-European languages, leaving under-resourced languages unsupported. Developing new aligners for these languages requires expertise and large datasets, which are often scarce. Cross-language phone alignment offers a solution, using aligners trained on one language to align speech in another, but it traditionally relies on expert-crafted phone mappings. Our tool, CrossPhon, automates this process, making cross-language phone alignment more efficient. In tests on 14 languages from 7 families, CrossPhon achieved agreement rates of 78.95% to 97.77% with human expert mappings and delivered competitive performance in cross-language phone alignment. CrossPhon provides an efficient, reliable solution for generating cross-language phone alignments for under-resourced languages, helping to bridge the digital divide and enabling efficient study of these languages.

Subject: INTERSPEECH.2025 - Language and Multimodal


#17 Multi-lingual and Zero-Shot Speech Recognition by Incorporating Classification of Language-Independent Articulatory Features

Authors: Ryo Magoshi, Shinsuke Sakai, Jaeyoung Lee, Tatsuya Kawahara

We address multi-lingual speech recognition, including unknown or zero-shot languages, based on the International Phonetic Alphabet (IPA) and articulatory features. Articulatory features are language-independent representations of IPA symbols based on phonetic knowledge. In previous studies, however, they were mostly limited to two dimensions: place of articulation and manner of articulation. Moreover, the classification of articulatory features was not well aligned with phone recognition. In this study, we adopt a comprehensive 24-dimensional vector representation and propose a training method in which IPA tokens and their corresponding articulatory features are simultaneously predicted based on CTC alignment. Experiments are conducted by fine-tuning the wav2vec 2.0 XLS-R model over 22 languages, and the results demonstrate significant improvements on average as well as in zero-shot language settings.
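
A minimal PyTorch sketch of a joint objective in this spirit (shapes, vocabulary size, the frame-level articulatory targets, and the loss weighting are assumptions, not the paper's configuration): CTC over IPA tokens combined with a per-frame multi-label head for the 24 articulatory dimensions:

```python
# Hedged multi-task sketch: a shared encoder with (i) CTC over IPA tokens and
# (ii) a per-frame multi-label head for 24-d articulatory features.
# This is not the paper's training procedure; targets here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, B, D = 120, 4, 256          # frames, batch, encoder dim (placeholders)
N_IPA, N_ART = 100, 24         # IPA vocab (incl. blank=0), articulatory dims

encoder_out = torch.randn(T, B, D)                 # stand-in for wav2vec 2.0 output
ipa_head = nn.Linear(D, N_IPA)
art_head = nn.Linear(D, N_ART)

# CTC over IPA tokens
log_probs = F.log_softmax(ipa_head(encoder_out), dim=-1)        # (T, B, N_IPA)
targets = torch.randint(1, N_IPA, (B, 20))                      # placeholder IPA ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)
ctc = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

# Frame-level multi-label articulatory features (targets assumed available)
art_logits = art_head(encoder_out)                               # (T, B, 24)
art_targets = torch.randint(0, 2, (T, B, N_ART)).float()         # placeholder
art = F.binary_cross_entropy_with_logits(art_logits, art_targets)

loss = ctc + 0.3 * art          # assumed weighting
loss.backward()
print(float(ctc), float(art))
```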

Subject: INTERSPEECH.2025 - Language and Multimodal


#18 Instantaneous changes in acoustic signals reflect syllable progression and cross-linguistic syllable variation

Authors: Haley Hsu, Dani Byrd, Khalil Iskarous, Louis Goldstein

While abstract speech representations often exploit sequenced syllable units, how exactly syllables, as abstract cognitive compositional structures, relate to observable patterns in the articulatory and acoustic signals remains opaque. Previous work suggests that oscillatory acoustic properties link such linguistic representations to physical events. We probe this relationship by using phase-locking analyses to test the temporal coordination of changes in spectral energy and amplitude with syllable boundary locations. Results for syllabic nuclei demonstrate that these phase-locking values (PLVs) track syllable progression in both English and Tashlhiyt. Further, cross-language preferences for different syllable nucleus types are found to be reflected in their respective PLVs. Overall, the findings demonstrate a tight coordination between abstract syllable units and quantifiable signal properties and additionally provide novel dynamical grounding for cross-linguistic syllable nucleus preferences.
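
Phase-locking between two signals is commonly quantified as the phase-locking value (PLV); a generic sketch using the analytic signal from the Hilbert transform (illustrative, not the paper's specific analysis pipeline):

```python
# Generic phase-locking value (PLV) between two band-limited signals
# (illustrative; not the paper's specific analysis pipeline).
import numpy as np
from scipy.signal import hilbert

def plv(x: np.ndarray, y: np.ndarray) -> float:
    """PLV = |mean(exp(i*(phase_x - phase_y)))|, in [0, 1]."""
    phase_x = np.angle(hilbert(x))
    phase_y = np.angle(hilbert(y))
    return float(np.abs(np.mean(np.exp(1j * (phase_x - phase_y)))))

fs, dur = 1000, 2.0
t = np.arange(0, dur, 1 / fs)
env_a = np.sin(2 * np.pi * 5 * t)            # e.g. an amplitude envelope at ~5 Hz
env_b = np.sin(2 * np.pi * 5 * t + 0.4)      # phase-shifted but consistently locked
noise = np.random.default_rng(0).normal(size=t.size)
print(plv(env_a, env_b))   # close to 1: consistent phase relation
print(plv(env_a, noise))   # much lower: no consistent phase relation
```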

Subject: INTERSPEECH.2025 - Language and Multimodal


#19 Influence of Proficiency and L2 Experience on Dynamic Spectral Cue Utilization in L2 Vowel Perception and Production

Authors: Linda Bakkouche, Brechtje Post

The acquisition of English vowels as an L2 is complex, yet most studies focus on static measures, with little attention to dynamic spectral cues like Vowel-Inherent Spectral Change (VISC). It remains unclear how language experience and length of residence (LOR) in immersion-rich environments affect perception-production alignment. This study examines Polish learners’ perception and production of /e-æ/ (DRESS-TRAP) and /i-ɪ/ (FLEECE-KIT). These contrasts are challenging due to phonetic similarity and category overlap, as predicted by L2 models. Advanced learners showed greater perceptual accuracy and more consistent production, especially for /i-ɪ/, while /e-æ/ remained difficult. With higher proficiency, learners exhibited greater formant movement (20-40% of vowel duration), but LOR and language experience were not significant predictors. These findings provide insight into phonetic similarity in theoretical models of L2 vowel acquisition.
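
Formant movement (VISC) is typically summarised as the change in F1 and F2 between two proportional time points within the vowel; a hedged sketch using praat-parselmouth (the file, interval, and sampling points are hypothetical, and the study's own measurement settings may differ):

```python
# Hedged sketch: formant movement (VISC) between two proportional time points
# of a vowel interval using praat-parselmouth. The 20%/80% defaults are a common
# convention; the study's parenthetical refers to 20-40% of vowel duration.
import parselmouth

def formant_movement(wav_path: str, start: float, end: float,
                     frac_a: float = 0.2, frac_b: float = 0.8):
    """Return (dF1, dF2) in Hz between two time points in the vowel [start, end]."""
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg(time_step=0.005)
    t_a = start + frac_a * (end - start)
    t_b = start + frac_b * (end - start)
    d_f1 = formant.get_value_at_time(1, t_b) - formant.get_value_at_time(1, t_a)
    d_f2 = formant.get_value_at_time(2, t_b) - formant.get_value_at_time(2, t_a)
    return d_f1, d_f2

# Hypothetical usage, with the vowel interval taken from a TextGrid:
# d_f1, d_f2 = formant_movement("speaker01_trap.wav", start=0.412, end=0.578)
```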

Subject: INTERSPEECH.2025 - Language and Multimodal


#20 A Bayesian Approach to L2 Fluency Ratings by Native and Nonnative Listeners

Authors: Kakeru Yazawa, Takayuki Konishi

This study investigates how native and nonnative listeners evaluate the fluency of Japanese speakers’ English using a Bayesian modeling framework. Data were obtained from 16 listeners with diverse linguistic backgrounds (Cantonese, English, French, German, Japanese, Korean, Mandarin, Polish, Punjabi, and Spanish), who rated English read-speech samples from 180 Japanese speakers in the J-AESOP corpus. Utterance fluency measures included speed (syllable- or segment-based articulation rate), breakdown (pause frequency and duration), and repair (repetitions). Results revealed that nonnative listeners, particularly those with Asian language backgrounds, were generally more lenient and less reliant on speech rate than native listeners, highlighting inter-listener variability that has previously been overlooked. Model comparisons also revealed that segment-based articulation rate captures utterance speed fluency better than the commonly adopted syllable-based articulation rate.

Subject: INTERSPEECH.2025 - Language and Multimodal


#21 Are loan sequences different from foreign sequences? A perception study with Japanese listeners on coronal obstruent – high front vowel sequences

Authors: Silke Hamann, Andrea Alićehajić

Native phonotactics influences speech perception, as numerous studies have shown. The present study tackles the question of whether perceptual performance differs when the sequence involved occurs only in loanwords, compared with a sequence that does not occur at all in the native language. This was tested by comparing the native Japanese sequence of palatal affricate plus /i/ with /ti/ (accepted only in loanwords) and /zi/ (not accepted in Japanese) in an online AX discrimination task with 39 Japanese speakers (21-63 years old), who also answered three questions on the English input they had received. Participants performed significantly better at discriminating the accepted loan sequence /ti/, though discrimination of the foreign sequence /zi/ was also quite high (ranging from 40-100% correct). The results indicate that discriminability is only partly guided by native phonotactics. A potential role of the amount of English input, measured by self-report, could not be confirmed.
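
AX discrimination performance is often summarised with d′ computed from hit and false-alarm rates; a generic sketch (not necessarily the analysis used in this study):

```python
# Generic d-prime for an AX discrimination task (illustrative; not necessarily
# the analysis used in this study).
from scipy.stats import norm

def d_prime(hits: int, misses: int, false_alarms: int, correct_rejections: int) -> float:
    """d' = z(hit rate) - z(false-alarm rate), with a standard 1/(2N) correction
    to avoid infinite z-scores at rates of exactly 0 or 1."""
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = min(max(hits / n_signal, 1 / (2 * n_signal)), 1 - 1 / (2 * n_signal))
    fa_rate = min(max(false_alarms / n_noise, 1 / (2 * n_noise)), 1 - 1 / (2 * n_noise))
    return float(norm.ppf(hit_rate) - norm.ppf(fa_rate))

# e.g. a listener's /ti/ trials: 18 hits, 2 misses, 3 false alarms, 17 correct rejections
print(round(d_prime(18, 2, 3, 17), 2))
```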

Subject: INTERSPEECH.2025 - Language and Multimodal


#22 Relative cue weighting in multilingual stop voicing production

Authors: Le Xuan Chan, Annika Heuser

How does a multilingual speaker produce similar phonological contrasts across the different languages that they speak? Some theories predict crosslinguistic influence while others predict that multilinguals keep separate sound inventories for each language. In this paper, we present crosslinguistic data from early multilingual speakers in Malaysia. We investigate the interaction of a true voicing language (Malay), a variable voicing language (English), and an aspiration language (Mandarin). Using a random forest classification of nine acoustic correlates of stop voicing, we show that 1) all early multilinguals show language-specific productions of stop voicing, and 2) variation driven by dominance can still be observed despite this language-specificity. In addition, we present evidence that closure voicing is a salient correlate alongside aspiration in Malaysian English, and that English is more reliant on secondary correlates than Malay and Mandarin.
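
A hedged sketch of the general approach of classifying voicing category from acoustic correlates with a random forest and inspecting feature importances (the correlate list and synthetic data below are placeholders, not the study's measurements):

```python
# Hedged sketch: random-forest classification of stop voicing from acoustic
# correlates, with feature importances (synthetic placeholder data; the
# correlate list is illustrative, not the study's exact feature set).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

CORRELATES = ["VOT", "closure_voicing", "closure_duration", "f0_onset",
              "F1_onset", "burst_intensity", "preceding_vowel_dur",
              "spectral_tilt", "H1_H2"]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, len(CORRELATES)))       # placeholder measurements
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)) > 0  # placeholder labels

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))

clf.fit(X, y)
for name, imp in sorted(zip(CORRELATES, clf.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name:>20s}  {imp:.3f}")
```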

Subject: INTERSPEECH.2025 - Language and Multimodal


#23 Variability in Intervocalic /t/ and Community Diversity in Australian English

Authors: Hannah White, Joshua Penney, Felicity Cox

The voiceless alveolar stop /t/ exhibits considerable variation in English. Realisations of /t/ vary depending on phonetic context and on social factors such as gender, age and socioeconomic status. Generally, studies on Australian English have focused on the “mainstream” variety, without acknowledging the wide range of linguistic diversity speakers are exposed to in contemporary multicultural Australian society. In the present paper, we explore intervocalic /t/ variation in data collected from 183 speakers as part of the Multicultural Australian English – Voices of Sydney corpus. Results show that, in certain phonetic contexts, exposure to community linguistic diversity can affect intervocalic /t/ realisation, with speakers from more diverse areas showing a preference for a single variant (the tap) compared to those from less diverse areas. We interpret this as an example of the simplification that can occur in diverse communities where there is extreme variability in ambient language exposure.

Subject: INTERSPEECH.2025 - Language and Multimodal


#24 Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

Authors: Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu

Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted-video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.

Subject: INTERSPEECH.2025 - Language and Multimodal


#25 Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

Authors: Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu

Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
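
A toy sketch of entropy-aware gating (illustrative, not the paper's architecture or hyperparameters): the entropy of cross-attention weights over visual tokens can drive a gate that down-weights attended visual features when attention is diffuse, i.e. uncertain:

```python
# Toy sketch of entropy-aware gating of visual features from cross-attention
# weights (illustrative; not the paper's architecture or hyperparameters).
import torch

def entropy_gate(attn: torch.Tensor, tau: float = 2.0, temp: float = 0.5) -> torch.Tensor:
    """attn: (B, Q, K) attention weights over K visual tokens (rows sum to 1).
    Returns a gate in (0, 1) per query; high entropy (diffuse attention) -> gate near 0."""
    entropy = -(attn * (attn + 1e-9).log()).sum(dim=-1)     # (B, Q)
    return torch.sigmoid((tau - entropy) / temp)

B, Q, K, D = 2, 5, 16, 64
attn = torch.softmax(torch.randn(B, Q, K), dim=-1)
visual = torch.randn(B, Q, D)                               # attended visual features
gate = entropy_gate(attn).unsqueeze(-1)                     # (B, Q, 1)
fused_visual = gate * visual                                # suppressed when uncertain
print(gate.squeeze(-1))
```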

Subject: INTERSPEECH.2025 - Language and Multimodal