INTERSPEECH.2025 - Language and Multimodal

Total: 108

#1 Speech transcription from South Tyrolean Dialect to Standard German with Whisper

Authors: Luca Ducceschi, Greta H. Franzini

This study presents the first fine-tuned Whisper model for the automatic translation of South Tyrolean dialectal speech into Standard German text. To address an unmet need for subtitling and translation, we introduce a small corpus of manually annotated and synthetic speech data compiled for this task. Through fine-tuning and hyperparameter optimisation, our model achieves a BLEU score of 86.18, significantly outperforming the baseline. Our findings highlight Whisper's effectiveness in handling dialectal speech, contributing to low-resource language research. The model is already being used in a heritage collaboration for large-scale translation of audiovisual archival material and is also being considered for application in news broadcasting and tourism promotion. Future directions include expanding the training data and extending hyperparameter optimisation to improve the model's performance and generalisation across South Tyrolean dialectal variations.

Subject: INTERSPEECH.2025 - Language and Multimodal


#2 Length Aware Speech Translation for Video Dubbing

Authors: Aswin Shanmugam Subramanian, Harveen Chadha, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li

In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.
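
To make the single-pass idea concrete, here is a minimal, self-contained sketch of length-aware beam search in which one beam group per length tag is advanced in the same decoding loop. A toy next-token scorer stands in for the real LSST model; the tag names, vocabulary, and target lengths are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of length-aware beam search (LABS): hypotheses for each
# predefined length tag are kept in separate beam groups but expanded in a
# single decoding loop. The toy scorer is a stand-in for the real model.
import math
from typing import Dict, List, Tuple

LENGTH_TAGS = ["<short>", "<normal>", "<long>"]
VOCAB = ["hola", "buenos", "dias", "amigo", "<eos>"]

def toy_next_token_logprobs(prefix: List[str], tag: str) -> Dict[str, float]:
    """Stand-in for the model: favors <eos> earlier for <short>, later for <long>."""
    target_len = {"<short>": 2, "<normal>": 3, "<long>": 4}[tag]
    scores = {}
    for tok in VOCAB:
        if tok == "<eos>":
            # ending becomes likely once the prefix reaches the tag's target length
            scores[tok] = 0.0 if len(prefix) >= target_len else -5.0
        else:
            scores[tok] = -1.0 - 0.1 * VOCAB.index(tok)
    logz = math.log(sum(math.exp(s) for s in scores.values()))
    return {tok: s - logz for tok, s in scores.items()}

def labs_decode(beam_size: int = 2, max_len: int = 6) -> Dict[str, Tuple[List[str], float]]:
    # one beam group per length tag, all advanced in the same decoding loop
    beams = {tag: [([], 0.0)] for tag in LENGTH_TAGS}
    finished: Dict[str, Tuple[List[str], float]] = {}
    for _ in range(max_len):
        for tag in LENGTH_TAGS:
            if tag in finished:
                continue
            candidates = []
            for prefix, score in beams[tag]:
                for tok, lp in toy_next_token_logprobs(prefix, tag).items():
                    candidates.append((prefix + [tok], score + lp))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams[tag] = candidates[:beam_size]
            best = beams[tag][0]
            if best[0][-1] == "<eos>":
                finished[tag] = best
    return finished

if __name__ == "__main__":
    for tag, (tokens, score) in labs_decode().items():
        print(tag, " ".join(tokens), f"(logprob {score:.2f})")
```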

Subject: INTERSPEECH.2025 - Language and Multimodal


#3 ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space

Authors: Vishal Kumar, Vinayak Abrol

We present ArticulateX, the first non-autoregressive direct speech-to-speech translation (S2ST) model that operates through an articulatory latent space, offering an efficient alternative to existing cascaded models. It consists of a direct speech-to-articulator encoder, a latent articulator-to-MelSpectrogram mapper, and a vocoder for high-fidelity speech synthesis. By leveraging articulatory representations, which are inherently language-agnostic, our model effectively captures speech dynamics, preserving speaker identity, prosody and expressiveness across languages. Unlike prior autoregressive models, ArticulateX eliminates the need for intermediate text, discrete units and/or complex self-supervised objectives, enabling faster inference, stable training, and improved translation quality. We demonstrate the efficacy of the proposed model in fr-en and de-en speech-to-speech translation on the CVSS dataset, achieving BLEU scores better than or comparable to those of existing models.

Subject: INTERSPEECH.2025 - Language and Multimodal


#4 CMSP-ST: Cross-modal Mixup with Speech Purification for End-to-End Speech Translation

Authors: Jiale Ou, Hongying Zan

End-to-end speech translation (E2E ST) aims to directly convert speech in a source language into text in a target language, and its performance is constrained by the inherent modality gap. Existing methods attempt to align speech and text representations to perform cross-modal mixup at the token level, which overlooks the impact of redundant speech information. In this paper, we propose cross-modal mixup with speech purification for speech translation (CMSP-ST) to address this issue. Specifically, we remove the non-content features from speech through orthogonal projection and extract the purified speech features for cross-modal mixup. Additionally, we employ adversarial training under the Soft Alignment (S-Align) to relax the alignment granularity and improve robustness. Experimental results on the MuST-C En-De, CoVoST-2 Fr-En, and CoVoST-2 De-En benchmarks demonstrate that CMSP-ST effectively improves the speech translation performance of existing cross-modal mixup methods.
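
As an illustration of the purification step, the following numpy sketch removes an (assumed, externally estimated) non-content subspace from frame-level speech features by orthogonal projection; how that subspace is obtained is left abstract here.

```python
# Sketch: remove non-content components from speech features by projecting
# them out. V's columns span an assumed non-content subspace (e.g., speaker
# or style directions); H holds frame-level features.
import numpy as np

def purify(H: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Project each row of H onto the orthogonal complement of span(V).

    H: (num_frames, dim) speech features
    V: (dim, k) basis of the non-content subspace
    """
    # Projection matrix onto span(V): P = V (V^T V)^{-1} V^T
    P = V @ np.linalg.inv(V.T @ V) @ V.T
    # Purified features keep only the component orthogonal to span(V)
    return H - H @ P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.normal(size=(50, 8))          # toy frame-level features
    V = rng.normal(size=(8, 2))           # toy non-content directions
    H_pure = purify(H, V)
    # Purified features are (numerically) orthogonal to the removed subspace
    print(np.abs(H_pure @ V).max())       # ~0
```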

Subject: INTERSPEECH.2025 - Language and Multimodal


#5 End-to-End Speech Translation Guided by Robust Translation Capability of Large Language Model

Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

We present an end-to-end speech translation (ST) model that uses a large language model (LLM) to guide the translation process. Recent advances in LLMs have shown strong contextual understanding and robustness to noisy text, making them beneficial for mitigating automatic speech recognition (ASR) errors. Building on these strengths, we develop an LLM-driven ST model within an encoder-decoder framework, with the encoder handling an auxiliary ASR task and the decoder incorporating an LLM at its front end. Here, the encoder generates an ASR hypothesis that cues the LLM to perform machine translation. The LLM output is then fed into the decoder to yield the final translation. This two-pass design capitalizes on the LLM's robust and accurate translation capabilities, while enabling end-to-end optimization tailored to specific ST tasks. Experimental results on various ST tasks reveal significant performance gains with our LLM integration, and extensive analyses further validate our approach.

Subject: INTERSPEECH.2025 - Language and Multimodal


#6 Empowering Large Language Models for End-to-End Speech Translation Leveraging Synthetic Data

Authors: Yu Pu, Xiaoqian Liu, Guangyu Zhang, Zheng Yan, Wei-Qiang Zhang, Xie Chen

Speech-to-speech translation (S2ST) is a key technology for seamless cross-lingual communication. Traditional cascaded systems, which involve speech recognition, text translation, and speech synthesis, are prone to error propagation and latency. In this work, we present SLAM-TR, an end-to-end speech translation model which directly maps input speech to output speech, eliminating the need for intermediate text representations. By fine-tuning from the large language model Qwen2-0.5B, SLAM-TR achieves superior performance over the cascaded baseline and state-of-the-art open-source models with minimal training time. Additionally, SLAM-TR demonstrates strong generalization, achieving an ASR-BLEU score of 8.20 on the FLEURS benchmark, outperforming both cascaded and open-source systems. Finally, to address the challenge of limited natural speech translation data, we propose SynStard-1000, a 1,000-hour synthetic speech translation dataset.

Subject: INTERSPEECH.2025 - Language and Multimodal


#7 Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

Authors: Gerard I. Gállego, Oriol Pareras, Martí Cortada Garcia, Lucas Takanori, Javier Hernando

We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
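
For intuition, here is a sketch of what a phoneme-augmented chain-of-thought training target could look like when the intermediate steps are serialized as text; the field names and delimiters are illustrative assumptions, not the paper's exact format.

```python
# Sketch of a phoneme-augmented CoT target: the model is trained to emit
# phonemes first, then the transcript, then the translation, so the phoneme
# step can carry cross-lingual transfer for languages without labeled speech.
# Field names and delimiters below are assumptions.
def build_cot_target(phonemes: str, transcript: str, translation: str) -> str:
    return (
        f"<phonemes> {phonemes}\n"
        f"<transcript> {transcript}\n"
        f"<translation> {translation}"
    )

example = build_cot_target(
    phonemes="o l a k e t a l",        # toy phoneme string (ASCII stand-in for IPA)
    transcript="hola, ¿qué tal?",
    translation="hi, how are you?",
)
print(example)
```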

Subject: INTERSPEECH.2025 - Language and Multimodal


#8 Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe

Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which makes it challenging to adapt them to the speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech-text training in this study. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves translation performance, especially for languages with limited training data.
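
A small sketch of the interleaving and scheduling idea, assuming word-aligned speech units and text tokens and a simple linear decay of the text ratio; both choices are illustrative, not the paper's exact schedule.

```python
# Sketch of scheduled interleaved speech-text units: each word has discrete
# speech units and text token(s); early in training a word is represented by
# text with high probability, later almost always by speech units.
# The linear decay schedule is an illustrative assumption.
import random
from typing import List, Tuple

def text_ratio(step: int, total_steps: int, start: float = 0.9, end: float = 0.0) -> float:
    """Linearly decay the probability of representing a word by text."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def interleave(words: List[Tuple[List[str], List[str]]], step: int, total_steps: int,
               rng: random.Random) -> List[str]:
    """words: list of (speech_units, text_tokens) per word, in order."""
    p_text = text_ratio(step, total_steps)
    seq: List[str] = []
    for speech_units, text_tokens in words:
        seq.extend(text_tokens if rng.random() < p_text else speech_units)
    return seq

if __name__ == "__main__":
    rng = random.Random(0)
    words = [(["<u12>", "<u7>"], ["hello"]), (["<u3>", "<u44>", "<u9>"], ["world"])]
    for step in (0, 5000, 10000):
        print(step, interleave(words, step, total_steps=10000, rng=rng))
```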

Subject: INTERSPEECH.2025 - Language and Multimodal


#9 End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data

Authors: Aishwarya Pothula, Bhavana Akkiraju, Srihari Bandarupalli, Charan D, Santosh Kesiraju, Anil Kumar Vuppala

The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
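
A minimal sketch of threshold-based bitext mining over precomputed sentence embeddings (the multilingual encoder itself is left abstract); production mining pipelines typically add margin-based scoring, but the cosine-threshold core is the same idea.

```python
# Sketch of bitext mining over precomputed sentence embeddings: each source
# sentence is paired with its most similar target sentence if their cosine
# similarity exceeds a threshold. The encoder and threshold are assumptions.
import numpy as np

def mine_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.8):
    """src_emb: (n_src, d), tgt_emb: (n_tgt, d). Returns (i, j, score) tuples."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                      # cosine similarity matrix
    best = sims.argmax(axis=1)              # best target for each source
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(5, 16))
    tgt = np.vstack([src[:3] + 0.05 * rng.normal(size=(3, 16)),  # near-duplicates
                     rng.normal(size=(4, 16))])
    # typically recovers the three near-duplicate pairs
    print(mine_pairs(src, tgt, threshold=0.8))
```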

Subject: INTERSPEECH.2025 - Language and Multimodal


#10 Self-Improvement for Audio Large Language Model using Unlabeled Speech

Authors: Shaowen Wang, Xinyuan Chen, Yao Xu

Recent audio LLMs have emerged rapidly, demonstrating strong generalization across various speech tasks. However, given the inherent complexity of speech signals, these models inevitably suffer from performance degradation in specific target domains. To address this, we focus on enhancing audio LLMs in target domains without any labeled data. We propose a self-improvement method called SI-SDA, leveraging the information embedded in large-model decoding to evaluate the quality of generated pseudo labels and then perform domain adaptation based on reinforcement learning optimization. Experimental results show that our method consistently and significantly improves audio LLM performance, outperforming existing baselines in WER and BLEU across multiple public datasets of automatic speech recognition (ASR), spoken question-answering (SQA), and speech-to-text translation (S2TT). Furthermore, our approach exhibits high data efficiency, underscoring its potential for real-world deployment.
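
One ingredient of such self-improvement can be sketched as confidence-based pseudo-label filtering, here using the mean token log-probability from decoding; the specific measure and threshold are illustrative assumptions, and the paper additionally performs reinforcement-learning-based adaptation.

```python
# Sketch: score pseudo labels by the mean token log-probability of the
# hypothesis and keep only confident ones for unsupervised domain adaptation.
# The confidence measure and threshold are illustrative assumptions.
import math
from typing import List, Tuple

def mean_logprob(token_logprobs: List[float]) -> float:
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def filter_pseudo_labels(hyps: List[Tuple[str, List[float]]],
                         threshold: float = math.log(0.6)) -> List[str]:
    """hyps: (hypothesis_text, per-token log-probs). Keep confident hypotheses."""
    return [text for text, lps in hyps if mean_logprob(lps) >= threshold]

if __name__ == "__main__":
    hyps = [
        ("turn on the lights", [math.log(p) for p in (0.9, 0.8, 0.95, 0.9)]),
        ("turn of the lice",   [math.log(p) for p in (0.9, 0.3, 0.4, 0.2)]),
    ]
    print(filter_pseudo_labels(hyps))   # keeps only the confident hypothesis
```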

Subject: INTERSPEECH.2025 - Language and Multimodal


#11 Evaluation of Three Automatic Alignment Tools for the Processing of Non-native French

Authors: Qian Zhou, Mathilde Hutin

The production of non-native speech is known to display "cross-language phonetic interference", which makes such speech difficult to align and label automatically. Automatic phonetic alignment refers to an automated process whereby software synchronizes speech with its transcription, usually at the phone and word levels. This method has proven useful and reliable for native speech, yet this reliability usually does not extend to non-native speech. This paper tests three major automatic aligners (WebMAUS, MFA and SPPAS) on non-native French uttered by two native speakers of Chinese by comparing them with two manual segmentations. This paper's goal is to offer non-computer linguists a preliminary investigation on which to rely when choosing a tool for their studies in non-native phonetics or language didactics. Results show that the best-performing tool for labeling is SPPAS, while WebMAUS performs best overall for both word- and phone-level segmentation and MFA performs worst.
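
For readers unfamiliar with how aligner accuracy is typically quantified, the sketch below computes mean absolute boundary displacement and the share of boundaries within a 20 ms tolerance against a manual reference; the paper's exact evaluation protocol may differ.

```python
# Sketch of a standard boundary-accuracy measure for comparing an automatic
# segmentation against a manual one: mean absolute boundary displacement and
# the percentage of boundaries within a 20 ms tolerance. The tolerance value
# and one-to-one boundary pairing are simplifying assumptions.
from typing import List, Tuple

def boundary_agreement(auto: List[float], manual: List[float],
                       tol: float = 0.020) -> Tuple[float, float]:
    """auto/manual: boundary times in seconds, assumed paired one-to-one."""
    assert len(auto) == len(manual), "expects paired boundaries"
    diffs = [abs(a - m) for a, m in zip(auto, manual)]
    mean_dev = sum(diffs) / len(diffs)
    within_tol = sum(d <= tol for d in diffs) / len(diffs)
    return mean_dev, within_tol

if __name__ == "__main__":
    manual = [0.12, 0.31, 0.48, 0.70]
    auto   = [0.13, 0.30, 0.53, 0.71]
    dev, pct = boundary_agreement(auto, manual)
    print(f"mean deviation: {dev * 1000:.1f} ms, within 20 ms: {pct:.0%}")
```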

Subject: INTERSPEECH.2025 - Language and Multimodal


#12 CrossPhon: An Auto Phone Mapping Tool to Streamline Cross-language Modeling for Phone Alignment of Low-resource Languages

Authors: Hongchen Wu, Yixin Gu

Phone alignment matches spoken sounds with text, streamlining speech dataset creation and analysis. However, most trained aligners focus on Indo-European languages, leaving under-resourced languages unsupported. Developing new aligners for these languages requires expertise and large datasets, which are often scarce. Cross-language phone alignment offers a solution, using aligners trained in one language to align speech in another, but it traditionally relies on expert-crafted phone mappings. Our tool, CrossPhon, automates this process, making cross-language phone alignment more efficient. In tests on 14 languages from 7 families, CrossPhon achieved agreement rates of 78.95% to 97.77% compared to human expert mappings and delivered competitive performance in cross-language phone alignment. CrossPhon provides an efficient, reliable solution for generating cross-language phone alignments for under-resourced languages, helping to bridge the digital divide and enabling these languages to be studied more efficiently.
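
A toy sketch of the general idea behind automatic phone mapping: each phone of an unsupported language is mapped to the most similar phone in the aligner's inventory by distance over articulatory feature sets. The tiny feature sets below are illustrative stand-ins, not CrossPhon's actual representation.

```python
# Toy sketch of automatic cross-language phone mapping: map each phone of an
# unsupported language to the most similar phone in the aligner's inventory,
# using Jaccard distance over articulatory feature sets. The feature sets
# here are illustrative stand-ins, not CrossPhon's actual features.
from typing import Dict, Set

def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    return 1.0 - len(a & b) / len(a | b)

def map_phones(target: Dict[str, Set[str]], source: Dict[str, Set[str]]) -> Dict[str, str]:
    return {t: min(source, key=lambda s: jaccard_distance(feats, source[s]))
            for t, feats in target.items()}

if __name__ == "__main__":
    source_inventory = {
        "p": {"bilabial", "plosive", "voiceless"},
        "b": {"bilabial", "plosive", "voiced"},
        "f": {"labiodental", "fricative", "voiceless"},
    }
    # a target-language phone absent from the source inventory
    target_inventory = {"ph": {"bilabial", "plosive", "voiceless", "aspirated"}}
    print(map_phones(target_inventory, source_inventory))   # {'ph': 'p'}
```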

Subject: INTERSPEECH.2025 - Language and Multimodal


#13 Multi-lingual and Zero-Shot Speech Recognition by Incorporating Classification of Language-Independent Articulatory Features

Authors: Ryo Magoshi, Shinsuke Sakai, Jaeyoung Lee, Tatsuya Kawahara

We address multi-lingual speech recognition, including unknown or zero-shot languages, based on the International Phonetic Alphabet (IPA) and articulatory features. Articulatory features are language-independent representations for IPA based on phonetic knowledge. In previous studies, however, they were mostly limited to the two dimensions of place of articulation and manner of articulation. Moreover, the classification of articulatory features was not well aligned with phone recognition. In this study, we adopt a comprehensive 24-dimensional vector representation, and propose a training method in which IPA tokens and their corresponding articulatory features are simultaneously predicted based on CTC alignment. Experiments are conducted by fine-tuning the wav2vec 2.0 XLS-R model over 22 languages, and the results demonstrate significant improvements on average as well as in zero-shot language settings.
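
A compact PyTorch sketch of the multi-task setup: a shared encoder with two CTC heads, one over IPA tokens and one over articulatory labels, trained with a weighted sum of losses. Collapsing the 24-dimensional articulatory representation into a single label sequence for the second head is a simplification for illustration, not the paper's exact formulation.

```python
# Minimal PyTorch sketch of joint CTC training: a shared encoder feeds two
# linear heads, one predicting IPA tokens and one predicting articulatory
# labels, and the two CTC losses are combined with a weight.
import torch
import torch.nn as nn

class JointCTCModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, ipa_vocab=100, artic_vocab=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.ipa_head = nn.Linear(hidden, ipa_vocab)      # blank = index 0
        self.artic_head = nn.Linear(hidden, artic_vocab)  # blank = index 0

    def forward(self, x):
        h, _ = self.encoder(x)
        return self.ipa_head(h), self.artic_head(h)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def joint_loss(model, x, x_lens, ipa_tgt, ipa_lens, artic_tgt, artic_lens, lam=0.3):
    ipa_logits, artic_logits = model(x)
    # CTCLoss expects (T, N, C) log-probabilities
    ipa_lp = ipa_logits.log_softmax(-1).transpose(0, 1)
    artic_lp = artic_logits.log_softmax(-1).transpose(0, 1)
    loss_ipa = ctc(ipa_lp, ipa_tgt, x_lens, ipa_lens)
    loss_artic = ctc(artic_lp, artic_tgt, x_lens, artic_lens)
    return loss_ipa + lam * loss_artic

if __name__ == "__main__":
    model = JointCTCModel()
    x = torch.randn(2, 120, 80)                      # (batch, frames, features)
    x_lens = torch.tensor([120, 100])
    ipa_tgt = torch.randint(1, 100, (2, 20)); ipa_lens = torch.tensor([20, 15])
    artic_tgt = torch.randint(1, 30, (2, 20)); artic_lens = torch.tensor([20, 15])
    print(joint_loss(model, x, x_lens, ipa_tgt, ipa_lens, artic_tgt, artic_lens))
```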

Subject: INTERSPEECH.2025 - Language and Multimodal


#14 Instantaneous changes in acoustic signals reflect syllable progression and cross-linguistic syllable variation

Authors: Haley Hsu, Dani Byrd, Khalil Iskarous, Louis Goldstein

While abstract speech representations often exploit sequenced syllable units, how exactly syllables as abstract cognitive compositional structure relate to observable patterns in the articulatory and acoustic signals remains opaque. Previous work suggests oscillatory acoustic properties link such linguistic representations to physical events. We probe this relationship by testing temporal coordination between changes in spectral energy and amplitude with syllable boundary locations through phase-locking analyses. Results for syllabic nuclei demonstrate these phase-locking values (PLVs) track syllable progression in both English and Tashlhiyt. Further, cross-language preferences for different syllable nucleus types are found to be reflected in their respective PLVs. Overall, the findings demonstrate a tight coordination between abstract syllable units and quantifiable signal properties and additionally provide novel dynamical grounding for cross-linguistic syllable nucleus preferences.
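
The phase-locking value itself can be computed in a few lines: instantaneous phases via the Hilbert transform, then the magnitude of the mean phase-difference phasor. The toy sinusoids below stand in for the amplitude and spectral-change signals analysed in the paper.

```python
# Sketch of a phase-locking value (PLV) computation: extract instantaneous
# phases with the Hilbert transform and take the magnitude of the average
# unit phasor of the phase differences.
import numpy as np
from scipy.signal import hilbert

def plv(x: np.ndarray, y: np.ndarray) -> float:
    phase_x = np.angle(hilbert(x))
    phase_y = np.angle(hilbert(y))
    return float(np.abs(np.mean(np.exp(1j * (phase_x - phase_y)))))

if __name__ == "__main__":
    t = np.linspace(0, 2, 2000, endpoint=False)
    reference = np.sin(2 * np.pi * 4 * t)
    locked = np.sin(2 * np.pi * 4 * t + 0.5)        # same 4 Hz rhythm, fixed offset
    unlocked = np.sin(2 * np.pi * 7.3 * t)          # different rhythm
    print("locked PLV:  ", round(plv(locked, reference), 3))   # close to 1
    print("unlocked PLV:", round(plv(unlocked, reference), 3)) # much lower
```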

Subject: INTERSPEECH.2025 - Language and Multimodal


#15 Influence of Proficiency and L2 Experience on Dynamic Spectral Cue Utilization in L2 Vowel Perception and Production

Authors: Linda Bakkouche, Brechtje Post

The acquisition of English vowels as an L2 is complex, yet most studies focus on static measures, with little attention to dynamic spectral cues like Vowel-Inherent Spectral Change (VISC). It remains unclear how language experience and length of residence (LOR) in immersion-rich environments affect perception-production alignment. This study examines Polish learners’ perception and production of /e-æ/ (DRESS-TRAP) and /i-I/ (FLEECE-KIT). These contrasts are challenging due to phonetic similarity and category overlap as predicted by L2 models. Advanced learners showed greater perceptual accuracy and more consistent production, especially for /i-I/, while /e-æ/ remained difficult. With higher proficiency, learners exhibited greater formant movement (20-40% of vowel duration), but LOR and language experience were not significant predictors. These findings provide insight into phonetic similarity in theoretical models of L2 vowel acquisition.

Subject: INTERSPEECH.2025 - Language and Multimodal


#16 A Bayesian Approach to L2 Fluency Ratings by Native and Nonnative Listeners

Authors: Kakeru Yazawa, Takayuki Konishi

This study investigates how native and nonnative listeners evaluate the fluency of Japanese speakers' English using a Bayesian modeling framework. Data were obtained from 16 listeners with diverse linguistic backgrounds (Cantonese, English, French, German, Japanese, Korean, Mandarin, Polish, Punjabi, and Spanish), who rated English read-speech samples from 180 Japanese speakers in the J-AESOP corpus. Utterance fluency measures included speed (syllable- or segment-based articulation rate), breakdown (pause frequency and duration), and repair (repetitions). Results revealed that nonnative listeners, particularly those with Asian language backgrounds, were generally more lenient and less reliant on speech rate than native listeners, highlighting inter-listener variability previously overlooked. Model comparisons also revealed that segment-based articulation rate better captures utterance speed fluency than the commonly adopted syllable-based articulation rate.
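
A minimal PyMC sketch of the kind of hierarchical model implied here: ratings regressed on articulation rate with group-specific intercepts (leniency) and slopes (reliance on speed) for native versus nonnative listeners. The synthetic data, priors, and Gaussian likelihood are illustrative assumptions, not the study's actual specification.

```python
# Minimal PyMC sketch of a hierarchical rating model: each listener group
# (native vs. nonnative) gets its own intercept (leniency) and its own slope
# on articulation rate (reliance on speed). Synthetic data and priors are
# illustrative assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)                     # 0 = native, 1 = nonnative listener
artic_rate = rng.normal(0, 1, n)                  # standardized articulation rate
# synthetic ratings: nonnative listeners more lenient, less rate-reliant
rating = 3.0 + 0.5 * group + (0.8 - 0.5 * group) * artic_rate + rng.normal(0, 0.5, n)

with pm.Model() as model:
    intercept = pm.Normal("intercept", mu=3.0, sigma=2.0, shape=2)
    slope = pm.Normal("slope", mu=0.0, sigma=1.0, shape=2)
    sigma = pm.HalfNormal("sigma", 1.0)
    mu = intercept[group] + slope[group] * artic_rate
    pm.Normal("obs", mu=mu, sigma=sigma, observed=rating)
    idata = pm.sample(500, tune=500, chains=2, progressbar=False, random_seed=0)

# posterior mean slope per listener group (rate-reliance)
print(idata.posterior["slope"].mean(dim=("chain", "draw")).values)
```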

Subject: INTERSPEECH.2025 - Language and Multimodal


#17 Are loan sequences different from foreign sequences? A perception study with Japanese listeners on coronal obstruent – high front vowel sequences

Authors: Silke Hamann, Andrea Alićehajić

Native phonotactics influences speech perception, as numerous studies have shown. The present study tackles the question of whether there is a difference in perceptual performance if the involved sequence occurs only in loanwords, compared to a sequence that does not occur at all in the native language. This was tested with the native Japanese sequences of palatal affricate plus /i/, compared to /ti/ (accepted only in loanwords) versus /zi/ (not accepted in Japanese), in an online AX discrimination task with 39 Japanese speakers (21-63 years old), who also had to answer three questions on their received English input. Participants performed significantly better at discriminating the accepted loan sequence /ti/, though discrimination of the foreign sequence /zi/ was also quite high (ranging from 40-100% correct). The results indicate that discriminability is only partly guided by native phonotactics. A potential role of the amount of English input, as measured by self-report, could not be attested.

Subject: INTERSPEECH.2025 - Language and Multimodal


#18 Relative cue weighting in multilingual stop voicing production

Authors: Le Xuan Chan, Annika Heuser

How does a multilingual speaker produce similar phonological contrasts across the different languages that they speak? Some theories predict crosslinguistic influence while others predict that multilinguals keep separate sound inventories for each language. In this paper, we present crosslinguistic data from early multilingual speakers in Malaysia. We investigate the interaction of a true voicing language (Malay), a variable voicing language (English), and an aspiration language (Mandarin). Using a random forest classification of nine acoustic correlates of stop voicing, we show that 1) all early multilinguals show language-specific productions of stop voicing, and 2) variation driven by dominance can still be observed despite this language-specificity. In addition, we present evidence that closure voicing is a salient correlate alongside aspiration in Malaysian English, and that English is more reliant on secondary correlates than Malay and Mandarin.
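
The analysis style can be sketched briefly with scikit-learn: a random forest classifies voiced versus voiceless stops from acoustic correlates, and feature importances serve as a proxy for relative cue weighting. The synthetic features below stand in for the paper's nine measured correlates.

```python
# Sketch of the random-forest analysis style described above: classify stop
# voicing from acoustic correlates and inspect feature importances as a
# proxy for relative cue weighting. Synthetic data stands in for the paper's
# nine measured correlates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
voiced = rng.integers(0, 2, n)                          # 0 = voiceless, 1 = voiced
features = np.column_stack([
    rng.normal(60 - 50 * voiced, 15),                   # VOT (ms): shorter when voiced
    rng.normal(10 + 40 * voiced, 15),                   # closure voicing (%): higher when voiced
    rng.normal(0, 1, n),                                # an uninformative correlate
])
feature_names = ["VOT", "closure_voicing", "other_cue"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(features, voiced)
for name, imp in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```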

Subject: INTERSPEECH.2025 - Language and Multimodal


#19 Variability in Intervocalic /t/ and Community Diversity in Australian English

Authors: Hannah White, Joshua Penney, Felicity Cox

The voiceless alveolar stop /t/ exhibits considerable variation in English. Realisations of /t/ vary depending on phonetic context and on social factors such as gender, age and socioeconomic status. Generally, studies on Australian English have focused on the “mainstream” variety, without acknowledging the wide range of linguistic diversity speakers are exposed to in contemporary multicultural Australian society. In the present paper, we explore intervocalic /t/ variation in data collected from 183 speakers as part of the Multicultural Australian English – Voices of Sydney corpus. Results show that, in certain phonetic contexts, exposure to community linguistic diversity can affect intervocalic /t/ realisation, with speakers from more diverse areas showing a preference for a single variant (the tap) compared to those from less diverse areas. We interpret this as an example of simplification that can occur in diverse communities where there is extreme variability in ambient language exposure.

Subject: INTERSPEECH.2025 - Language and Multimodal


#20 Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model

Authors: Yong Ren, Chenxing Li, Le Xu, Hao Gu, Duzhen Zhang, Yujie Chen, Manjie Xu, Ruibo Fu, Shan Yang, Dong Yu

Humans can intuitively infer sounds from silent videos, but whether multimodal large language models can perform modal-mismatch reasoning without accessing target modalities remains relatively unexplored. Current text-assisted video-to-audio (VT2A) methods excel in video foley tasks but struggle to acquire audio descriptions during inference. We introduce the task of Reasoning Audio Descriptions from Silent Videos (SVAD) to address this challenge and investigate vision-language models' (VLMs) capabilities on this task. To further enhance the VLMs' reasoning capacity for the SVAD task, we construct a CoT-AudioCaps dataset and propose a Chain-of-Thought-based supervised fine-tuning strategy. Experiments on SVAD and subsequent VT2A tasks demonstrate our method's effectiveness in two key aspects: significantly improving VLMs' modal-mismatch reasoning for SVAD and effectively addressing the challenge of acquiring audio descriptions during VT2A inference.

Subject: INTERSPEECH.2025 - Language and Multimodal


#21 Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

Authors: Le Xu, Chenxing Li, Yong Ren, Yujie Chen, Yu Gu, Ruibo Fu, Shan Yang, Dong Yu

Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
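
The gating idea can be sketched as follows: compute the entropy of the audio-to-visual cross-attention distribution and shrink the visual contribution when attention is diffuse. The linear mapping from entropy to gate value is an illustrative assumption, not the paper's exact design.

```python
# Sketch of entropy-aware gating of visual features: high entropy in the
# cross-attention weights (diffuse, uncertain attention) shrinks the gate,
# suppressing the visual contribution during fusion.
import torch

def entropy_gate(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (batch, queries, keys) attention weights summing to 1 over keys."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)            # (batch, queries)
    max_entropy = torch.log(torch.tensor(attn.shape[-1], dtype=attn.dtype))
    gate = 1.0 - entropy / max_entropy                            # 1 = peaked, 0 = uniform
    return gate.unsqueeze(-1)                                     # (batch, queries, 1)

if __name__ == "__main__":
    peaked = torch.tensor([[[0.97, 0.01, 0.01, 0.01]]])
    uniform = torch.tensor([[[0.25, 0.25, 0.25, 0.25]]])
    visual = torch.randn(1, 1, 8)                                 # attended visual features
    print("peaked gate :", entropy_gate(peaked).item())           # near 1
    print("uniform gate:", entropy_gate(uniform).item())          # near 0
    fused_visual = entropy_gate(peaked) * visual                  # gated contribution
```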

Subject: INTERSPEECH.2025 - Language and Multimodal


#22 ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

Authors: Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong

Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show that the automatically collected dataset enables a strong baseline, achieving performance competitive with robust ASR models in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

Subject: INTERSPEECH.2025 - Language and Multimodal


#23 GALAXY: A Large-Scale Open-Domain Dataset for Multimodal Learning

Authors: Yihan Wu, Yichen Lu, Yijing Chen, Jiaqi Song, William Chen, Ruihua Song, Shinji Watanabe

Humans naturally use multimodal information, with vision, speech, and text working together to understand the world and solve problems. For artificial intelligence to achieve human-level capability, it must process multimodal information in a similar manner. However, there is a lack of large-scale open-domain datasets that support all three modalities—vision, speech, and text—with high-quality speech transcriptions. To address this gap, we introduce GALAXY, a large-scale, open-domain dataset designed for multimodal learning, containing 8,270 hours of videos, speech, and transcriptions across 16 diverse domains. We describe the data creation pipeline and provide detailed statistics and analyses of the dataset. Using multimodal speech recognition as a case study, we validate GALAXY’s effectiveness and evaluate baseline models’ performance across different data volumes and domains. The results highlight GALAXY’s potential as a valuable resource for advancing multimodal understanding.

Subject: INTERSPEECH.2025 - Language and Multimodal


#24 FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Authors: Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, Eng Siong Chng

Full-duplex spoken dialogue systems (FDSDS) enable more natural human–machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for full-duplex scenarios, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS’s ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.

Subject: INTERSPEECH.2025 - Language and Multimodal


#25 PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs

Authors: Sho Inoue, Shuai Wang, Haizhou Li

Despite significant progress in neural spoken dialog systems, personality-aware conversation agents, which adapt their behavior based on personality, remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.

Subject: INTERSPEECH.2025 - Language and Multimodal