INTERSPEECH.2025 - Others

| Total: 166

#1 Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation

Authors: Steffen Freisinger, Philipp Seeberger, Thomas Ranzenberger, Tobias Bocklet, Korbinian Riedhammer

Segmenting speech transcripts into thematic sections benefits both downstream processing and users who depend on written text for accessibility. We introduce a novel approach to hierarchical topic segmentation in transcripts, generating multi-level tables of contents that capture both topic and subtopic boundaries. We compare zero-shot prompting and LoRA fine-tuning on large language models, while also exploring the integration of high-level speech pause features. Evaluations on English meeting recordings and multilingual lecture transcripts (Portuguese, German) show significant improvements over established topic segmentation baselines. Additionally, we adapt a common evaluation measure for multi-level segmentation, taking into account all hierarchical levels within one metric.
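The abstract compares zero-shot prompting with LoRA fine-tuning of large language models. As a rough illustration of what a LoRA setup looks like, here is a minimal sketch using Hugging Face PEFT; the base model name, rank, and target modules are placeholders and not the paper's actual settings.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative only;
# the base model, rank, and target modules are assumptions, not the paper's).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "your-base-llm"  # placeholder base LLM identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trained

# The adapted model would then be fine-tuned on (transcript, table-of-contents)
# pairs, e.g. with transformers.Trainer, while the base weights stay frozen.
```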

Subject: INTERSPEECH.2025 - Others


#2 Pick and Summarize: Integrating Extractive and Abstractive Speech Summarization

Authors: Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Ryo Fukuda, William Chen, Shinji Watanabe

Speech summarization condenses long speech while preserving essential content. Recently, there has been growing interest in end-to-end (E2E) abstractive speech summarization, which directly generates a text summary from spoken input. However, abstractive summarization of lengthy speech sequences presents challenges, such as identifying key information within very long speech. In this paper, we hypothesize that first addressing the simpler task of extractive summarization can help address these long-sequence challenges and improve overall summarization performance. To this end, we introduce an extractive-abstractive summarization model that exploits auxiliary information from extractive summaries generated directly from raw speech input to enhance abstractive speech summarization. Experiments on a web presentation corpus demonstrate consistent gains with our proposed method, achieving up to 1.4-point gains in METEOR score over a strong abstractive summarization baseline.

Subject: INTERSPEECH.2025 - Others


#3 Beyond Similarity Scoring: Detecting Entailment and Contradiction in Multilingual and Multimodal Contexts

Authors: Othman Istaiteh, Salima Mdhaffar, Yannick Estève

Natural Language Inference (NLI) determines whether a hypothesis entails, contradicts, or is neutral with respect to a premise. While text-based NLI is well-studied, its multimodal and multilingual extension remains underexplored. This paper introduces a multilingual, multimodal NLI framework classifying entailment, contradiction, and neutrality across text-text, text-speech, speech-text, and speech-speech pairs in same- and cross-lingual settings. A key motivation is improving translation assessment, where similarity-based approaches may miss contradictions. The framework complements evaluation methods and helps identify inconsistencies by detecting entailment and contradiction alongside semantic similarity. It also extends text-based datasets with speech-text and speech-speech pairs for multilingual multimodal inference. Experiments show the model outperforms BLASER in distinguishing entailment from non-entailment, achieving F1 gains of 0.19 in speech-speech and 0.13 in speech-text settings.

Subject: INTERSPEECH.2025 - Others


#4 Comparison-Based Automatic Evaluation for Meeting Summarization

Authors: Ziwei Gong, Lin Ai, Harsh Deshpande, Alexander Johnson, Emmy Phung, Zehui Wu, Ahmad Emami, Julia Hirschberg

Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex tasks like long-context summarization and dialogue-based meeting summarization. In this paper, we introduce CREAM (Comparison-based Reference-free Elo-ranked Automatic evaluation for Meeting summarization), a novel framework that addresses the unique challenges of evaluating meeting summaries. CREAM leverages a combination of chain-of-thought reasoning and key facts alignment to assess the conciseness and completeness of model-generated summaries without requiring a reference. By employing an Elo ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
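CREAM's exact update rule, K-factor, and comparison prompts are not given in the abstract; the snippet below is just the generic Elo update that underlies this kind of pairwise ranking, for intuition.

```python
# Sketch of the Elo-style ranking idea behind comparison-based evaluation
# (illustrative; CREAM's actual update rule and K-factor are not specified here).
def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Update two systems' ratings after one pairwise summary comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: system A's summary is judged better than system B's on one
# comparison, so A gains rating and B loses the same amount.
r_a, r_b = 1500.0, 1500.0
r_a, r_b = elo_update(r_a, r_b, a_wins=True)
print(round(r_a), round(r_b))  # 1516 1484
```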

Subject: INTERSPEECH.2025 - Others


#5 Modeling Formant Dynamics in Mandarin /ai/: Effects of Speech Style and Speech Rate

Authors: Yunzhuo Xiang, Jingyi Sun

This study examines the relationship between speech clarity and speech rate by modeling F1/F2 variation in the Mandarin diphthong /ai/ across different durations and speech styles. The corpus includes 20 hours of conversational and 6 hours of read speech. Vowel durations were manually verified, and formant values were extracted using auto-correlation. Generalized additive mixed models (GAMMs) were used to examine the interaction between duration and formants. Results show that both read and slow speech exhibit higher F1 onset and F2 offset for /ai/. However, read speech has a larger formant frequency range, and even at the shortest (20 ms) or longest (200 ms) durations, diphthongs in read speech remained more clearly articulated than those in conversational speech of the same length. This suggests that speech clarity and speech rate may be distinct dimensions rather than interchangeable factors.
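GAMMs of this kind are most commonly fit with R's mgcv; as a rough Python analogue, the sketch below fits a tensor-product smooth of normalized time and duration with pygam on toy data. It omits the random-effects structure a full GAMM would include, and all values are made up.

```python
# Illustrative GAM sketch of formant trajectories as a smooth function of
# normalized time and vowel duration (pygam; the real study uses GAMMs with
# random effects, which this simplified sketch omits).
import numpy as np
from pygam import LinearGAM, te

rng = np.random.default_rng(0)
n = 2000
time_norm = rng.uniform(0, 1, n)        # position within the diphthong (0-1)
duration_ms = rng.uniform(20, 200, n)   # vowel duration in ms
# toy F2 values: the trajectory rises over time, and rises less for short vowels
f2 = 1200 + 800 * time_norm * (duration_ms / 200) + rng.normal(0, 50, n)

X = np.column_stack([time_norm, duration_ms])
gam = LinearGAM(te(0, 1)).fit(X, f2)    # tensor-product smooth: time x duration
gam.summary()
```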

Subject: INTERSPEECH.2025 - Others


#6 Representation of Perceived Prosodic Similarity of Conversational Feedback

Authors: Livia Qian, Carol Figueroa, Gabriel Skantze

Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
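The abstract does not specify the contrastive objective used to align representations with the triadic judgements; a triplet-margin formulation is one common choice and is shown below purely as an assumed sketch, where the two tokens judged most similar in a triad form the anchor-positive pair and the odd one out is the negative.

```python
# Minimal sketch of aligning speech representations with triadic similarity
# judgements via a triplet objective (illustrative; the paper's projection
# architecture and loss details are not reproduced here).
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Condense a pretrained speech representation into a small prosody space."""
    def __init__(self, in_dim=768, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim)
        )

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

proj = Projector()
triplet = nn.TripletMarginLoss(margin=0.2)

# anchor/positive: the pair judged most similar in a triad; negative: the odd one out
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
loss = triplet(proj(anchor), proj(positive), proj(negative))
loss.backward()
```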

Subject: INTERSPEECH.2025 - Others


#7 Prolongation in Romanian

Authors: Oana Niculescu, Monica Vasileanu

In this study we investigate segmental prolongation (PR) as a form of disfluent hesitation in a corpus of spontaneous Romanian monologues. A total of 3541 PRs were extracted from 216 minutes of speech from 4 native speakers (2 female, 2 male). In line with the methodology employed by previous corpus studies on PR, our data reveal that prolonged segments have an average duration of 316 ms (SD = 130), surfacing at a frequency of 11.3 per 100 words and following a 17–7–76% position distribution. In Romanian, all segments can undergo PR, with vowels being the preferred target (57.2%), followed by fricatives (12.8%), nasals (11.8%), plosives (10.3%), diphthongs (5.5%), affricates (1.6%) and laterals (0.8%). The vast majority of PRs surface in monosyllabic words (59%). Function words are prolonged in 57% of the cases. By including data from a lesser-studied European language, this paper broadens our understanding of the formal regularities of PR in a cross-linguistic setting.

Subject: INTERSPEECH.2025 - Others


#8 Speech Reduction in French: The Relationship Between Vowel Space and Articulation Dynamics

Authors: Kübra Bodur, Corinne Fredouille, Christine Meunier

Reduction is an inherent characteristic of conversations, reflecting the dynamic adaptability of language. This study examined the link between vowel space and non-lexicalized reductions (NLR) in spontaneous French speech. The hypothesis posited that speakers with smaller vowel spaces - indicating centralized, less distinct vowels - would produce more NLR, defined as temporally compressed speech zones. Results showed that smaller vowel space (pVSA) predicted greater NLR only when articulation rate was considered, highlighting an interaction between spatial and temporal speech dynamics. Articulation rate was the strongest predictor, supporting theories that link faster speech to reduced articulatory precision. Vowel Distinctiveness Index (VDI) did not significantly predict NLR, suggesting it reflects broader systemic patterns rather than local reductions. These findings emphasized the interplay of temporal dynamics and individual articulation strategies in shaping reduction in speech.

Subject: INTERSPEECH.2025 - Others


#9 Stress in Spoken and Whistled Greek

Authors: Andre Batchelder-Schwab, Vasileios Michos, Jonathan Barnes

This paper presents experimental results testing vowels of a register of whistled Greek called Sfyria. We used a frame sentence and minimal pairs. Each participant performed the experiment in whistled Greek, then again in spoken Greek. The results suggest that all five vowel qualities of Greek remain distinct in the whistled register (/i/ /e/ /o/ through F0 alone; /a/ /u/ through intensity). While acoustic differences were found between stressed and unstressed versions of all spoken vowel qualities (chiefly in intensity), no correlates of a stress distinction were found for /i/ in the whistled register. This might be a ceiling effect from the generally high intensity of front vowels in the whistled register, and we tentatively suggest that stressed /i/ in Sfyria is not produced differently from its unstressed counterpart.

Subject: INTERSPEECH.2025 - Others


#10 Neutral Tone Variation in Beijing Mandarin: Is Neutral Tone Toneless?

Authors: Xiao Dong, Fengming Liu, Chien-Jer Lin, Monica Nesbitt, Shuju Shi

Neutral tone (NT) is a distinctive feature of Beijing Mandarin, traditionally described as toneless and entirely dependent on the preceding tone. Recent studies suggest that NT may retain specific phonetic targets and exist on a continuum of reduction, challenging the strict neutral versus full-tone dichotomy. This study examines the phonetic realization of NT as influenced by three factors - preceding tone, underlying tone, and NT type - using word-list reading data from 36 Beijing Mandarin speakers. Our findings confirm a robust effect of the preceding tone. At the same time, we identify a significant impact of the underlying tone, indicating that NT is not entirely toneless but retains some degree of phonological specification. Moreover, the differences observed between optional and forbidden NT words suggest that NT should be conceptualized as part of a gradient system influenced by contextual factors, rather than as a simple neutral versus full-tone contrast.

Subject: INTERSPEECH.2025 - Others


#11 The Role of Syntactic Structures in Shaping Directionality in Trisyllabic Tone Sandhi: Evidence from Tianjin Mandarin

Authors: Siqi Lu, Hui Feng, Ziyu Xiong

Prior studies reported inconsistent syntactic influence on trisyllabic tone sandhi directionality in Tianjin Mandarin. This study investigates the role of six syntactic structures and generational differences in the application of sandhi rules, analyzing seven trisyllabic tone patterns across two structural configurations ("2+1"/"1+2"), with data from 26 native speakers. Key findings are: 1) The T4-T4-T1 pattern triggers right-edge sandhi, producing a novel T4-T2-T1 pattern. 2) The T3-T1-T1 pattern exhibits one-round sandhi in VP+N, A+NP, N+NP, Q+NP, and V+NP structures, but two-round application in NP+N. 3) Diachronic shifts: the absence of T4-T4 sandhi and the replacement of T1-T1→T3-T1 with T1-T1→T2-T1 apply universally across generations, restructuring trisyllabic outcomes without altering directionality. These findings clarify the syntax-sandhi interplay and generational variation in Tianjin Mandarin, revealing systematic directional patterns tied to syntactic structures.

Subject: INTERSPEECH.2025 - Others


#12 Acoustic Representation and Realization of Weak Elements Subcategories: In the Case of Tianjin Mandarin

Authors: Zhijie Li, Hui Feng

Weak elements in Tianjin Mandarin are often assumed to have distinct origins and functions across subcategories, yet acoustic differences among them remain underexplored. This study analyzes the first two formants (F1, F2) and fundamental frequency (F0) of weak elements produced by six Tianjin Mandarin speakers. Results reveal: (1) Vowel quality shows minimal variation across subcategories, with F1 reduction and F2 centralization in monophthongs, and F1 centralization and compression in diphthongs. (2) Tone realization varies: affixes, locative prepositions, directional verbs, reduplicated words and optional weak elements exhibit a fixed mid-low pitch target, achieved through target-setting. In contrast, structural particles, aspectual particles, habitual weak elements, and functional weak elements rely on contextual tone spreading. These findings highlight two coexisting patterns of tone realization in Tianjin Mandarin weak elements.

Subject: INTERSPEECH.2025 - Others


#13 Lexical competition in the process of Cantonese tone merging: Diverse Impact Mechanisms Across Different Individuals and Tone Pairs

Authors: Lishan Li, Yaolin Zhou, Xiaoying Xu

Lexical competition has been shown to exert an inhibitory influence on phonemic mergers in phonetic evolution. This study investigates the merging process of Cantonese tone pairs, examining the lexical competition effect from an individual perspective. Results reveal that lexical competition influences tone production even in speakers without clear mergers, helping maintain tonal contrasts while subtly altering pitch distributions, which may signal future change. For participants merging only one tone pair, the effect of lexical competition varies: it consistently inhibits merging for the T4 (21)-T6 (22) pair, but for the T3 (33)-T6 (22) pair, its influence manifests in three distinct patterns: no effect, inhibition, or promotion of merging. Conversely, in participants merging all three tones, lexical competition shows minimal impact. This research elucidates the diverse mechanisms through which lexical competition shapes tone-merging processes.

Subject: INTERSPEECH.2025 - Others


#14 Tonal Perception in Changde Mandarin

Authors: Zhenrui Zhang, Fang Hu

This paper explores the perceptual cues of the four tones in Changde Mandarin. Results show that T1 is perceived as a high-level tone, T2 is a low-rising tone in both production and perception, T3 is falling rather than phonologically level, and the perception of T4 is influenced by various factors. Results also show that tonal perception might not be categorical, since neither the T2-T3 continuum nor the T2-T4 continuum meets the criteria for typical categorical perception.
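The abstract does not detail the continuum or scoring procedure; as background, the sketch below shows one standard ingredient of a categorical-perception check: fitting a logistic identification function along a tone continuum and locating the category boundary, using made-up response proportions.

```python
# Sketch of one standard check for categorical perception: fit a logistic
# identification function along a tone continuum and locate the category
# boundary (illustrative; stimulus steps and responses are made up).
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

steps = np.arange(1, 8)                                       # 7-step T2-T3 continuum
prop_t2 = np.array([0.95, 0.9, 0.8, 0.55, 0.3, 0.1, 0.05])    # toy identification rates

(x0, k), _ = curve_fit(logistic, steps, prop_t2, p0=[4.0, -1.0])
print(f"boundary at step {x0:.2f}, slope {k:.2f}")
# Categorical perception would additionally require a discrimination peak at
# this boundary; a shallow identification slope or no peak suggests more
# continuous perception.
```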

Subject: INTERSPEECH.2025 - Others


#15 Tonal Contrasts in the Malipo Variety of the Mienic Language

Authors: Changhong Du, Fang Hu

This paper describes tonal contrasts in the Malipo variety of the Mienic language and explores how they are manifested by acoustic cues. Results show that Linear Mixed-Effects Models based on Growth Curve Analysis successfully characterize the 5 level tones, 3 falling tones, 3 rising or concave tones, and 2 convex tones in the Malipo variety. In addition to F0, duration also plays a role in this complex tonal system.
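The abstract does not give the model specification; as a rough illustration of growth curve analysis of F0 contours, the sketch below fits polynomial time terms in a linear mixed-effects model with statsmodels on toy data, with a simpler random-effects structure than a full analysis would use.

```python
# Sketch of growth curve analysis for F0 contours: polynomial time terms in a
# linear mixed-effects model (statsmodels; toy data, simplified random effects
# compared with the paper's models).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for speaker in range(6):
    for tone in ["level", "falling"]:
        for t in np.linspace(0, 1, 10):            # normalized time points
            f0 = 200 - (80 * t if tone == "falling" else 0) + rng.normal(0, 5)
            rows.append(dict(speaker=speaker, tone=tone, t=t, f0=f0))
df = pd.DataFrame(rows)

# centred linear and quadratic time terms (a simple stand-in for orthogonal polynomials)
df["t1"] = df["t"] - df["t"].mean()
df["t2"] = df["t1"] ** 2

model = smf.mixedlm("f0 ~ (t1 + t2) * tone", df, groups=df["speaker"])
print(model.fit().summary())
```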

Subject: INTERSPEECH.2025 - Others


#16 Robot-assisted Recognition of Vocal Emotions in Pseudospeech for Cochlear Implanted Adolescents

Authors: Gloria Araiza-Illan, Luke Meyer, Bert Maat, Deniz Başkent

Recognising vocal emotions in speech is difficult for children with hearing loss who use a cochlear implant (CI). As regular monitoring could be burdensome, we propose a NAO robot as a test interface. Adolescents with CIs (10-17 yr) performed the EmoHI test for recognising vocal emotions in pseudospeech (no linguistic emotion information), once with a computer and once with a NAO. The interfaces are compared via test results, test durations, and participants' perception of the interfaces. Test results (sensitivity index, d') were similar (0.36 ± 0.36 vs. 0.37 ± 0.43), but durations were significantly longer on the NAO (4.18 min ± 39 sec vs. 5.13 min ± 51 sec). The computer had higher perceived usability, but the NAO was rated more enjoyable, engaging, and preferable. Overall, the NAO sound quality seems sufficient for conducting the EmoHI test, even with a CI. The higher enjoyability ratings for the NAO may be especially useful in conducting such tests in populations with hearing devices.
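The EmoHI test's exact scoring procedure is not described here; for readers unfamiliar with the reported metric, the sketch below shows the standard way a sensitivity index d' is computed from hits and false alarms, with made-up counts.

```python
# How a sensitivity index d' like the one reported above is typically computed
# from hits and false alarms (illustrative counts, not the paper's data; the
# EmoHI test's own scoring may differ in detail).
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    # log-linear correction avoids infinite z-scores at rates of 0 or 1
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

print(round(d_prime(hits=14, misses=10, false_alarms=9, correct_rejections=15), 2))
```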

Subject: INTERSPEECH.2025 - Others


#17 Using Neurogram Similarity Index Measure (NSIM) to Model Hearing Loss and Cochlear Neural Degeneration

Authors: Ahsan Cheema, Sunil Puria

Trouble hearing in noisy situations remains a common complaint for both individuals with hearing loss and individuals with normal hearing. This is hypothesized to arise from a condition called cochlear neural degeneration (CND), which can also result in significant variability in hearing-aid outcomes. This paper uses computational models of the auditory periphery to simulate various hearing tasks. We present an objective method to quantify hearing loss and CND by comparing auditory nerve fiber responses using a Neurogram Similarity Index Measure (NSIM). Specifically, Study 1 shows that NSIM can be used to map the performance of individuals with hearing loss on a phoneme recognition task with reasonable accuracy. In Study 2, we show that NSIM is a sensitive measure that can also capture the deficits resulting from CND and is a candidate noninvasive biomarker of auditory synaptopathy.
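For intuition only, the sketch below computes a simplified, global SSIM-style similarity between a reference and a degraded neurogram; published NSIM definitions use windowed local statistics and specific constants, so treat this as a toy illustration rather than the paper's measure.

```python
# Simplified sketch of an SSIM-style neurogram comparison: a reference and a
# degraded neurogram are compared via intensity and structure terms.
# (Illustrative; window sizes, weighting, and constants in published NSIM
# definitions differ from this toy global version.)
import numpy as np

def toy_nsim(reference, degraded, dynamic_range=255.0):
    c1 = (0.01 * dynamic_range) ** 2
    c3 = ((0.03 * dynamic_range) ** 2) / 2
    mu_r, mu_d = reference.mean(), degraded.mean()
    sigma_r, sigma_d = reference.std(), degraded.std()
    sigma_rd = ((reference - mu_r) * (degraded - mu_d)).mean()
    intensity = (2 * mu_r * mu_d + c1) / (mu_r**2 + mu_d**2 + c1)
    structure = (sigma_rd + c3) / (sigma_r * sigma_d + c3)
    return intensity * structure

# reference vs. degraded (e.g. simulated CND) time-frequency neurograms
ref = np.random.rand(64, 200)
deg = ref * 0.6 + 0.1 * np.random.rand(64, 200)
print(toy_nsim(ref, deg))
```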

Subject: INTERSPEECH.2025 - Others


#18 Contrastive Learning-based Syllable-Level Mispronunciation Detection and Diagnosis for Speech Audiometry

Authors: Longbin Jin, Donghun Min, Jung Eun Shin, Eun Yi Kim

Speech audiometry assesses hearing disorders but typically relies on audiologists, making the process subjective and requiring in-person evaluation. In this paper, we introduce SylPh, a novel automatic syllable-level mispronunciation detection and diagnosis (MDD) model that generalizes across open-set syllables while also offering phonemic analysis. To capture a wide range of mispronunciation patterns, we construct positive and pseudo-negative bags to extract in-distribution and out-of-distribution features from input audio. Our model aligns audio features with adaptive text embeddings using a contrastive objective, dynamically adjusting decision boundaries for each syllable within a single model. Extensive experiments on a large-scale dataset demonstrate its effectiveness on both closed-set and open-set syllables. Notably, despite training only on syllable-level labels, SylPh can localize phoneme-level abnormalities, providing detailed diagnostic insights.

Subject: INTERSPEECH.2025 - Others


#19 A Deformable Convolution GAN Approach for Speech Dereverberation in Cochlear Implant Users

Authors: Hsin-Tien Chiang, John H.L. Hansen

Speech dereverberation is crucial for enhancing intelligibility and quality, especially for cochlear implant (CI) users, who are highly susceptible to smearing effects induced by reverberation. While conventional and deep learning-based methods have shown promise for normal-hearing (NH) individuals, their effectiveness for CI users remains limited. To bridge this gap, we propose a deformable convolutional GAN architecture for dereverberation tailored to CI users. The deformable convolution layers introduce kernel offset prediction, adaptively adjusting the receptive field based on the distortion in reverberant speech. We first evaluate the effectiveness of the proposed method on the REVERB Challenge dataset. A listening test is then conducted with both NH and CI users. Results show that the proposed method markedly improves speech intelligibility for CI users by preserving a more intact envelope structure, enhancing their ability to perceive key transient speech segments for sentence comprehension.
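As a minimal sketch of the deformable-convolution idea described above, the block below lets a plain convolution predict per-position kernel offsets that deform the receptive field, using torchvision's DeformConv2d; the channel sizes are arbitrary and the surrounding GAN is omitted.

```python
# Sketch of a deformable convolution building block: a plain conv predicts
# per-position kernel offsets that deform the receptive field
# (torchvision's DeformConv2d; channel sizes here are arbitrary examples).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch=64, out_ch=64, k=3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel element at every output position
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):                  # x: (B, C, freq, time) feature map
        offsets = self.offset_pred(x)      # receptive field adapts to local distortion
        return self.deform(x, offsets)

block = DeformableBlock()
spec_features = torch.randn(2, 64, 128, 100)   # e.g. spectrogram-domain features
print(block(spec_features).shape)              # torch.Size([2, 64, 128, 100])
```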

Subject: INTERSPEECH.2025 - Others


#20 L3C-DeepMFC: Low-Latency Low-Complexity Deep Marginal Feedback Cancellation with Closed-Loop Fine Tuning for Hearing Aids

Authors: Fengyuan Hao, Brian C. J. Moore, Huiyong Zhang, Xiaodong Li, Chengshi Zheng

Feedback control in hearing aids mitigates acoustic feedback caused by the coupling between the receiver and the microphone. While DNN-based methods have achieved progress, they remain computationally intensive with relatively high latency. This paper introduces L3C-DeepMFC, a low-latency and low-complexity time-frequency (T-F) domain method that employs complex spectrum mapping to estimate the magnitude and phase components of the desired speech. The method integrates full- and sub-band recurrent modeling to capture spectro-temporal patterns and modifies the overlap-add method for low-latency processing. Moreover, we utilize closed-loop fine-tuning with dynamically generated feedback mixtures to minimize the mismatch between training and estimation. Evaluations using the AISHELL-3 dataset confirm its competitive performance across various gains, significantly improving the maximum stable gain (MSG). Combining it with traditional methods yields further improvements in feedback suppression.

Subject: INTERSPEECH.2025 - Others


#21 Semantic Processing During Spoken Word Production by Children with Cochlear Implants

Authors: Man Wang, Yixin Ding, Niels Schiller

Research investigating the speech production of children with cochlear implants (CIs) mostly focuses on the characteristics of their speech sounds. Few studies have looked at the psycholinguistic processes during speech production of children with CIs. Our study examines semantic processing during speech production in this group of speakers, compared to their normal-hearing (NH) peers. Using the picture-word interference paradigm, we manipulated the semantic relatedness between target picture names and distractor words. We observed the typical semantic interference effect in the NH group but not in the CI group, suggesting that the semantic network may be organized differently in the CI group than in their NH peers and that the CI group may have difficulties accessing semantic categories. Furthermore, our results are in line with the suggestion that the CI group may rely more on a top-down strategy or attentional cognitive processing than on bottom-up semantic activation.

Subject: INTERSPEECH.2025 - Others


#22 Linguistic Masking and Its Release in Simulated Electric-acoustic Hearing

Authors: Yuting Ding, Xuefei Wang, Fei Chen

Combined electric-acoustic stimulation (EAS) with a cochlear implant (CI) and a hearing aid can significantly improve CI users' speech-in-noise performance, which is known as the combined-stimulation advantage. The present study aimed to investigate this advantage in the context of linguistic masking, i.e., when the target speech and masker were from the same or different languages, and how combined stimulation and different linguistic maskers affected linguistic release from masking (LRM). Mandarin sentences were mixed with 2-talker babble maskers spoken in Mandarin, Cantonese or English, processed by noise vocoder models simulating CI or EAS hearing, and presented to normal-hearing listeners to recognize. Experimental results showed the combined-stimulation advantage under linguistic masking with all three languages. The EAS condition yielded a significant difference between the scores of the two LRMs (i.e., English vs. Mandarin, and Cantonese vs. Mandarin). Under linguistic masking with Cantonese, there was a significant difference between the scores of CI- and EAS-based LRMs. These findings provide new insights into the benefits of combined electric-acoustic stimulation for CI users.

Subject: INTERSPEECH.2025 - Others


#23 Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations

Authors: Seongsil Heo, Christi Miller, Calvin Murdock, Michael Proulx

Turn-taking prediction is crucial for seamless interactions. This study introduces a novel, lightweight framework for accurate turn-taking prediction in triadic conversations without relying on computationally intensive methods. Unlike prior approaches that either disregard gaze or treat it as a passive signal, our model integrates gaze with speaker localization, structuring it within a spatial constraint to transform it into a reliable predictive cue. Leveraging egocentric behavioral cues, our experiments demonstrate that incorporating gaze data from a single user significantly improves prediction performance, while gaze data from multiple users further enhances it by capturing richer conversational dynamics. This study presents a lightweight and privacy-conscious approach to support adaptive, directional sound control, enhancing speech intelligibility in noisy environments, particularly for hearing assistance in smart glasses.

Subject: INTERSPEECH.2025 - Others


#24 Visual Cues Support Robust Turn-taking Prediction in Noise

Authors: Sam O'Connor Russell, Naomi Harte

Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in the types of noise likely to be encountered in deployment. Our analyses reveal that PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which incorporates visual features, to better exploit visual cues, reaching 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make our code publicly available for future research.

Subject: INTERSPEECH.2025 - Others


#25 Backchannel prediction for natural spoken dialog systems using general speaker and listener information

Authors: Yoshinori Fukunaga, Ryota Nishimura, Kengo Ohta, Norihide Kitaoka

Backchannel responses are a crucial component of conversations, enabling more effective communication through listener feedback. Current backchannel prediction models classify these responses into just three categories, using speech, text, and listener IDs. However, these IDs, which contain detailed personal information, cannot be used in real-world dialog systems, and three-category classification limits response generation capabilities. We therefore propose a model for predicting a backchannel's 'surface form' using only general speaker and listener embeddings. Our experiments show a 1.3% improvement in prediction accuracy for 3-category classification, and a 0.9% improvement for 11-category classification, compared to conventional ID embeddings, demonstrating improved performance in a form deployable in real-world systems.
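The abstract does not describe the model architecture; the sketch below is a hypothetical illustration of the core idea, feeding generic speaker and listener embeddings (e.g. from an x-vector-style encoder) alongside speech and text features into a surface-form classifier. All dimensions and the 11-class output are illustrative assumptions.

```python
# Toy sketch of replacing person-specific ID embeddings with general
# speaker/listener embeddings for backchannel surface-form prediction
# (architecture and dimensions are illustrative, not the paper's).
import torch
import torch.nn as nn

class BackchannelPredictor(nn.Module):
    def __init__(self, speech_dim=768, text_dim=768, spk_dim=192, n_forms=11):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(speech_dim + text_dim + 2 * spk_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_forms),        # e.g. 11 surface-form classes
        )

    def forward(self, speech_feat, text_feat, speaker_emb, listener_emb):
        # speaker/listener vectors come from a generic speaker encoder,
        # not from enrolled person IDs
        x = torch.cat([speech_feat, text_feat, speaker_emb, listener_emb], dim=-1)
        return self.classifier(x)

model = BackchannelPredictor()
logits = model(torch.randn(4, 768), torch.randn(4, 768),
               torch.randn(4, 192), torch.randn(4, 192))
print(logits.shape)   # torch.Size([4, 11])
```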

Subject: INTERSPEECH.2025 - Others