INTERSPEECH.2025 - Speech Detection

Total: 95

#1 Leveraging Text and Speech Processing for Suicide Risk Classification in Chinese Adolescents

Authors: Justyna Krzywdziak, Bartłomiej Eljasiak, Joanna Stępień, Michał Świątek, Agnieszka Pruszek

The increasing prevalence of depression among young people is a growing global concern. Early detection and intervention are crucial, making the development of effective diagnostic tools essential. This work explores the use of advanced text and speech processing techniques to classify suicide risk within the context of the SpeechWellness Challenge (SW1) and examines whether speech can serve as a non-invasive and readily available mental health indicator. The analysis incorporated both linguistic features and audio-based methods for spontaneous speech and passage reading. For text classification, language models such as Qwen2.5 and BERT were evaluated. For audio-based prediction, state-of-the-art speech processing models, including Whisper, Wav2Vec2, and HuBERT, were employed. Furthermore, a multimodal approach combining vocal and textual features was investigated. The results obtained in this research ranked among the highest in the challenge.

Subject: INTERSPEECH.2025 - Speech Detection


#2 The 1st SpeechWellness Challenge: Detecting Suicide Risk Among Adolescents

Authors: Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, Chao Zhang

The 1st SpeechWellness Challenge (SW1) aims to advance methods for detecting current suicide risk in adolescents using speech analysis techniques. Suicide among adolescents is a critical public health issue globally. Early detection of suicidal tendencies can lead to timely intervention and potentially save lives. Traditional methods of assessment often rely on self-reporting or clinical interviews, which may not always be accessible. The SW1 challenge addresses this gap by exploring speech as a non-invasive and readily available indicator of mental health. We release the SW1 dataset, which contains speech recordings from 600 adolescents aged 10-18 years. By focusing on speech generated from natural tasks, the challenge seeks to uncover patterns and markers that correlate with current suicide risk.

Subject: INTERSPEECH.2025 - Speech Detection


#3 Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection

Authors: Yifan Gao, Jiao Fu, Long Guo, Hong Liu

Early identification of suicide risk is crucial for preventing suicidal behaviors. As a result, the identification and study of patterns and markers related to suicide risk have become a key focus of current research. In this paper, we present the results of our work in the 1st SpeechWellness Challenge (SW1), which aims to explore speech as a non-invasive and easily accessible mental health indicator for identifying adolescents at risk of suicide. Our approach leverages a large language model (LLM) as the primary tool for feature extraction, alongside conventional acoustic and semantic features. The proposed method achieves an accuracy of 74% on the test set, ranking first in the SW1 challenge. These findings demonstrate the potential of LLM-based methods for analyzing speech in the context of suicide risk assessment.

Subject: INTERSPEECH.2025 - Speech Detection


#4 Predicting Adolescent Suicidal Risk from Multi-task-based Speech: An Ensemble Learning Approach

Authors: Xi Chen, Renzhe Yu, Yanshen Tan, Yiyi Li, Quan Qian, Ying Lin

Adolescent suicide is a pressing global public health issue. Timely identification of suicide risk is crucial. Traditional methods of assessing suicide risk are often limited by their reliance on subjective input and resource requirements. This paper aims to address these limitations by detecting suicide risk from multi-task-based speech, utilizing a dataset of 600 Chinese adolescents (age: 10-18 yr) provided by the 1st SpeechWellness Challenge. Our approach combined acoustic and semantic features extracted by OpenSMILE, Emotion2Vec, and a fine-tuned BERT-Chinese model. Base models such as XGBoost and SVM classifiers were trained with hyperparameters tuned by Bayesian optimization. We then implemented a multi-model, multi-task nested voting ensemble framework to integrate the base models, achieving a final test-set accuracy of 0.63 (recall=0.74, F1≈0.67). This work highlights the potential of voice-based biomarkers in mental health assessment.
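
A minimal sketch of the multi-model voting idea described in this abstract, assuming the acoustic and semantic features (e.g., from OpenSMILE, Emotion2Vec, and BERT-Chinese) have already been extracted and fused into a single feature matrix. The base models and hyperparameters below are illustrative stand-ins, not the authors' exact nested ensemble; a tuned xgboost.XGBClassifier could replace the gradient-boosting model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder for the real challenge features: 600 speakers x 256-dim fused features.
X, y = make_classification(n_samples=600, n_features=256, random_state=0)

# Two heterogeneous base models; a tuned xgboost.XGBClassifier could replace the GBM.
svm = make_pipeline(StandardScaler(), SVC(C=1.0, probability=True, random_state=0))
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(estimators=[("svm", svm), ("gbm", gbm)], voting="soft")
print(cross_val_score(ensemble, X, y, cv=5, scoring="f1").mean())
```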

Subject: INTERSPEECH.2025 - Speech Detection


#5 In-context learning capabilities of Large Language Models to detect suicide risk among adolescents from speech transcripts

Authors: Filomene Roquefort, Alexandre Ducorroy, Rachid Riad

Early suicide risk detection in adolescents is critical yet hindered by the scalability challenges of current assessments. This paper presents our approach to the first SpeechWellness Challenge (SW1), which aims to assess suicide risk in Chinese adolescents through speech analysis. Due to speech anonymization constraints, we focused on linguistic features, leveraging Large Language Models (LLMs) for transcript-based classification. Using DSPy for systematic prompt engineering, we developed a robust in-context learning approach that outperformed traditional fine-tuning on both linguistic and acoustic markers. Our systems achieved third and fourth places among 180+ submissions, with 0.68 accuracy (F1=0.7) using only transcripts. Ablation analyses showed that increasing the number of prompt examples improved performance (p=0.003), with varying effects across model types and sizes. These findings advance automated suicide risk assessment and demonstrate LLMs' value in mental health applications.
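
The sketch below only illustrates the in-context learning setup: a few labelled transcripts are placed in the prompt before the target transcript. The authors use DSPy for systematic prompt optimisation; this plain-string version is a simplified stand-in, and `call_llm` is a hypothetical client function, not a specific API.

```python
def build_prompt(examples, target_transcript):
    """examples: list of (transcript, label) pairs, label in {'at risk', 'not at risk'}."""
    lines = [
        "You assess suicide risk from adolescent speech transcripts.",
        "Answer with exactly 'at risk' or 'not at risk'.",
        "",
    ]
    for transcript, label in examples:  # in-context (few-shot) demonstrations
        lines += [f"Transcript: {transcript}", f"Assessment: {label}", ""]
    lines += [f"Transcript: {target_transcript}", "Assessment:"]
    return "\n".join(lines)


def classify(examples, target_transcript, call_llm):
    """call_llm: hypothetical function mapping a prompt string to the model's reply."""
    answer = call_llm(build_prompt(examples, target_transcript)).lower()
    return "not at risk" if "not" in answer else "at risk"
```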

Subject: INTERSPEECH.2025 - Speech Detection


#6 Language-Agnostic Suicidal Risk Detection Using Large Language Models

Authors: June-Woo Kim, Wonkyo Oh, Haram Yoon, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang

Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.

Subject: INTERSPEECH.2025 - Speech Detection


#7 Network of acoustic characteristics for the automatic detection of suicide risk from speech. Contribution to the 2025 SpeechWellness challenge by the Semawave team

Authors: Vincent P. Martin, Charles Brazier, Maxime Amblard, Michel Musiol, Jean-Luc Rouas

Suicide is a leading cause of death among young individuals. Although early detection and intervention are vital for preventing suicide attempts, current suicide risk assessments rely heavily on clinical interviews and questionnaires, both of which are subject to patient biases. In contrast, speech analysis provides several objective advantages for estimating suicide risk. This is the focus of the 2025 SpeechWellness Challenge. This article presents a new paradigm for speech analysis based on network analyses of low-level descriptors. We evaluate the performance of this approach compared to the classical eGeMAPS+SVM model for suicide risk detection. Additionally, we assess the relevance of comparing networks derived from reading and spontaneous speech, and explore different methods for network construction, analyzing their respective performances.
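
The abstract does not spell out how the network of acoustic characteristics is built; as a purely illustrative reading, the sketch below thresholds the absolute correlation matrix of frame-level low-level descriptors (LLDs) and summarises the resulting graph with a few topological statistics that could feed a downstream classifier.

```python
import networkx as nx
import numpy as np

def lld_network_features(lld_frames, threshold=0.5):
    """lld_frames: (n_frames, n_descriptors) LLD trajectories for one recording."""
    corr = np.corrcoef(lld_frames, rowvar=False)       # descriptor-by-descriptor correlation
    adj = (np.abs(corr) >= threshold).astype(float)    # keep strong links only
    np.fill_diagonal(adj, 0.0)
    g = nx.from_numpy_array(adj)
    return {
        "density": nx.density(g),
        "avg_clustering": nx.average_clustering(g),
        "n_components": nx.number_connected_components(g),
    }

rng = np.random.default_rng(0)
print(lld_network_features(rng.normal(size=(500, 25))))   # toy stand-in for eGeMAPS-style LLDs
```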

Subject: INTERSPEECH.2025 - Speech Detection


#8 Speech Reference Intervals: An Assessment of Feasibility in Depression Symptom Severity Prediction

Authors: Lauren White, Ewan Carr, Judith Dineley, Catarina Botelho, Pauline Conde, Faith Matcham, Carolin Oetzmann, Amos Folarin, George Fairs, Agnes Norbury, Stefano Goria, Srinivasan Vairavan, Til Wykes, Richard Dobson, Vaibhav Naraya, Matthew Hotopf, Alberto Abad, Isabel Trancoso, Nicholas Cummins

Major Depressive Disorder (MDD) is a prevalent mental disorder. Combining speech features and machine learning has promise for predicting MDD, but interpretability is crucial for clinical applications. Reference intervals (RIs) represent a typical range for a speech feature in a population. RIs could increase interpretability and help clinicians identify deviations from norms. They could also replace conventional speech features in machine learning models. However, no work has yet assessed the feasibility of speech RIs in MDD. We generated and compared RIs from three reference datasets varying in size, elicitation prompt, and health information. We then calculated deviations from each RI set for people with MDD to compare performance on a depression symptom severity prediction task. Our RI-based models trained with demographic data performed similarly to each other and to equivalent models using conventional features or demographics only, demonstrating the value of RI-derived features.
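
A minimal sketch of the reference-interval (RI) idea, assuming a reference population's values for one speech feature (e.g., speaking rate) are available as an array. The 2.5th-97.5th percentile range used here is a common RI convention; the paper's exact RI construction and deviation features may differ.

```python
import numpy as np

def reference_interval(reference_values, coverage=0.95):
    """Central interval covering `coverage` of the reference population."""
    lo = np.percentile(reference_values, 100 * (1 - coverage) / 2)
    hi = np.percentile(reference_values, 100 * (1 + coverage) / 2)
    return lo, hi

def deviation_features(values, lo, hi):
    """Signed distance outside the RI; 0 for values inside the interval."""
    values = np.asarray(values, dtype=float)
    return np.where(values < lo, values - lo, np.where(values > hi, values - hi, 0.0))

rng = np.random.default_rng(0)
ref = rng.normal(4.5, 0.6, size=1000)        # toy reference speaking rates (syllables/s)
lo, hi = reference_interval(ref)
print(lo, hi, deviation_features([2.9, 4.4, 6.2], lo, hi))
```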

Subject: INTERSPEECH.2025 - Speech Detection


#9 DepressGEN: Synthetic Data Generation Framework for Depression Detection

Authors: Wenrui Liang, Rong Zhang, Xuezhen Zhang, Ying Ma, Wei-Qiang Zhang

Automated depression detection is vital for early diagnosis, but ethical and privacy concerns often limit the availability of sufficient training data, hindering research in depression screening. To address this, we introduce DepressGEN, a novel framework that generates synthetic interview dialogue texts and speech simulating depressed patients to improve training for detection models. By inputting linguistic features associated with depression into a large language model, we create dialogue texts and use a TTS system to generate corresponding speech. We also developed a depression modulation module to modify the synthesized speech, as well as a speech verification module to bridge the gap between synthetic and real data distributions. Our results demonstrate that a GRU/BiLSTM-based model trained with additional synthetic data improves F1 scores by 9.9% compared to the same model trained only on original data, outperforming existing methods on the EATD dataset.

Subject: INTERSPEECH.2025 - Speech Detection


#10 Emotion-Guided Graph Attention Networks for Speech-Based Depression Detection under Emotion-Inducing Tasks

Authors: Yuqiu Zhou, Yongjie Zhou, Yudong Yang, Yang Liu, Jun Huang, Shuzhi Zhao, Rongfeng Su, Lan Wang, Nan Yan

Depression affects emotional expression and perception. As a non-invasive and privacy-preserving method, speech is widely used for automatic depression detection. However, existing models often focus only on depressive features in speech, ignoring the differing emotion-expression patterns across emotion-inducing tasks. To address this, we propose an emotion-guided graph attention network (emoGAT) for depression detection. By collecting speech-text data from depressed individuals and healthy controls during emotion-inducing tasks, we construct graph embeddings using sentiment cues from both speech and text. Experimental results show that our method reduces the standard deviation by 1.8% and improves accuracy by 4.36%. Graph attention visualization also reveals depression-specific characteristics, such as flattened prosody in neutral picture description tasks and cognitive biases toward negative information, offering deeper insights into emotional relational expressions.

Subject: INTERSPEECH.2025 - Speech Detection


#11 Explainable Depression Detection using Masked Hard Instance Mining

Authors: Patawee Prakrankamanant, Shinji Watanabe, Ekapol Chuangsuwanich

This paper addresses the critical need for improved explainability in text-based depression detection. While current solutions offer predictive outcomes, they often neglect explaining model predictions, which can hinder trust in the system. We propose the use of Masked Hard Instance Mining (MHIM) to enhance explainability in the depression detection task. MHIM strategically masks attention weights within the model, compelling it to distribute attention across a wider range of salient features. We evaluate MHIM on two datasets representing distinct languages: Thai (Thai-Maywe) and English (DAIC-WOZ). Our results demonstrate that MHIM significantly improves performance in terms of both prediction accuracy and explainability metrics.
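
One plausible reading of the masking step for an attention-based text classifier, in PyTorch: during training, the instances receiving the highest attention scores are masked out so the model must spread attention over other salient evidence. The masking ratio and placement are illustrative, not the paper's exact recipe.

```python
import torch

def mhim_mask(attention_scores, mask_ratio=0.2, training=True):
    """attention_scores: (batch, num_instances) pre-softmax scores.
    Sets the top-scoring ('hard') instances to -inf during training."""
    if not training:
        return attention_scores
    k = max(1, int(mask_ratio * attention_scores.size(1)))
    topk = attention_scores.topk(k, dim=1).indices
    masked = attention_scores.clone()
    masked.scatter_(1, topk, float("-inf"))
    return masked

scores = torch.randn(4, 10)                        # toy instance-level attention scores
weights = torch.softmax(mhim_mask(scores), dim=1)  # attention after masking hard instances
print(weights.sum(dim=1))                          # each row still sums to 1
```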

Subject: INTERSPEECH.2025 - Speech Detection


#12 Test-Time Training for Speech-based Depression Detection

Authors: Sri Harsha Dumpala, Chandramouli S. Sastry, Rudolf Uher, Sageev Oore

Previous works on speech-based depression detection typically use datasets collected in similar environments for both training and testing the models. However, in practice, the training and testing distributions often differ. Distributional shifts in speech can result from various factors, such as differences in recording environments (e.g., background noise) and demographic attributes (e.g., gender, age). These shifts can significantly degrade the performance of depression detection models. In this paper, we analyze the application of test-time training (TTT) to improve the robustness of depression detection models against such shifts. Our results demonstrate that TTT can substantially enhance model performance under various distributional shifts, including those caused by (a) background noise, (b) gender bias, and (c) differences in data collection and curation procedures, where training and testing samples originate from different datasets.
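
A minimal sketch of adapting a depression classifier at test time. The paper does not detail its auxiliary objective here, so this sketch uses prediction-entropy minimisation, a common test-time adaptation objective, purely as an illustration; `model` is any PyTorch classifier over pre-extracted speech features.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, test_batch, steps=5, lr=1e-4):
    model.train()                                  # keep normalisation layers adaptive
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(test_batch), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()                         # adapt to the shifted test distribution
        opt.step()
    model.eval()
    return model(test_batch).argmax(dim=-1)

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
print(test_time_adapt(model, torch.randn(16, 128)))   # toy 128-dim feature batch
```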

Subject: INTERSPEECH.2025 - Speech Detection


#13 Leveraging Ordinal Information for Speech-based Depression Classification

Authors: Lishi Zuo, Man-Wai Mak

While depression is inherently ordinal, much of the previous work in depression detection oversimplifies the task by treating it as binary classification, ignoring the subtle variations and ordering in depression severity. We propose using an ordinal loss to create a latent space that encodes ordinal information and thereby aids depression classification. Specifically, we define K thresholds for the depression scores, thereby creating a series of binary classification tasks on different levels of depression (e.g., mild vs. non-mild). The ordinal loss allows the model to capture the relationships between these levels on top of the binary classification task. Our approach outperforms current state-of-the-art depression detection methods, highlighting the importance of considering the inherent ordinal nature of depression severity.
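
A sketch of the K-threshold decomposition described above: a depression score is converted into K binary targets ("severity exceeds threshold k"), and the ordinal loss is the binary cross-entropy over those K tasks. The thresholds below are illustrative severity cut-offs, and how this loss is combined with the main classification objective is left out.

```python
import torch
import torch.nn.functional as F

def ordinal_targets(scores, thresholds):
    """scores: (batch,) depression scores; thresholds: (K,). Returns (batch, K) 0/1 targets."""
    return (scores.unsqueeze(1) > thresholds.unsqueeze(0)).float()

def ordinal_loss(logits, scores, thresholds):
    """logits: (batch, K), one logit per 'exceeds threshold k' binary task."""
    return F.binary_cross_entropy_with_logits(logits, ordinal_targets(scores, thresholds))

thresholds = torch.tensor([5.0, 10.0, 15.0])   # illustrative severity cut-offs
scores = torch.tensor([3.0, 12.0, 20.0])       # toy depression scores
logits = torch.randn(3, 3, requires_grad=True)
ordinal_loss(logits, scores, thresholds).backward()
```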

Subject: INTERSPEECH.2025 - Speech Detection


#14 Zero-Shot Speech-Based Depression and Anxiety Assessment with LLMs

Authors: Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz

The use of Large Language Models (LLMs) for psychological state assessment from speech has gained significant interest, particularly in analysing and predicting mental health. In this paper, we explore the potential of eight instruct-tuned LLMs (Llama-3.1-8B, Ministral, Gemma-2-9B, Phi-4, Mistral, DeepSeek-Qwen, QwQ-Preview and Llama-3.3-70B) in a zero-shot setting to predict Hospital Anxiety and Depression Scale (HADS) depression and anxiety scores from one-to-two minute spontaneous speech recordings from the PsyVoiD database. We evaluate how transcript quality affects LLM responses by comparing performance using ground-truth transcriptions versus transcripts generated by Whisper models of different sizes. Spearman correlation coefficients and statistical analysis demonstrate significant and notable potential of the LLMs to predict psychological states in a zero-shot setting.

Subject: INTERSPEECH.2025 - Speech Detection


#15 Towards the Objective Characterisation of Major Depressive Disorder Using Speech Data from a 12-week Observational Study with Daily Measurements

Authors: Robert Lewis, Szymon Fedor, Nelson Hidalgo Julia, Joshua Curtiss, Jiyeon Kim, Noah Jones, David Mischoulon, Thomas F Quatieri, Nicholas Cummins, Paola Pedrelli, Rosalind Picard

Our analysis focuses on identifying relations between properties of the voice and depression symptom severity. On a novel corpus of 3,374 longitudinal speech recordings from 71 patients clinically diagnosed with major depressive disorder (MDD), we use a statistical modelling approach to identify associations between depression symptom severity and 38 acoustic and cognitive features. Speaking rate and articulation rate show significant negative associations with daily within-individual fluctuations of depression. Furthermore, when analysing how changes in speech-derived features covary over time with changes in depression severity, we find that pitch standard deviation, speaking rate, and articulation rate all have significant negative associations. We also discover that several performance metrics derived from the cognitive tasks (digit-span and Stroop) have significant associations with fluctuations or changes in depression symptom severity.

Subject: INTERSPEECH.2025 - Speech Detection


#16 Can Speech Accurately Detect Depression in Patients With Comorbid Dementia? An Approach for Mitigating Confounding Effects of Depression and Dementia

Authors: Sophie Young, Fuxiang Tao, Bahman Mirheidari, Madhurananda Pahar, Markus Reuber, Heidi Christensen

Approximately 15.9% of people living with dementia experience co-occurring major depressive disorder. Both disorders cause similar early clinical symptoms in older people, but treatment options and patient outcomes differ. While it is challenging, it is therefore critical for clinicians to be able to distinguish between them. We build on existing research into objective markers of depression in speech, testing their generalizability to a more complex population. On a novel comorbidity dataset, we demonstrate that existing depression classification methods perform worse for participants with dementia than they do for those with no cognitive decline. We also propose a method of applying Wasserstein distance-based weight vectors to emphasize depression-related information that is robust against the effect of dementia. This improves performance for participants with dementia without requiring changes to the model architectures. Our best performing model achieves an overall F1-score of 81.0%.
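
The exact construction of the weight vectors is in the paper; as one heavily simplified reading, the sketch below weights each feature by how strongly its distribution separates depressed from non-depressed speakers relative to how strongly it separates dementia from non-dementia speakers, using the 1-D Wasserstein distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def depression_weights(X, depression_labels, dementia_labels, eps=1e-6):
    """X: (n_samples, n_features). Returns one weight per feature (sums to 1)."""
    dep, no_dep = X[depression_labels == 1], X[depression_labels == 0]
    dem, no_dem = X[dementia_labels == 1], X[dementia_labels == 0]
    w = []
    for j in range(X.shape[1]):
        dep_shift = wasserstein_distance(dep[:, j], no_dep[:, j])  # depression-related shift
        dem_shift = wasserstein_distance(dem[:, j], no_dem[:, j])  # dementia-related shift
        w.append(dep_shift / (dem_shift + eps))
    w = np.asarray(w)
    return w / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                                     # toy feature matrix
print(depression_weights(X, rng.integers(0, 2, 200), rng.integers(0, 2, 200))[:5])
```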

Subject: INTERSPEECH.2025 - Speech Detection


#17 Temporal Convolutional Network with Smoothed and Weighted Losses for Distant Voice Activity and Overlapped Speech Detection

Authors: Shaojie Li, Qintuya Si, De Hu

Voice Activity Detection (VAD) and Overlapped Speech Detection (OSD) are key steps in various audio/speech processing tasks. Recent advances in VAD and OSD are moving toward Temporal Convolutional Networks (TCNs) trained with a frame-independent cross-entropy loss, which may be unable to cope with transient errors or boundary errors (caused by weak recordings at speech boundaries). In this paper, we formulate two novel losses, namely a smoothed loss and a weighted loss, in which the former copes with transient errors while the latter deals with boundary errors. In addition, we adopt Mel Frequency Cepstral Coefficients (MFCCs) and Instantaneous Correlation Coefficients (ICCs) as the acoustic and spatial features driving the model. To improve computational efficiency, we also propose a spatial feature extraction module that selects frequencies with information-rich ICCs, keeping the module lightweight. Numerical experiments validate the efficacy of the proposed method.
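
The precise loss formulations are given in the paper; the PyTorch sketch below only illustrates the two ideas for frame-level VAD: a smoothed term that compares a temporally averaged posterior with the labels (discouraging transient flips), and a weighted term that up-weights frames adjacent to speech boundaries.

```python
import torch
import torch.nn.functional as F

def smoothed_bce(logits, labels, kernel=5):
    """logits, labels: (batch, frames). Average the posteriors over a short window first."""
    probs = torch.sigmoid(logits).unsqueeze(1)                     # (B, 1, T)
    window = torch.ones(1, 1, kernel, device=logits.device) / kernel
    smoothed = F.conv1d(probs, window, padding=kernel // 2).squeeze(1)
    return F.binary_cross_entropy(smoothed.clamp(1e-6, 1 - 1e-6), labels)

def boundary_weighted_bce(logits, labels, boundary_weight=3.0):
    """Frames on either side of a label change receive a larger weight."""
    change = (labels[:, 1:] != labels[:, :-1]).float()
    near_boundary = F.pad(change, (1, 0)) + F.pad(change, (0, 1))  # mark both sides
    weights = 1.0 + (boundary_weight - 1.0) * (near_boundary > 0).float()
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)

logits, labels = torch.randn(2, 100), (torch.rand(2, 100) > 0.5).float()
print(smoothed_bce(logits, labels).item(), boundary_weighted_bce(logits, labels).item())
```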

Subject: INTERSPEECH.2025 - Speech Detection


#18 Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

Authors: Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik

Voice Activity Detection (VAD) plays a vital role in speech processing, often relying on hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best fusion model outperforms the state-of-the-art Pyannote VAD model across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
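
A minimal sketch of the addition-based fusion that the abstract reports as most effective: MFCCs and pre-trained-model (PTM) features are projected to a common dimension, summed, and passed to a frame-level VAD head. The dimensions and the head are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdditionFusionVAD(nn.Module):
    def __init__(self, mfcc_dim=39, ptm_dim=768, hidden=256):
        super().__init__()
        self.mfcc_proj = nn.Linear(mfcc_dim, hidden)
        self.ptm_proj = nn.Linear(ptm_dim, hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))   # per-frame speech logit

    def forward(self, mfcc, ptm):
        # mfcc: (batch, frames, mfcc_dim); ptm: (batch, frames, ptm_dim), time-aligned.
        fused = self.mfcc_proj(mfcc) + self.ptm_proj(ptm)            # simple additive fusion
        return self.head(fused).squeeze(-1)

model = AdditionFusionVAD()
print(model(torch.randn(2, 100, 39), torch.randn(2, 100, 768)).shape)   # (2, 100) frame logits
```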

Subject: INTERSPEECH.2025 - Speech Detection


#19 SpeechMLC: Speech Multi-label Classification

Authors: Miseul Kim, Seyun Um, Hyeonjin Cha, Hong-Goo Kang

In this paper, we propose a multi-label classification framework to detect multiple speaking styles in a speech sample. Unlike previous studies that have primarily focused on identifying a single target style, our framework effectively captures various speaker characteristics within a unified structure, making it suitable for generalized human-computer interaction applications. The proposed framework integrates cross-attention mechanisms within a transformer decoder to extract salient features associated with each target label from the input speech. To mitigate the data imbalance inherent in multi-label speech datasets, we employ a data augmentation technique based on a speech generation model. We validate our model's effectiveness through multiple objective evaluations on seen and unseen corpora. In addition, we provide an analysis of the influence of human perception on classification accuracy by considering the impact of human labeling agreement on model performance.
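
A rough sketch of the decoder-side idea: one learned query per speaking-style label cross-attends to the encoded speech sequence, and each attended query is scored independently for multi-label prediction. The layer sizes and number of labels are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LabelQueryDecoder(nn.Module):
    def __init__(self, num_labels=6, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, d_model) from any speech encoder.
        queries = self.label_queries.unsqueeze(0).expand(speech_feats.size(0), -1, -1)
        attended = self.decoder(tgt=queries, memory=speech_feats)    # cross-attention per label
        return self.scorer(attended).squeeze(-1)                     # (batch, num_labels) logits

logits = LabelQueryDecoder()(torch.randn(2, 120, 256))
print(torch.sigmoid(logits).shape)   # independent per-label probabilities for multi-label output
```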

Subject: INTERSPEECH.2025 - Speech Detection


#20 Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment

Authors: Dohyun Kim, Jiwook Hwang

In open-vocabulary keyword spotting, an acoustic encoder pre-trained with Connectionist Temporal Classification (CTC) loss is typically used to train a text encoder by aligning audio embedding space with text embedding space. In previous work, word-aligned datasets were created by forced alignment algorithms such as the Montreal Forced Aligner (MFA) to train text encoder and verifier models. In this paper, we propose a new training pipeline for open-vocabulary keyword spotting using the W-CTC forced alignment algorithm, a simple modification of the practical CTC algorithm. Our approach eliminates the need for creating word-aligned datasets, operates in a fully end-to-end manner, and demonstrates superior performance on the Libriphrase hard dataset.

Subject: INTERSPEECH.2025 - Speech Detection


#21 Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis

Authors: Anna Seo Gyeong Choi, Alexander Richardson, Ryan Partlan, Sunny X. Tang, Sunghye Cho

This study compares three acoustic feature extraction toolkits—OpenSMILE, Praat, and Librosa—applied to clinical speech data from individuals with schizophrenia spectrum disorders (SSD) and healthy controls (HC). By standardizing extraction parameters across the toolkits, we analyzed speech samples from 77 SSD and 87 HC participants and found significant toolkit-dependent variations. While F0 percentiles showed high cross-toolkit correlation (r=0.962–0.999), measures like F0 standard deviation and formant values often had poor, even negative, agreement. Additionally, correlation patterns differed between SSD and HC groups. Classification analysis identified F0 mean, HNR, and MFCC1 (AUC > 0.70) as promising discriminators. These findings underscore reproducibility concerns and advocate for standardized protocols, multi-toolkit cross-validation, and transparent reporting.
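
A small sketch of the kind of cross-toolkit check reported above: extract F0 for the same files with two toolkits and correlate summary statistics across files. librosa's pyin and Praat via parselmouth are shown (openSMILE could be added the same way); the file list and pitch-range settings are hypothetical.

```python
import librosa
import numpy as np
import parselmouth
from scipy.stats import spearmanr

def f0_median_librosa(path):
    y, sr = librosa.load(path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    return np.nanmedian(f0)

def f0_median_praat(path):
    pitch = parselmouth.Sound(path).to_pitch(pitch_floor=65, pitch_ceiling=400)
    f0 = pitch.selected_array["frequency"]
    return np.median(f0[f0 > 0])                    # Praat marks unvoiced frames with 0 Hz

paths = ["spk01.wav", "spk02.wav", "spk03.wav"]     # hypothetical recording list
med_librosa = [f0_median_librosa(p) for p in paths]
med_praat = [f0_median_praat(p) for p in paths]
print(spearmanr(med_librosa, med_praat))            # cross-toolkit agreement on median F0
```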

Subject: INTERSPEECH.2025 - Speech Detection


#22 Can We Trust Machine Learning? The Reliability of Features from Open-Source Speech Analysis Tools for Speech Modeling

Authors: Tahiya Chowdhury, Veronica Romero

Machine learning-based behavioral models rely on features extracted from audio-visual recordings. The recordings are processed using open-source tools to extract speech features for classification models. These tools often lack validation to ensure reliability in capturing behaviorally relevant information. This gap raises concerns about reproducibility and fairness across diverse populations and contexts. Speech processing tools, when used outside of their design context, can fail to capture behavioral variations equitably and can thereby contribute to bias. We evaluate speech features extracted by two widely used speech analysis tools, OpenSMILE and Praat, to assess their reliability for adolescents with autism. We observed considerable variation in features across tools, which influenced model performance across contexts and demographic groups. We encourage domain-relevant verification to enhance the reliability of machine learning models in clinical applications.

Subject: INTERSPEECH.2025 - Speech Detection


#23 SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification

Authors: Theo Lepage, Reda Dehak

Self-Supervised Learning (SSL) has led to considerable progress in Speaker Verification (SV). The standard framework uses same-utterance positive sampling and data-augmentation to generate anchor-positive pairs of the same speaker. This is a major limitation, as this strategy primarily encodes channel information from the recording condition, shared by the anchor and positive. We propose a new positive sampling technique to address this bottleneck: Self-Supervised Positive Sampling (SSPS). For a given anchor, SSPS aims to find an appropriate positive, i.e., of the same speaker identity but a different recording condition, in the latent space using clustering assignments and a memory queue of positive embeddings. SSPS improves SV performance for both SimCLR and DINO, reaching 2.57% and 2.53% EER, outperforming SOTA SSL methods on VoxCeleb1-O. In particular, SimCLR-SSPS achieves a 58% EER reduction by lowering intra-speaker variance, providing comparable performance to DINO-SSPS.
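
A compact sketch of the positive-sampling step: for each anchor, a positive embedding is drawn from a memory queue of past embeddings that share the anchor's cluster assignment (a proxy for speaker identity) but come from a different recording, so the pair no longer shares the anchor's channel conditions. Queue maintenance and the clustering itself are simplified away.

```python
import torch

def ssps_sample_positives(anchor_clusters, anchor_utts, queue_emb, queue_clusters, queue_utts):
    """anchor_clusters, anchor_utts: (B,) ids for the current batch.
    queue_*: K stored embeddings with their cluster and utterance ids."""
    positives = []
    for c, u in zip(anchor_clusters.tolist(), anchor_utts.tolist()):
        idx = ((queue_clusters == c) & (queue_utts != u)).nonzero(as_tuple=True)[0]
        if len(idx) == 0:                            # fallback: any same-cluster entry
            idx = (queue_clusters == c).nonzero(as_tuple=True)[0]
        if len(idx) == 0:                            # degenerate fallback: whole queue
            idx = torch.arange(len(queue_emb))
        pick = idx[torch.randint(len(idx), (1,))]    # random positive from the candidates
        positives.append(queue_emb[pick])
    return torch.cat(positives, dim=0)               # (B, D) positive embeddings

queue_emb = torch.randn(1000, 192)                   # toy queue of 192-dim embeddings
queue_clusters = torch.randint(0, 50, (1000,))
queue_utts = torch.randint(0, 500, (1000,))
pos = ssps_sample_positives(torch.tensor([3, 7]), torch.tensor([11, 42]),
                            queue_emb, queue_clusters, queue_utts)
print(pos.shape)
```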

Subject: INTERSPEECH.2025 - Speech Detection


#24 ParaNoise-SV: Integrated Approach for Noise-Robust Speaker Verification with Parallel Joint Learning of Speech Enhancement and Noise Extraction

Authors: Minu Kim, Kangwook Jang, Hoirin Kim

Noise-robust speaker verification leverages joint learning of speech enhancement (SE) and speaker verification (SV) to improve robustness. However, prevailing approaches rely on implicit noise suppression, which struggles to separate noise from speaker characteristics as they do not explicitly distinguish noise from speech during training. Although integrating SE and SV helps, it remains limited in handling noise effectively. Meanwhile, recent SE studies suggest that explicitly modeling noise, rather than merely suppressing it, enhances noise resilience. Reflecting this, we propose ParaNoise-SV, with dual U-Nets combining a noise extraction (NE) network and a speech enhancement (SE) network. The NE U-Net explicitly models noise, while the SE U-Net refines speech with guidance from NE through parallel connections, preserving speaker-relevant features. Experimental results show that ParaNoise-SV achieves a relatively 8.4% lower equal error rate (EER) than previous joint SE-SV models.

Subject: INTERSPEECH.2025 - Speech Detection


#25 Disentangling Speaker and Content in Pre-trained Speech Models with Latent Diffusion for Robust Speaker Verification

Authors: Zhe Li, Man-Wai Mak, Jen-Tzung Chien, Mert Pilanci, Zezhong Jin, Helen Meng

Disentangled speech representation learning for speaker verification aims to separate spoken content and speaker timbre into distinct representations. However, existing variational autoencoder (VAE)-based methods for speech disentanglement rely on latent variables that lack semantic meaning, limiting their effectiveness for speaker verification. To address this limitation, we propose a diffusion-based method that disentangles and separates speaker features and speech content in the latent space. Building upon the VAE framework, we employ a speaker encoder to learn latent variables representing speaker features while using frame-specific latent variables to capture content. Unlike previous sequential VAE approaches, our method utilizes a conditional diffusion model in the latent space to derive speaker-aware representations. Experiments on the VoxCeleb datasets demonstrate that our method effectively isolates speaker features from speech content using pre-trained speech models.

Subject: INTERSPEECH.2025 - Speech Detection