li25c@interspeech_2025@ISCA


Temporal Convolutional Network with Smoothed and Weighted Losses for Distant Voice Activity and Overlapped Speech Detection

Authors: Shaojie Li, Qintuya Si, De Hu

Voice Activity Detection (VAD) and Overlapped Speech Detection (OSD) are key steps in many audio and speech processing tasks. Recent work on VAD and OSD increasingly uses Temporal Convolutional Networks (TCNs) trained with a frame-independent cross-entropy loss, which may be unable to cope with transient errors or boundary errors (caused by weak recordings at speech boundaries). In this paper, we formulate two novel losses, a smoothed loss and a weighted loss: the former copes with transient errors, while the latter deals with boundary errors. In addition, we adopt Mel Frequency Cepstral Coefficients (MFCCs) and Instantaneous Correlation Coefficients (ICCs) as the acoustic and spatial features, respectively, to drive the model. To improve computational efficiency, we also propose a spatial feature extraction module that selects only the frequencies with information-rich ICCs, which keeps the module lightweight. Numerical experiments validate the efficacy of the proposed method.
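The abstract does not give the exact form of the two losses, so the following is only a rough sketch of the general idea, not the authors' formulation. It assumes binary frame labels, a hypothetical boundary weighting that up-weights frames near a label change, and a hypothetical smoothness penalty on frame-to-frame jumps in the prediction; the names `boundary_weights`, `weighted_loss`, `smoothed_loss`, `radius`, `boundary_weight`, and `smooth_coeff` are all illustrative.

```python
import numpy as np


def frame_bce(probs, labels, eps=1e-7):
    """Per-frame binary cross-entropy (the frame-independent baseline)."""
    p = np.clip(probs, eps, 1.0 - eps)
    return -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))


def boundary_weights(labels, radius=2, boundary_weight=3.0):
    """Assumed weighting: frames within `radius` of a label change get a larger weight."""
    weights = np.ones(len(labels), dtype=float)
    # Index i is returned when labels[i] != labels[i + 1].
    change_idx = np.flatnonzero(np.diff(labels) != 0)
    for i in change_idx:
        lo = max(0, i - radius + 1)
        hi = min(len(labels), i + 1 + radius)
        weights[lo:hi] = boundary_weight
    return weights


def weighted_loss(probs, labels, radius=2, boundary_weight=3.0):
    """Assumed weighted loss: boundary-weighted average of the per-frame BCE."""
    w = boundary_weights(labels, radius, boundary_weight)
    return float(np.sum(w * frame_bce(probs, labels)) / np.sum(w))


def smoothed_loss(probs, labels, smooth_coeff=1.0):
    """Assumed smoothed loss: mean BCE plus a penalty on prediction jumps
    that are not matched by a jump in the reference labels."""
    jump_pred = np.diff(probs)
    jump_ref = np.diff(labels)
    penalty = np.mean((jump_pred - jump_ref) ** 2)
    return float(np.mean(frame_bce(probs, labels)) + smooth_coeff * penalty)


labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=float)
probs = labels * 0.8 + 0.1          # confident, well-aligned prediction
spiky = probs.copy()
spiky[1] = 0.9                      # a single-frame transient false alarm

print(weighted_loss(probs, labels), smoothed_loss(spiky, labels))
```

In this sketch the transient spike at frame 1 raises the smoothness penalty in addition to its cross-entropy cost, while errors near the two speech boundaries would count more heavily in `weighted_loss` than errors in the middle of a segment.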

Subject: INTERSPEECH.2025 - Speech Detection