Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

Authors: Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik

Voice Activity Detection (VAD) plays a vital role in speech processing and typically relies on either hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best fusion model outperforms the state-of-the-art Pyannote VAD model across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
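
To make the three fusion strategies named in the abstract concrete, the following is a minimal NumPy sketch. It is not the authors' FusionVAD implementation. The feature dimensions (39 for MFCCs, 768 for PTM features), the shared width of 64, the random linear projections, the single-head attention without learned query/key/value weights, and all function names are assumptions made for illustration. It also assumes the two feature streams are already aligned to the same number of frames.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_concat(a, b):
    """Concatenation fusion: stack the two feature vectors per frame."""
    return np.concatenate([a, b], axis=-1)


def fuse_add(a, b):
    """Addition fusion: element-wise sum (requires equal feature width)."""
    return a + b


def fuse_cross_attention(query, context):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Each query frame attends over all context frames; learned Q/K/V
    projections are omitted to keep the sketch short.
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (T, T) frame-to-frame scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ context                  # (T, d)


# Hypothetical sizes: T frames, 39-dim MFCCs, 768-dim PTM features.
rng = np.random.default_rng(0)
T, d_mfcc, d_ptm, d = 50, 39, 768, 64
mfcc = rng.standard_normal((T, d_mfcc))
ptm = rng.standard_normal((T, d_ptm))

# Project both streams to a shared width so addition and attention are defined.
w_mfcc = rng.standard_normal((d_mfcc, d)) / np.sqrt(d_mfcc)
w_ptm = rng.standard_normal((d_ptm, d)) / np.sqrt(d_ptm)
m = mfcc @ w_mfcc
p = ptm @ w_ptm

fused_concat = fuse_concat(m, p)        # shape (T, 2 * d)
fused_add = fuse_add(m, p)              # shape (T, d)
fused_ca = fuse_cross_attention(m, p)   # shape (T, d)
```

In this sketch, addition and concatenation cost time linear in the number of frames, while cross-attention builds a T-by-T score matrix, which is one plausible reason the simpler strategies are cheaper to run.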

Subject: INTERSPEECH.2025 - Speech Detection

tripathi25@interspeech_2025@ISCA