li24sa@interspeech_2024@ISCA

Total: 1

#1 Robust Voice Activity Detection using Locality-Sensitive Hashing and Residual Frequency-Temporal Attention

Authors: Shu Li; Peng Zhang; Ye Li

For voice activity detection (VAD), recent works focus on learning the attention distribution over contextual speech information to reduce the impact of irrelevant noise. However, contextual frames selected at fixed steps may not be relevant, and these attention mechanisms cannot fully discover the structure and characteristics of speech. In this paper, we explore a self-attention-inspired locality-sensitive hashing algorithm (SALSH) for dynamic and efficient contextual frame selection, which enriches the frame-level features into a 2D partial spectrogram. We then propose a residual frequency-temporal attention model (FTAM) for VAD, consisting of an interval branch, an analogous hourglass structure with channel attention, and an attention learning mechanism for speech based on frequency-temporal attention. On the LibriSpeech and TIMIT datasets, the proposed method outperforms competing methods in terms of area under the curve (AUC), even at an extremely low signal-to-noise ratio of -15 dB.
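The abstract describes selecting contextual frames via locality-sensitive hashing to build a 2D partial spectrogram around each frame. The paper's SALSH algorithm is not specified here, so the following is only a minimal sketch of the general idea using generic random-hyperplane LSH; all function names and parameters (`lsh_hash`, `select_context`, `n_bits`, `k`) are illustrative assumptions, not the authors' method.

```python
import numpy as np

def lsh_hash(features, n_bits=8, seed=0):
    """Hash each frame's feature vector into an integer bucket via
    random-hyperplane (sign) LSH. features: (T, d) array."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    planes = rng.standard_normal((d, n_bits))
    bits = (features @ planes) > 0          # (T, n_bits) sign pattern
    weights = 1 << np.arange(n_bits)        # pack bits into a bucket id
    return bits @ weights                   # (T,) bucket id per frame

def select_context(features, t, k=16, n_bits=8):
    """For frame t, pick up to k frames whose LSH bucket matches frame t's,
    preferring the temporally nearest ones, and stack them as a 2D
    partial spectrogram (a hypothetical stand-in for SALSH)."""
    buckets = lsh_hash(features, n_bits)
    candidates = np.flatnonzero(buckets == buckets[t])  # always includes t
    order = np.argsort(np.abs(candidates - t))          # nearest in time first
    chosen = candidates[order][:k]
    return features[np.sort(chosen)]                    # (<=k, d)

# Toy usage: 100 frames of 40-dim log-mel-like features.
feats = np.random.default_rng(1).standard_normal((100, 40))
ctx = select_context(feats, t=50, k=16)
```

Hashing similar frames into the same bucket lets context be gathered by feature similarity rather than at fixed temporal strides, which matches the motivation stated in the abstract.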