Speech Enhancement with Weakly Labelled Data from AudioSet


Authors: Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang

Speech enhancement is the task of improving the intelligibility and perceptual quality of degraded speech signals. Recently, neural network-based methods have been applied to speech enhancement. However, many of these methods require users to collect clean speech and background noise for training, which can be time-consuming. In addition, speech enhancement systems trained on particular types of background noise may not generalize well to a wide range of noise. To tackle these problems, we propose a speech enhancement framework trained on weakly labelled data. We first apply a pretrained sound event detection system to detect anchor segments that contain sound events in audio clips. Then, we randomly mix two detected anchor segments to form a mixture. We build a conditional source separation network that takes the mixture and a conditional vector as input. The conditional vector is obtained from the audio tagging predictions on the anchor segments. At inference time, we input a noisy speech signal, together with the one-hot encoding of “Speech” as the condition, to the trained system to predict enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system, which achieves 2.16 and 7.73 dB, respectively.
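The data construction described in the abstract (mix two anchor segments, condition on tagging predictions, use a one-hot “Speech” vector at inference) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the choice of one anchor segment as the separation target, the class count `NUM_CLASSES`, and the index `SPEECH_INDEX` are all assumptions made for illustration, and plain Python lists stand in for audio arrays.

```python
NUM_CLASSES = 527  # assumed size of the tag vocabulary (illustrative)
SPEECH_INDEX = 0   # hypothetical index of the "Speech" class


def mix_segments(seg_a, seg_b):
    """Sum two equal-length anchor segments sample by sample."""
    if len(seg_a) != len(seg_b):
        raise ValueError("anchor segments must have equal length")
    return [a + b for a, b in zip(seg_a, seg_b)]


def one_hot(index, num_classes=NUM_CLASSES):
    """Build a one-hot condition vector, e.g. for "Speech" at inference."""
    if not 0 <= index < num_classes:
        raise ValueError("class index out of range")
    cond = [0.0] * num_classes
    cond[index] = 1.0
    return cond


def make_training_example(seg_a, tags_a, seg_b):
    """Return (mixture, condition, target) for one training step.

    Assumption: the network is asked to recover seg_a from the mixture,
    conditioned on the tagging predictions tags_a computed on seg_a.
    """
    mixture = mix_segments(seg_a, seg_b)
    return mixture, list(tags_a), list(seg_a)
```

Under these assumptions, a training step would feed `mixture` and `condition` to the separation network and compare its output against `target`, while inference would pass the noisy speech together with `one_hot(SPEECH_INDEX)`.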

Subject: INTERSPEECH.2021 - Speech Processing

kong21@interspeech_2021@ISCA
