choi23b@interspeech_2023@ISCA

Total: 1

#1 Resolution Consistency Training on Time-Frequency Domain for Semi-Supervised Sound Event Detection [PDF] [Copy] [Kimi1]

Authors: Won-Gook Choi ; Joon-Hyuk Chang

The fact that unlabeled data can be used for supervised learning is of considerable relevance concerning polyphonic sound event detection (PSED) because of the high costs of frame-wise labeling. While semi-supervised learning (SSL) for image tasks has been extensively developed, SSL for PSED has not been substantially explored due to data augmentation limitations. In this paper, we propose a novel SSL strategy for PSED called resolution consistency training (ResCT), combining unsupervised terms with the mean teacher using different resolutions of a spectrogram for data augmentation. The proposed method regularizes the consistency between the model predictions for different resolutions by controlling the sampling rate and window size. Experimental results show that ResCT outperforms other SSL methods on various evaluation metrics: event-f1 score, intersection-f1 score, and PSDSs. Finally, we report on some ablation studies for the weak and strong augmentation policies.