Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification

Authors: Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu

Deep Convolutional Neural Network (CNN) based speaker embeddings, such as r-vectors, have shown great success in the text-independent speaker verification (TI-SV) task. However, previous deep CNN models usually use fixed-length samples for training but extract speaker embeddings from variable-length utterances, which creates a mismatch between training and embedding. To address this issue, we investigate the effect of employing variable-length training samples on CNN-based TI-SV systems and explore two approaches to improve the performance of deep CNN architectures on TI-SV by capturing variable-term contexts. First, we present an improved selective kernel convolution which allows the networks to adaptively switch between short-term and long-term contexts based on variable-length utterances. Second, we propose a multi-scale statistics pooling method to aggregate multiple time-scale features from different layers of the networks. We build a novel ResNet34-based architecture with the two proposed approaches. Experiments are conducted on the VoxCeleb datasets. The results demonstrate that the effect of using variable-length samples varies across networks and that the architecture with the two proposed approaches achieves significant improvement over the r-vector baseline system.
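The abstract does not spell out how the multi-scale statistics pooling is implemented, so the following is only a minimal sketch of the general idea under stated assumptions: each layer's feature map is reduced to a per-channel mean and standard deviation over time, and the results from several layers are concatenated. The function names, the channel counts, and the frame-halving between layers are all hypothetical and are not taken from the paper. The sketch illustrates why such pooling yields a fixed-length embedding regardless of utterance length.

```python
import math
import random


def statistics_pooling(feature_map):
    """Reduce one feature map to a fixed-length vector.

    feature_map: list of channels, each a list of per-frame values
    (an assumed layout, chosen for illustration).
    Returns the per-channel means followed by the per-channel
    (population) standard deviations.
    """
    means, stds = [], []
    for channel in feature_map:
        n = len(channel)
        mu = sum(channel) / n
        var = sum((x - mu) ** 2 for x in channel) / n
        means.append(mu)
        stds.append(math.sqrt(var))
    return means + stds


def multi_scale_statistics_pooling(feature_maps):
    """Pool each layer's feature map separately and concatenate the results."""
    pooled = []
    for feature_map in feature_maps:
        pooled.extend(statistics_pooling(feature_map))
    return pooled


def fake_feature_maps(num_frames, seed=0):
    """Random stand-ins for the outputs of three network layers.

    The shapes (4, 8, 16 channels; frames halved per layer) are
    hypothetical placeholders, not the paper's configuration.
    """
    rng = random.Random(seed)
    shapes = [(4, num_frames), (8, num_frames // 2), (16, num_frames // 4)]
    return [[[rng.gauss(0, 1) for _ in range(frames)] for _ in range(channels)]
            for channels, frames in shapes]


# Utterances of different lengths map to embeddings of the same size.
short = multi_scale_statistics_pooling(fake_feature_maps(200))
long = multi_scale_statistics_pooling(fake_feature_maps(400))
print(len(short), len(long))  # 56 56
```

The output length depends only on the total channel count, here 2 * (4 + 8 + 16) = 56, which is what allows both training samples and test utterances to vary in duration.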

Subject: INTERSPEECH.2021 - Others

wu21@interspeech_2021@ISCA
