zhang21@interspeech_2021@ISCA

Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods

Authors: Yu-Jia Zhang ; Yih-Wen Wang ; Chia-Ping Chen ; Chung-Li Lu ; Bo-Cheng Chan

In this paper, we develop a system that integrates multiple ideas and techniques inspired by convolutional block and feature aggregation methods. We begin with the state-of-the-art speaker-embedding model for speaker recognition, namely Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN), and then gradually experiment with the proposed network modules, including bottleneck residual blocks, attention mechanisms, and feature aggregation methods. In our final model, we replace the Res2Block with the SC-Block and use a hierarchical architecture for feature aggregation. We evaluate the performance of our model on the VoxCeleb1 test set and the 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC20) validation set. The relative improvement of the proposed models over ECAPA-TDNN is 22.8% on VoxCeleb1 and 18.2% on VoxSRC20.
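
The relative improvements quoted above can be read as a percentage reduction with respect to the baseline. The sketch below shows that arithmetic under the assumption of a lower-is-better error metric; the abstract does not name the metric, and the function name `relative_improvement` and the input values are hypothetical, chosen only for illustration and not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Percentage reduction of a lower-is-better metric relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical error values for illustration only; not results from the paper.
baseline_error = 2.0
proposed_error = 1.5
print(round(relative_improvement(baseline_error, proposed_error), 1))  # prints 25.0
```

Under this reading, a figure such as 22.8% would mean that the proposed model's error is 22.8% lower than the baseline's error on the same test set, not 22.8 percentage points lower.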