soni19@interspeech_2019@ISCA

#1 Label Driven Time-Frequency Masking for Robust Continuous Speech Recognition

Authors: Meet Soni ; Ashish Panda

Time-Frequency (T-F) masking based approaches to Automatic Speech Recognition have been shown to provide significant gains in system performance in the presence of additive noise. Such approaches improve performance when the T-F masking front-end is trained jointly with the acoustic model. However, such systems still rely on a pre-trained T-F masking enhancement block, trained using pairs of clean and noisy speech signals. Pre-training is necessary due to the large number of parameters associated with the enhancement network. In this paper, we propose flat-start joint training of a network that has both a T-F masking based enhancement block and a phoneme classification block. In particular, we use a fully convolutional network as the enhancement front-end to reduce the number of parameters. We train the network by jointly updating the parameters of both blocks using tied Context-Dependent phoneme states as targets. We observe that pre-training of the proposed enhancement block is not necessary for convergence. In fact, the proposed flat-start joint training converges faster than the baseline multi-condition trained model. Experiments performed on the Aurora-4 database show a 7.06% relative improvement over the multi-condition trained baseline. We obtain similar improvements for unseen test conditions as well.
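
The label-driven idea in the abstract can be illustrated with a minimal sketch of a single forward pass: a T-F mask scales each frequency bin of a noisy frame, the masked frame goes to a classifier, and the only loss is cross-entropy against a phoneme-state label, with no clean-speech target involved. This is not the authors' implementation. The function names, the sigmoid mask, the linear classifier standing in for the phoneme classification block, and all toy values are assumptions made for illustration; in the paper the mask would come from the fully convolutional front-end rather than from fixed numbers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_tf_mask(noisy_frame, mask_logits):
    """Scale each frequency bin by a mask value in (0, 1).

    mask_logits are fixed numbers here (assumption); in a real system
    they would be predicted from the noisy input by the enhancement block.
    """
    return [x * sigmoid(m) for x, m in zip(noisy_frame, mask_logits)]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(frame, weights):
    """Hypothetical linear stand-in for the phoneme classification block.

    weights holds one row per output state.
    """
    scores = [sum(w * x for w, x in zip(row, frame)) for row in weights]
    return softmax(scores)


def joint_loss(noisy_frame, mask_logits, weights, target_state):
    """Cross-entropy on the state label only.

    Because the loss depends on both the mask and the classifier weights,
    its gradient would update both blocks together in joint training.
    """
    enhanced = apply_tf_mask(noisy_frame, mask_logits)
    posteriors = classify(enhanced, weights)
    return -math.log(posteriors[target_state])


# Toy values, purely illustrative (not data from the paper).
noisy_frame = [0.9, 0.1, 0.4]        # three frequency bins of one frame
mask_logits = [2.0, -2.0, 0.0]       # keep bin 0, suppress bin 1, halve bin 2
weights = [[1.0, 0.0, 0.0],          # state 0 responds to bin 0
           [0.0, 1.0, 0.0]]          # state 1 responds to bin 1

enhanced = apply_tf_mask(noisy_frame, mask_logits)
posteriors = classify(enhanced, weights)
loss = joint_loss(noisy_frame, mask_logits, weights, target_state=0)
```

In this sketch no clean reference signal appears anywhere: the mask is shaped solely by how well the masked features predict the target state, which is the property that makes flat-start training without a pre-trained enhancement block possible in principle.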