Improved Single System Conversational Telephone Speech Recognition with VGG Bottleneck Features

Authors: William Hartmann, Roger Hsiao, Tim Ng, Jeff Ma, Francis Keith, Man-Hung Siu

On small datasets, discriminatively trained bottleneck features from deep networks commonly outperform more traditional spectral or cepstral features. While these features are typically trained with small, fully-connected networks, recent studies have used more sophisticated networks with great success. We use the recent deep CNN (VGG) network for bottleneck feature extraction, previously used only for low-resource tasks, and apply it to the Switchboard English conversational telephone speech task. Unlike features derived from traditional MLP networks, the VGG features outperform cepstral features even when used with BLSTM acoustic models trained on large amounts of data. We achieve the best BBN single system performance when combining the VGG features with a BLSTM acoustic model. When decoding with an n-gram language model, the kind used in deployable systems, we have a realistic production system with a word error rate (WER) of 7.4%. This result is competitive with the current state of the art in the literature. While our focus is on realistic single system performance, we further reduce the WER to 6.1% through system combination and expensive neural network language model rescoring.

Subject: INTERSPEECH.2017 - Speech Recognition

hartmann17@interspeech_2017@ISCA
