fathullah23@interspeech_2023@ISCA

Total: 1

#1 Multi-Head State Space Model for Speech Recognition

Authors: Yassir Fathullah; Chunyang Wu; Yuan Shangguan; Junteng Jia; Wenhan Xiong; Jay Mahadeokar; Chunxi Liu; Yangyang Shi; Ozlem Kalinli; Mike Seltzer; Mark J. F. Gales

State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, where parallel heads are taught to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, referred to as the Stateformer, achieving state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76%/4.37% on the development and 1.91%/4.36% on the test sets without using an external language model.
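To make the "drop-in replacement for multi-head attention" idea concrete, the sketch below shows a toy multi-head state-space layer with output gating that consumes and produces (batch, time, d_model) tensors, exactly like a self-attention sublayer. The class name `MultiHeadSSM`, the diagonal decay recurrence, and the sigmoid gating form are illustrative assumptions for exposition only, not the authors' MH-SSM implementation.

```python
# Illustrative sketch only: a simplified multi-head state-space layer with output
# gating, shaped so it could stand in for multi-head self-attention in an encoder.
# The diagonal recurrence, head splitting, and gating form are assumptions, not
# the MH-SSM architecture from the paper.
import torch
import torch.nn as nn

class MultiHeadSSM(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_state: int = 16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Per-head diagonal state transition, stored as a log-decay for stability.
        self.log_decay = nn.Parameter(torch.randn(n_heads, d_state) * 0.1 - 1.0)
        # Per-head input (B) and output (C) projections of the state-space recurrence.
        self.B = nn.Parameter(torch.randn(n_heads, self.d_head, d_state) * 0.02)
        self.C = nn.Parameter(torch.randn(n_heads, d_state, self.d_head) * 0.02)
        # Output gate computed from the layer input, plus a final projection.
        self.gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model), same interface as a self-attention sublayer.
        b, t, _ = x.shape
        u = x.view(b, t, self.n_heads, self.d_head)           # split into heads
        decay = torch.exp(-torch.exp(self.log_decay))         # (heads, d_state), in (0, 1)
        state = x.new_zeros(b, self.n_heads, self.log_decay.shape[-1])
        outputs = []
        for step in range(t):                                  # sequential linear recurrence
            drive = torch.einsum('bhd,hds->bhs', u[:, step], self.B)
            state = state * decay.unsqueeze(0) + drive         # x_k = A x_{k-1} + B u_k
            y = torch.einsum('bhs,hsd->bhd', state, self.C)    # y_k = C x_k
            outputs.append(y.reshape(b, -1))
        y = torch.stack(outputs, dim=1)                        # (batch, time, d_model)
        # Sigmoid gate from the input modulates the per-head SSM outputs.
        return self.out_proj(torch.sigmoid(self.gate(x)) * y)

# Example usage: stands in for the attention sublayer of a transformer encoder block.
if __name__ == "__main__":
    layer = MultiHeadSSM(d_model=256, n_heads=4)
    feats = torch.randn(2, 100, 256)   # (batch, frames, features)
    print(layer(feats).shape)          # torch.Size([2, 100, 256])
```

A practical SSM layer would typically replace the Python-level time loop with a convolutional or parallel-scan formulation; the explicit recurrence is kept here only to make the state update readable.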