zhang24q@interspeech_2024@ISCA

Total: 1

#1 Transmitted and Aggregated Self-Attention for Automatic Speech Recognition

Authors: Tian-Hao Zhang; Xinyuan Qian; Feng Chen; Xu-Cheng Yin

Transformer-based models have recently achieved outstanding progress in ASR systems. Attention maps are generated in self-attention to capture temporal relationships among input tokens and heavily influence transformer performance. Many works demonstrate that the attention maps of different layers incorporate information at various contextual scopes. We believe that the information from these diverse attention maps is valuable and complementary. This inspires a novel proposal, namely Transmitted and Aggregated Self-Attention (TASA), which leverages the attention-map information of each layer to improve overall performance. In particular, we design Residual-TASA and Dense-TASA, which are distinguished by using the attention maps of the previous layer or of all previous layers, respectively. Extensive experiments demonstrate that the proposed method achieves up to 10.62% relative CER reduction on the AISHELL-1 dataset and 7.36% relative WER reduction on the LibriSpeech dataset.
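The abstract only states that attention maps are transmitted from earlier layers and aggregated with the current layer's map; it does not give the fusion formula. Below is a minimal PyTorch sketch of that idea under an assumed design: a single-head self-attention layer whose score matrix is mixed with the transmitted maps by a learnable weighted sum (the class name `TASASelfAttention`, the `alpha` parameter, and the averaging used for Dense-TASA are illustrative assumptions, not the paper's actual formulation).

```python
# Minimal sketch of the TASA idea (assumed fusion: learnable weighted sum of the
# current attention map with maps transmitted from earlier layers). Residual-TASA
# reuses only the previous layer's map; Dense-TASA aggregates all previous maps.
import torch
import torch.nn as nn


class TASASelfAttention(nn.Module):
    """Single-head self-attention that reuses attention maps from prior layers."""

    def __init__(self, d_model: int, mode: str = "residual"):
        super().__init__()
        assert mode in ("residual", "dense")
        self.mode = mode
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        # Assumed learnable mixing weight between current and transmitted maps.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x, prev_maps):
        # x: (batch, time, d_model); prev_maps: list of (batch, time, time) maps.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)

        if prev_maps:
            if self.mode == "residual":       # Residual-TASA: previous layer only
                transmitted = prev_maps[-1]
            else:                             # Dense-TASA: all previous layers (here: averaged)
                transmitted = torch.stack(prev_maps).mean(dim=0)
            attn = self.alpha * attn + (1 - self.alpha) * transmitted

        return self.out(attn @ v), attn


if __name__ == "__main__":
    layers = nn.ModuleList(TASASelfAttention(64, mode="dense") for _ in range(4))
    x = torch.randn(2, 50, 64)                # (batch, frames, features)
    maps = []
    for layer in layers:
        x, attn_map = layer(x, maps)
        maps.append(attn_map)                 # transmit this layer's map onward
    print(x.shape)                            # torch.Size([2, 50, 64])
```

The driver loop shows the "transmission" part: each layer's attention map is appended to a list that later layers receive, so later layers can aggregate earlier contextual scopes as the abstract describes.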