Unified Multi-Talker ASR with and without Target-speaker Enrollment

Authors: Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Mana Ihori, Naotaka Kawata, Shota Orihashi, Kazutoshi Shinoda, Taiga Yamane, Saki Mizuno, Keita Suzuki, Satoshi Suzuki, Nobukatsu Hojo, Takafumi Moriya, Atsushi Ando

This paper proposes a novel multi-talker automatic speech recognition (MT-ASR) system that can perform both a target-speaker enrollment-driven process and a target-speaker-free process in a unified modeling framework. In previous studies, these two MT-ASR forms were independently modeled with unshareable parameters. However, the independent modeling cannot mutually utilize knowledge trained with different tasks. Our key idea for bridging the gap between the two forms is to introduce modeling that can regard the target-speaker-free process as the target-speaker enrollment-driven process enrolled with no target-speaker information. Therefore, our method constructs a unified autoregressive model with a removable target-speaker encoder, and its shareable model parameters are trained jointly using training datasets with and without target-speaker enrollment. Experiments demonstrated that our unified modeling significantly outperforms the independent modeling in both MT-ASR forms.

Subject: INTERSPEECH.2024 - Speech Recognition

masumura24@interspeech_2024@ISCA
