huang24d@interspeech_2024@ISCA

On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization

Authors: Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia, Quan Wang

While standard speaker diarization attempts to answer the question "who spoke when", many realistic applications are interested in determining "who spoke what". In both the conventional modularized approach and the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate speakers with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same architecture by sharing blank logits. This framework makes it easy to add diarization capabilities to any existing RNN-T based ASR model without Word Error Rate (WER) regressions. Experimental results demonstrate that WEEND outperforms a strong turn-based diarization baseline system on all 2-speaker short-form scenarios, with the capability to generalize to audio lengths of 5 minutes.
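
The abstract does not spell out how the blank logits are shared, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the ASR joint network emits a logit vector with the blank at index 0, that an auxiliary head emits one logit per speaker, and that the diarization distribution is formed by reusing the ASR blank logit alongside the speaker logits. The names `shared_blank_outputs` and `blank_index` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def shared_blank_outputs(asr_logits, speaker_logits, blank_index=0):
    """Return (ASR log-probs, diarization log-probs) for one lattice point.

    Assumption: the diarization head does not predict its own blank.
    It reuses the ASR blank logit, so both outputs are driven by the
    same blank score at this point.
    """
    # Diarization logits: [shared blank, speaker 1, ..., speaker K].
    diar_logits = [asr_logits[blank_index]] + list(speaker_logits)
    # The ASR distribution is computed from the ASR logits alone.
    return log_softmax(asr_logits), log_softmax(diar_logits)


# Toy values: blank plus three word-piece logits, and two speaker logits.
asr_logits = [2.0, 0.5, -1.0, 0.3]
speaker_logits = [1.2, -0.4]
asr_log_probs, diar_log_probs = shared_blank_outputs(asr_logits, speaker_logits)
```

In this sketch the ASR log-probabilities depend only on the ASR logits, so attaching the auxiliary speaker head leaves them unchanged. That is in the spirit of the abstract's claim about avoiding WER regressions, although the paper's actual training setup may differ.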