harkonen24@interspeech_2024@ISCA

#1 EEND-M2F: Masked-attention mask transformers for speaker diarization

Authors: Marc Härkönen, Samuel J. Broughton, Lahiru Samarakoon

In this paper, we make explicit the connection between image segmentation methods and end-to-end diarization methods. From these insights, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel using a stack of transformer decoders, in which irrelevant frames are explicitly masked from the cross-attention using predictions from previous layers. EEND-M2F is efficient and truly end-to-end, eliminating the need for additional segmentation models or clustering algorithms. Our model achieves state-of-the-art performance on several public datasets, such as AMI, AliMeeting, and RAMC. Most notably, our DER of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
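
The masked cross-attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `masked_cross_attention`, the plain-list data layout, the omission of learned query/key/value projections and multiple heads, and the fallback to full attention when a mask is empty (a convention borrowed from Mask2Former) are all assumptions made for illustration. Each speaker query attends only to the frames that the previous layer predicted as active for that speaker.

```python
import math


def masked_cross_attention(queries, frames, prev_mask_logits):
    """Single-head masked cross-attention (illustrative sketch only).

    queries:          Q vectors of length D, one per speaker query.
    frames:           T vectors of length D, the acoustic frame embeddings.
    prev_mask_logits: Q x T speaker-activity logits from the previous layer.
    Returns Q updated query vectors of length D.
    Assumption: learned projections are omitted to keep the sketch short.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query, logits in zip(queries, prev_mask_logits):
        # sigmoid(z) >= 0.5 is equivalent to z >= 0, so threshold the logits.
        keep = [z >= 0.0 for z in logits]
        # Assumption (Mask2Former convention): an empty mask falls back to
        # attending over every frame, so the softmax is never empty.
        if not any(keep):
            keep = [True] * len(frames)
        # Masked frames get a score of -inf and therefore zero weight.
        scores = [
            scale * sum(a * b for a, b in zip(query, frame)) if k else -math.inf
            for frame, k in zip(frames, keep)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the frame embeddings becomes the new query.
        updated.append(
            [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
        )
    return updated


# Toy input: two speaker queries, three frames of dimension 2.
example_queries = [[1.0, 0.0], [0.0, 1.0]]
example_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Previous layer predicts speaker 0 active on frame 0, speaker 1 on frame 1.
example_logits = [[5.0, -5.0, -5.0], [-5.0, 5.0, -5.0]]
example_output = masked_cross_attention(example_queries, example_frames, example_logits)
```

In a stacked decoder, the updated queries would in turn produce new per-frame activity logits that serve as the mask for the next layer, so each layer narrows its attention to the frames most relevant to its speaker.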