Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-attended Speaker Representations

Authors: Shucong Zhang, Malcolm Chadwick, Alberto Gil C. P. Ramos, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya

Personalised speech enhancement (PSE) extracts only the speech of a target user and removes everything else from corrupted input audio. This can greatly improve on-device streaming audio processing, such as voice calls and speech recognition, which has strict requirements on model size and latency. To focus the PSE system on the target speaker, it is conditioned on a recording of the user's voice. This recording is usually summarised as a single static vector. However, a static vector cannot reflect all the target user's voice characteristics. Thus, we propose using the full recording. To condition on such a variable-length sequence, we propose fully Transformer-based PSE models with a cross-attention mechanism which generates target speaker representations dynamically. To better reflect the on-device scenario, we carefully design and publish a new PSE dataset. On the dataset, our proposed model significantly surpasses strong baselines while halving the model size and reducing latency.
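
The cross-attention conditioning described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `cross_attend`, the single attention head, the random projection matrices, and all dimensions are assumptions made for illustration only. It shows the general idea that each frame of the noisy input queries the full, variable-length enrolment recording, so every frame gets its own speaker representation instead of sharing one static vector.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(noisy_frames, enrol_frames, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    noisy_frames: (T, d_in) features of the corrupted input audio.
    enrol_frames: (S, d_in) features of the enrolment recording; S may vary.
    Returns one speaker representation per input frame, shape (T, d),
    and the attention weights, shape (T, S).
    """
    q = noisy_frames @ w_q                    # queries from the noisy input
    k = enrol_frames @ w_k                    # keys from the enrolment audio
    v = enrol_frames @ w_v                    # values from the enrolment audio
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, S) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: 5 noisy frames, 12 enrolment frames, 8-dim features.
rng = np.random.default_rng(0)
d_in, d = 8, 4
w_q, w_k, w_v = (rng.standard_normal((d_in, d)) for _ in range(3))
noisy = rng.standard_normal((5, d_in))
enrol = rng.standard_normal((12, d_in))
speaker_repr, weights = cross_attend(noisy, enrol, w_q, w_k, w_v)
```

Because the attention weights are recomputed for every input frame, the enrolment recording can have any length, and the resulting representation changes over time with the input, which is the contrast with a single static speaker vector.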

Subject: INTERSPEECH.2023 - Speech Processing

zhang23r@interspeech_2023@ISCA
