dou21@interspeech_2021@ISCA

Total: 1

#1 Deliberation-Based Multi-Pass Speech Synthesis

Authors: Qingyun Dou; Xixin Wu; Moquan Wan; Yiting Lu; Mark J.F. Gales

Sequence-to-sequence (seq2seq) models have achieved state-of-the-art performance in a wide range of tasks, including Neural Machine Translation (NMT) and Text-To-Speech (TTS). These models are usually trained with teacher forcing, where the reference back-history is used to predict the next token. This makes training efficient, but limits performance, because at inference the free-running back-history must be used instead. To address this problem, deliberation-based multi-pass seq2seq has been used in NMT: the output sequence is generated in multiple passes, each conditioned on the initial input and the free-running output of the previous pass. This paper investigates and compares deliberation-based multi-pass seq2seq for TTS and NMT. For NMT, the simplest form of the multi-pass approach, where the free-running first-pass output is combined with the initial input, improves performance. Applying this scheme to TTS, however, is challenging: the multi-pass model tends to converge to the standard single-pass model, ignoring the previous output. To tackle this issue, a guided attention loss is added, enabling the system to make more extensive use of the free-running output. Experimental results confirm this analysis and demonstrate that the proposed TTS model outperforms a strong baseline.
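As a rough illustration of the scheme the abstract describes, the sketch below (PyTorch assumed) pairs a free-running first pass with a second, deliberation pass that attends over the first-pass output, and adds a diagonal-prior guided attention loss so the second pass cannot simply collapse into a single-pass model that ignores that output. All module names, layer sizes, and the exact form of the loss (a diagonal prior commonly used in TTS, after Tachibana et al., 2018) are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of deliberation-based two-pass seq2seq with a guided
# attention loss. Everything here (module choices, sizes, loss form) is an
# illustrative assumption, not the paper's exact setup.
import torch
import torch.nn as nn

def guided_attention_loss(attn, g=0.2):
    # attn: (batch, T_out, T_in) attention of the second pass over the
    # first-pass output. Penalising off-diagonal mass encourages the
    # second pass to actually use the first-pass output rather than
    # reverting to standard single-pass behaviour.
    _, T_out, T_in = attn.shape
    n = torch.arange(T_out, dtype=attn.dtype).unsqueeze(1) / T_out
    t = torch.arange(T_in, dtype=attn.dtype).unsqueeze(0) / T_in
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g * g))
    return (attn * w).mean()

class TwoPassModel(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.pass1 = nn.GRU(d, d, batch_first=True)      # first-pass decoder
        self.attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.pass2 = nn.GRU(2 * d, d, batch_first=True)  # deliberation decoder

    def forward(self, x):
        enc, _ = self.encoder(x)
        y1, _ = self.pass1(enc)            # stands in for free-running output
        # Second pass conditions on both the encoder states and y1,
        # attending over y1 to build a deliberation context.
        ctx, attn = self.attn(enc, y1, y1)
        y2, _ = self.pass2(torch.cat([enc, ctx], dim=-1))
        return y2, attn

model = TwoPassModel()
x = torch.randn(2, 50, 64)                 # dummy input feature sequence
y2, attn = model(x)
loss_ga = guided_attention_loss(attn)      # added to the main training loss
```

In training, `loss_ga` would be weighted and added to the usual reconstruction loss; the diagonal prior is what keeps the second-pass attention over the first-pass output from degenerating to near-uniform weights.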