An Efficient and High Fidelity Vietnamese Streaming End-to-End Speech Synthesis


Authors: Tho Nguyen Duc Tran, The Chuong Chu, Vu Hoang, Trung Huu Bui, Hung Quoc Truong

In recent years, parallel end-to-end speech synthesis systems have outperformed two-stage TTS approaches in audio quality and latency. A parallel end-to-end speech synthesis model such as VITS can generate audio with a high MOS comparable to ground truth and achieve low latency on GPU. However, VITS still has high latency when synthesizing long utterances on CPUs. Therefore, in this paper, we propose a streaming method that allows parallel speech synthesis models such as VITS to synthesize long texts efficiently on CPU. Our system achieves human-like speech quality in both non-streaming and streaming modes on an in-house Vietnamese evaluation set, while in non-streaming mode its synthesis speed is approximately four times that of VITS. Furthermore, the customer-perceived latency of our system in streaming mode is 25 times lower than that of VITS on CPU. In non-streaming mode, our system achieves a MOS of 4.43, compared with 4.56 for ground truth; in streaming mode it still produces high-quality speech, with a MOS of 4.35. Finally, we release the single-accent Vietnamese dataset used in our experiments.

Subject: INTERSPEECH.2022 - Speech Synthesis

tran22@interspeech_2022@ISCA
