Total: 1
In this paper, we present EfficientSing, a Chinese singing voice synthesis (SVS) system based on a non-autoregressive duration-free acoustic model and HiFi-GAN neural vocoder. Different from many existing SVS methods, no auxiliary duration prediction module is needed in this work, since a newly proposed monotonic alignment modeling mechanism is adopted. Moreover, we follow the non-autoregressive architecture of EfficientTTS with some singing-specific adaption, making training and inference fully parallel and efficient. HiFi-GAN vocoder is adopted to improve the voice quality of synthesized songs and inference efficiency. Both objective and subjective experimental results show that the proposed system can produce quite natural and high-fidelity songs and outperform the Tacotron-based baseline in terms of pronunciation, pitch and rhythm.