ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Authors: Yulin Song ; Guorui Sang ; Jing Yu ; Chuangbai Xiao

A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from a given music score (lyrics, duration and pitch). Diffusion models have recently performed well in this field, but they trade inference speed for sample quality, which limits their application scenarios. To obtain high-quality synthetic singing voice more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
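
To make the idea of a consistency constraint concrete, here is a minimal sketch of the generic consistency-model formulation, not ConSinger's actual architecture or training code. It assumes the common skip-connection parameterization, in which the model reduces to the identity at the smallest noise level. The constants `SIGMA_DATA` and `EPS`, the `toy_network` stand-in, and the (frames, 80) mel-spectrogram shape used in the usage note are illustrative assumptions.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed data standard deviation (illustrative)
EPS = 0.002       # assumed smallest noise level (illustrative)


def c_skip(t):
    # Weight on the noisy input; equals 1 at t = EPS.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    # Weight on the network output; equals 0 at t = EPS.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


def toy_network(x, t):
    # Hypothetical stand-in for a trained, score-conditioned denoiser.
    return np.tanh(x) / (1.0 + t)


def consistency_fn(x, t, network=toy_network):
    # Maps a noisy sample at noise level t to an estimate of the clean sample.
    return c_skip(t) * x + c_out(t) * network(x, t)


def consistency_loss(x0, t_lo, t_hi, rng, network=toy_network):
    # The same noise is applied at two adjacent noise levels; the constraint
    # asks both points to map to the same clean estimate.
    z = rng.standard_normal(x0.shape)
    pred_hi = consistency_fn(x0 + t_hi * z, t_hi, network)
    # In real training this target would come from a frozen copy of the model.
    pred_lo = consistency_fn(x0 + t_lo * z, t_lo, network)
    return float(np.mean((pred_hi - pred_lo) ** 2))


def one_step_sample(shape, t_max, rng, network=toy_network):
    # Single-step generation: draw noise at the largest level, map it once.
    x_T = t_max * rng.standard_normal(shape)
    return consistency_fn(x_T, t_max, network)
```

Under these assumptions, `one_step_sample((100, 80), 80.0, np.random.default_rng(0))` would produce a 100-frame, 80-bin array in one network evaluation, which is where the speed advantage over multi-step diffusion sampling comes from.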

Subjects: Sound ; Machine Learning ; Audio and Speech Processing

Published: 2024-10-20 09:32:03 UTC

arXiv: 2410.15342