Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values

Authors: Olivier Perrotin, Hussein El Amouri, Gérard Bailly, Thomas Hueber

Neural vocoders are systematically evaluated on homogeneous train and test databases. This kind of evaluation is effective for comparing neural vocoders in their “comfort zone”, yet it hardly reveals their limits on data unseen during training. To compare their extrapolation capabilities, we introduce a methodology that quantifies the robustness of neural vocoders in synthesising unseen data by precisely controlling the ranges of seen and unseen data in the training database. Focusing in this study on the pitch (F0) parameter, our methodology involves carefully splitting a dataset to control which F0 values are seen or unseen during training, followed by both global (utterance-level) and local (frame-level) evaluation of the vocoders. A comparison of four types of vocoders (autoregressive, source-filter, flow-based, GAN) reveals a wide range of behaviours towards unseen input pitch values, including excellent extrapolation (WaveGlow), widely spread F0 errors (WaveRNN), and systematic generation of the training-set median F0 (LPCNet, Parallel WaveGAN). In contrast, fewer differences between vocoders were observed when using homogeneous train and test sets, demonstrating the potential of and need for such an evaluation to better discriminate neural vocoders' abilities to generate out-of-training-range data.
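
The abstract does not specify how the split or the F0 error is computed, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes each utterance is a list of per-frame F0 values in Hz with 0 marking unvoiced frames, and it measures F0 error in cents; the function names, the data layout, and the choice of metric are all hypothetical.

```python
import math
from statistics import median


def split_by_f0(utterances, f0_min, f0_max):
    """Partition utterances by whether every voiced frame lies in [f0_min, f0_max].

    Assumption: `utterances` maps an utterance id to a list of frame-level
    F0 values in Hz, where 0 marks an unvoiced frame.
    """
    seen, unseen = {}, {}
    for utt_id, f0 in utterances.items():
        in_range = all(f0_min <= v <= f0_max for v in f0 if v > 0)
        (seen if in_range else unseen)[utt_id] = f0
    return seen, unseen


def frame_errors_cents(f0_target, f0_generated):
    """Local (frame-level) F0 error in cents, over frames voiced in both tracks."""
    return [
        1200.0 * math.log2(g / t)
        for t, g in zip(f0_target, f0_generated)
        if t > 0 and g > 0
    ]


def utterance_error_cents(f0_target, f0_generated):
    """Global (utterance-level) summary: median absolute frame error in cents."""
    errors = frame_errors_cents(f0_target, f0_generated)
    return median(abs(e) for e in errors) if errors else None


# Toy data (made up for illustration): train only on utterances within 100-200 Hz.
toy = {"a": [0, 120, 130], "b": [0, 300, 310], "c": [110, 250]}
train_set, held_out = split_by_f0(toy, 100, 200)
```

With such a split, a vocoder trained on `train_set` can be asked to synthesise the `held_out` utterances, and the frame-level and utterance-level errors then indicate whether it follows the requested out-of-range F0 or falls back towards values it saw during training.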

Subject: INTERSPEECH.2021 - Speech Synthesis

perrotin21@interspeech_2021@ISCA
