Audio-visual Voice Conversion Using Deep Canonical Correlation Analysis for Deep Bottleneck Features

Authors: Satoshi Tamura, Kento Horio, Hajime Endo, Satoru Hayamizu, Tomoki Toda

This paper proposes Audio-Visual Voice Conversion (AVVC) methods using Deep BottleNeck Features (DBNF) and Deep Canonical Correlation Analysis (DCCA). DBNFs have been adopted in several speech applications to obtain better feature representations. DCCA can generate highly correlated features across two views and enhance the features of one modality using the other view. In addition, DCCA can project different views, ideally, into the same vector space. First, we enhance our conventional AVVC scheme by employing the DBNF technique in the visual modality. Second, we apply DCCA to the DBNFs to obtain new, effective visual features. Third, we build a cross-modal voice conversion model that works with both audio and visual DCCA features. To clarify the effectiveness of these frameworks, we carried out subjective and objective evaluations and compared them with conventional methods. Experimental results show that our DBNF- and DCCA-based AVVC successfully improves the quality of converted speech waveforms.
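
As a rough illustration of what "correlated features in two views" means, the sketch below computes the total canonical correlation between two feature matrices, which is the quantity that DCCA is generally trained to maximize. It is not the authors' implementation: it covers only the linear correlation measure, not the neural networks that DCCA places in front of it. The function name `total_canonical_correlation`, the regularization value `reg`, and the synthetic "audio" and "visual" data are all assumptions made for this example.

```python
import numpy as np

def total_canonical_correlation(X, Y, reg=1e-4):
    """Sum of canonical correlations between two views (rows are frames).

    Hypothetical helper for illustration; `reg` is an assumed ridge term
    that keeps the covariance matrices invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # center each view
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition of a
        # symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # Singular values of the whitened cross-covariance are the
    # canonical correlations between the two views.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False).sum()

# Synthetic stand-ins for audio and visual features that share a
# 2-dimensional latent source (assumed data, not from the paper).
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 2))
audio = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
visual = shared @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(500, 3))
noise = rng.normal(size=(500, 3))

corr_related = total_canonical_correlation(audio, visual)
corr_unrelated = total_canonical_correlation(audio, noise)
print(corr_related > corr_unrelated)
```

Views that share an underlying source score higher than unrelated ones. In DCCA, each view would first pass through its own network, and the networks would be trained so that this score increases on their outputs.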

Subject: INTERSPEECH.2018 - Speech Synthesis

tamura18@interspeech_2018@ISCA
