Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss


Authors: Jesuraj Bandekar, Sathvik Udupa, Prasanta Kumar Ghosh

Articulatory speech synthesis is a challenging task that requires mapping time-varying articulatory trajectories to speech. In recent years, deep learning methods for speech synthesis have made significant progress towards human-like speech generation. However, articulatory speech synthesis remains far from human-level performance. In this work, we aim to improve its synthesis quality. We start from a deep learning-based sequence-to-sequence baseline and improve upon it using a novel label-aware contrastive learning approach, which uses framewise phoneme alignment to learn better representations of the articulatory trajectories. With this approach, we obtain a relative improvement in Word Error Rate (WER) of 5.8% over the baseline. We also conduct mean opinion score (MOS) tests and report other objective metrics to further evaluate our proposed models.
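
The abstract does not spell out the loss, so the following is only a minimal sketch of one common way to make a contrastive loss label-aware: a supervised-contrastive (SupCon-style) objective over frames, where frames that share a phoneme label under the framewise alignment are treated as positives. The function name, the temperature value, and the toy data are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def label_aware_contrastive_loss(embeddings, phoneme_ids, temperature=0.1):
    """SupCon-style loss over frames (illustrative, not the paper's exact loss).

    embeddings:  (n_frames, dim) frame-level representations, n_frames >= 2
    phoneme_ids: (n_frames,) phoneme label of each frame from the alignment
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(phoneme_ids)

    # L2-normalise so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax of each anchor over all *other* frames.
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stabiliser
    exp_sim = np.exp(sim) * not_self            # drop self-similarity
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Positives: other frames carrying the same phoneme label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so nothing to contrast

    # Average negative log-probability of the positives, per anchor.
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())

# Toy data (hypothetical): two tight clusters of frame embeddings.
frames = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
aligned = [0, 0, 1, 1]      # labels agree with the clusters
mismatched = [0, 1, 0, 1]   # labels cut across the clusters

loss_aligned = label_aware_contrastive_loss(frames, aligned)
loss_mismatched = label_aware_contrastive_loss(frames, mismatched)
```

In this sketch the loss is lower when same-phoneme frames already lie close together, so minimising it pulls frames of the same phoneme together and pushes frames of different phonemes apart. In a training setup such a term would typically be added to the synthesis loss with a weighting factor, which is again an assumption and not something the abstract states.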

Subject: INTERSPEECH.2024 - Others

bandekar24@interspeech_2024@ISCA
