Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification

Authors: Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André

Modeling adequate features of speech prosody is one key factor to good performance in affective speech classification. However, the distinction between the prosody that is induced by ‘how’ something is said (i.e., affective prosody) and the prosody that is induced by ‘what’ is being said (i.e., linguistic prosody) is neglected in state-of-the-art feature extraction systems. This results in high variability of the calculated feature values for different sentences that are spoken with the same affective intent, which might negatively impact the performance of the classification. While this distinction between different prosody types is mostly neglected in affective speech recognition, it is explicitly modeled in expressive speech synthesis to create controlled prosodic variation. In this work, we use the expressive Text-To-Speech model Global Style Token Tacotron to extract features for a speech analysis task. We show that the learned prosodic representations outperform state-of-the-art feature extraction systems in the exemplary use case of Escalation Level Classification.

Subject: INTERSPEECH.2021 - Language and Multimodal

schiller21@interspeech_2021@ISCA
