Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition


Authors: Danqing Luo, Yuexian Zou, Dongyan Huang

Speech emotion recognition (SER) is a challenging task because it is difficult to find proper representations for the emotion embedded in speech. Recently, the Convolutional Recurrent Neural Network (CRNN), which combines a convolutional neural network with a recurrent neural network, has become popular in this field and achieves state-of-the-art results on related corpora. However, most work on CRNN utilizes only simple spectral information, which cannot capture enough emotional characteristics for the SER task. In this work, we investigate two joint representation learning structures based on CRNN, aiming to capture richer emotional information from speech. By combining handcrafted high-level statistic features (HSF) with CRNN, a two-channel SER system (HSF-CRNN) is developed to jointly learn emotion-related features with better discriminative properties. Furthermore, considering that the duration of a speech segment significantly affects the accuracy of emotion recognition, another two-channel SER system is proposed in which CRNN features extracted from spectrogram segments at different time scales are used for joint representation learning. The systems are evaluated on the Atypical Affect Challenge of ComParE2018 and on the IEMOCAP corpus. Experimental results show that our proposed systems outperform the plain CRNN.
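
The multi-time-scale idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no segment lengths, hop sizes, or fusion details, so the function names, the parameters, and fusion by concatenation are all assumptions made for illustration. In particular, `mean_pool` is only a placeholder for a trained CRNN encoder.

```python
from typing import List

Frame = List[float]  # one spectrogram frame: one value per frequency bin


def split_segments(frames: List[Frame], seg_len: int, hop: int) -> List[List[Frame]]:
    """Cut a frame sequence into fixed-length, possibly overlapping segments."""
    if seg_len <= 0 or hop <= 0:
        raise ValueError("seg_len and hop must be positive")
    return [frames[i:i + seg_len] for i in range(0, len(frames) - seg_len + 1, hop)]


def mean_pool(vectors: List[Frame]) -> Frame:
    """Average a list of equal-length vectors element-wise.

    Placeholder for a learned encoder (assumption): a real system would
    run a trained CRNN over each segment instead.
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def channel_embedding(frames: List[Frame], seg_len: int, hop: int) -> Frame:
    """Encode every segment at one time scale, then average over segments."""
    segments = split_segments(frames, seg_len, hop)
    if not segments:
        raise ValueError("utterance is shorter than one segment")
    return mean_pool([mean_pool(segment) for segment in segments])


def joint_representation(frames: List[Frame], short_len: int, long_len: int, hop: int) -> Frame:
    """Concatenate the embeddings of a short-scale and a long-scale channel.

    Concatenation is an assumed fusion choice; the abstract does not say
    how the two channels are combined.
    """
    short_channel = channel_embedding(frames, short_len, hop)
    long_channel = channel_embedding(frames, long_len, hop)
    return short_channel + long_channel


if __name__ == "__main__":
    # Toy "spectrogram": 8 frames with 2 frequency bins each.
    toy = [[float(t), 1.0] for t in range(8)]
    print(joint_representation(toy, short_len=2, long_len=4, hop=2))
```

In a trained system the joint vector would feed a classifier, and both channels would be learned together rather than pooled with fixed averages.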

Subject: INTERSPEECH.2018 - Language and Multimodal

luo18@interspeech_2018@ISCA
