
Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts

Authors: Jilt Sebastian; Piero Pierucci

In human perception and understanding, a number of different and complementary cues are adopted according to the modality. The various emotional states expressed in human communication reflect this variety of cues across modalities. Recent work in multi-modal emotion recognition uses deep-learning techniques to achieve remarkable performance, with models built on features suited to text, audio, and vision. This work focuses on cross-modal fusion techniques over deep learning models for emotion detection from spoken audio and the corresponding transcripts. We investigate a long short-term memory (LSTM) recurrent neural network (RNN) with pre-trained word embeddings for text-based emotion recognition and a convolutional neural network (CNN) with utterance-level descriptors for emotion recognition from speech. Various fusion strategies are applied to these models to yield an overall score for each emotional category. Intra-modality dynamics for each emotion are captured in the neural network designed for that modality, while fusion techniques capture the inter-modality dynamics. Speaker- and session-independent experiments on the IEMOCAP multi-modal emotion detection dataset show the effectiveness of the proposed approaches, which yield state-of-the-art results for utterance-level emotion recognition from speech and text.
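To make the pipeline concrete, below is a minimal PyTorch sketch of the two per-modality models and one simple fusion strategy. It is an illustration only: the layer sizes, the acoustic feature dimension, the set of emotion classes, and the weighted-average (late, score-level) fusion rule are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 4  # assumed IEMOCAP subset, e.g. angry, happy, neutral, sad


class TextEmotionLSTM(nn.Module):
    """LSTM over word embeddings; the final hidden state gives class scores."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=128):
        super().__init__()
        # In practice the embedding matrix would be initialised from
        # pre-trained word vectors (e.g. GloVe) rather than learned from scratch.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, NUM_EMOTIONS)

    def forward(self, token_ids):            # (batch, seq_len)
        emb = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)         # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])              # (batch, NUM_EMOTIONS)


class SpeechEmotionCNN(nn.Module):
    """1-D CNN over a fixed-size utterance-level acoustic descriptor vector."""

    def __init__(self, feat_dim=384):        # feat_dim is an assumed descriptor size
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten(),
            nn.Linear(16 * 32, NUM_EMOTIONS),
        )

    def forward(self, feats):                # (batch, feat_dim)
        return self.net(feats.unsqueeze(1))  # add a channel dimension for Conv1d


def late_fusion(text_logits, speech_logits, alpha=0.5):
    """Weighted average of per-modality posteriors (one possible fusion rule)."""
    p_text = torch.softmax(text_logits, dim=-1)
    p_speech = torch.softmax(speech_logits, dim=-1)
    return alpha * p_text + (1 - alpha) * p_speech
```

In this sketch each network first models the intra-modality dynamics of its own input, and fusion combines the resulting per-class scores; other strategies in the same family would instead concatenate intermediate representations (feature-level fusion) before a shared classifier.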