Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

Authors: Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa

Emotions are an essential element of verbal communication, so understanding an individual's affect during human-robot interaction (HRI) is imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, to Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models for individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants. The results show that fine-tuning vision transformers on benchmark datasets, and then either using these already fine-tuned models or ensembling ViT/BEiT models, yields the highest per-individual classification accuracies for identifying four primary emotions from speech (neutral, happy, sad, and angry), compared to fine-tuning vanilla ViTs or BEiTs.
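The abstract does not say how the ViT and BEiT models are ensembled. As a minimal sketch, assuming a simple soft-voting scheme (averaging each model's class probabilities), the idea could look like the following. The function names and the logit values are hypothetical and are not taken from the paper.

```python
import math

# The four emotion classes named in the abstract.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_model_logits):
    """Average class probabilities across models and pick the top class.

    Soft voting is an assumption here; the paper may use another scheme.
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    avg = [sum(p[i] for p in probs) / n_models for i in range(len(EMOTIONS))]
    best = max(range(len(EMOTIONS)), key=lambda i: avg[i])
    return EMOTIONS[best], avg


# Hypothetical logits for one utterance from two fine-tuned models.
vit_logits = [0.2, 2.0, 0.1, 0.5]
beit_logits = [0.3, 1.0, 0.2, 1.4]

label, avg = soft_vote([vit_logits, beit_logits])
print(label, [round(p, 3) for p in avg])
```

In this made-up example the two models disagree on the top class, and averaging their probabilities resolves the disagreement in favor of the class with the higher combined confidence.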

Subjects: Audio and Speech Processing, Human-Computer Interaction, Robotics, Sound

Publish: 2024-09-16 19:34:34 UTC

2409.10687
