SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

Authors: Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem

Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.
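The abstract does not spell out how logit adjustment is applied, so the following is only a minimal sketch of the general idea, assuming the common post-hoc form in which the log of the training class prior is subtracted from each logit before taking the argmax. The function names, the `tau` scaling parameter, the smoothing constant, and the toy label counts are all hypothetical and are not taken from the paper.

```python
import math

def class_log_priors(labels, num_classes, smoothing=1.0):
    """Estimate log class priors from integer labels (hypothetical helper).

    Additive smoothing keeps the log finite for classes with no examples.
    """
    counts = [smoothing] * num_classes
    for y in labels:
        counts[y] += 1.0
    total = sum(counts)
    return [math.log(c / total) for c in counts]

def adjust_logits(logits, source_log_priors, target_log_priors=None, tau=1.0):
    """Post-hoc logit adjustment (assumed form, not the paper's exact recipe).

    Subtracts tau * log-prior of the source (training) distribution and,
    if given, adds tau * log-prior of the target distribution.
    With no target prior, this corresponds to assuming a uniform target.
    """
    if target_log_priors is None:
        target_log_priors = [0.0] * len(logits)
    return [
        z - tau * s + tau * t
        for z, s, t in zip(logits, source_log_priors, target_log_priors)
    ]

def predict(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda k: logits[k])

# Toy, made-up label counts: class 0 dominates the training set.
train_labels = [0] * 80 + [1] * 15 + [2] * 5
log_priors = class_log_priors(train_labels, num_classes=3)

raw = [2.0, 1.5, 0.0]                    # made-up model logits for one utterance
adjusted = adjust_logits(raw, log_priors)

print(predict(raw), predict(adjusted))
```

In this toy case the raw logits favor the majority class, while the adjusted logits favor a rarer class, which is the kind of correction that matters when emotion label distributions differ across corpora.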

Subjects: Computation and Language, Artificial Intelligence

Published: 2024-08-14 23:33:10 UTC

2408.07851
