Recent research on speech models jointly pre-trained with text has shown promising potential to enhance speech representations by encoding both speech and text within a shared space. However, these models often suffer from interference between the speech and text modalities, which makes cross-modality alignment difficult to achieve. Furthermore, previous evaluation of these models has focused on neutral speech scenarios; their effectiveness on domain-shifted speech, notably emotional speech, remains largely unexplored in existing work. In this study, a modality translation model is proposed that aligns the speech and text modalities in a shared space for speech-to-text translation, and the resulting shared representation is harnessed to address the challenge of emotional speech recognition. Experimental results show that the proposed method achieves an absolute improvement of about 3% in word error rate compared with speech models.