Multimodal Fusion for Vocal Biomarkers Using Vector Cross-Attention

Authors: Vladimir Despotovic, Abir Elbéji, Petr V. Nazarov, Guy Fagherazzi

Vocal biomarkers are measurable characteristics of a person's voice that provide valuable insights into their physiological and psychological state, or health status. Standardized voice tasks, such as reading, counting, or sustained vowel phonation, are common in vocal biomarker research, but semi-spontaneous tasks, where the person is instructed to talk about a particular topic, and spontaneous speech are also increasingly used. However, limited efforts have been made to combine multiple voice modalities. In this paper, we propose a simple yet efficient approach to fusing multiple standardized voice tasks based on vector cross-attention, showing improved predictive capacity for the derived vocal biomarkers compared with single modalities. The multimodal approach is tested on the assessment of respiratory quality of life from reading and sustained vowel phonation recordings, outperforming single modalities by up to 4.2% in terms of accuracy (a relative increase of 7%).
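The abstract does not spell out how the vector cross-attention is computed, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes each voice task has already been reduced to a single fixed-size embedding vector, treats each vector element as a scalar token, and lets one vector attend over the other. The function names (`cross_attend`, `fuse`), the scalar projection weights, the residual connection, and the concatenation step are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_vec, context_vec, w_q=1.0, w_k=1.0, w_v=1.0):
    # Hypothetical scalar projections; a trained model would learn these.
    q = w_q * query_vec
    k = w_k * context_vec
    v = w_v * context_vec
    # Each query element scores every context element: shape (d_q, d_c).
    scores = np.outer(q, k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Each output element is a weighted average of the context values.
    return weights @ v

def fuse(reading_emb, vowel_emb):
    # Attend in both directions, add a residual, then concatenate (assumed design).
    reading_ctx = cross_attend(reading_emb, vowel_emb)
    vowel_ctx = cross_attend(vowel_emb, reading_emb)
    return np.concatenate([reading_emb + reading_ctx, vowel_emb + vowel_ctx])

# Placeholder embeddings standing in for per-task audio features.
rng = np.random.default_rng(0)
reading_emb = rng.standard_normal(8)
vowel_emb = rng.standard_normal(8)
fused = fuse(reading_emb, vowel_emb)
print(fused.shape)
```

In this sketch the fused vector would then be passed to a downstream classifier; the random inputs are placeholders and carry no relation to the data or results reported in the paper.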

Subject: INTERSPEECH.2024 - Special Session

despotovic24@interspeech_2024@ISCA
