SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering

Multimodal models integrating speech and vision hold significant potential for advancing human-computer interaction, particularly in Speech-Based Visual Question Answering (SBVQA), where spoken questions about images require direct audio-visual understanding. Existing approaches predominantly focus on text-visual integration, leaving the speech-visual modality gap underexplored because of the inherent heterogeneity of the two modalities. To this end, we introduce SViQA, a unified speech-vision model that processes spoken questions directly, without text transcription. Building on the LLaVA architecture, our framework bridges the auditory and visual modalities through two key innovations: (1) end-to-end speech feature extraction that eliminates intermediate text conversion, and (2) cross-modal alignment optimization that enables effective fusion of speech signals with visual content. Extensive experiments on the SBVQA benchmark show that SViQA achieves state-of-the-art performance, with 75.62% accuracy, and competitive multimodal generalization. Using mixed speech-text input raises accuracy to 78.85%, an improvement of 3.23 percentage points over pure speech input, highlighting SViQA's enhanced robustness and effective cross-modal attention alignment.

Subjects: Computer Vision and Pattern Recognition, Machine Learning

Published: 2025-04-01 07:15:32 UTC

2504.01049
