2025.emnlp-main.589@ACL


SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

Authors: Tan-Hanh Pham, Le Hoang Nam, Phu-Vinh Nguyen, Chris Ngo, Truong-Son Hy

Visual Language Models have demonstrated remarkable capabilities across various tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, which limits their effectiveness in natural human-machine interaction. Moreover, the quality of language models depends largely on reasoning and prompting techniques, such as chain-of-thought, which remain underexplored when instructions are given as speech. To address these challenges, we propose SilVar, an end-to-end multimodal model that leverages speech instructions for reasoning-based visual question answering. We also investigate reasoning techniques at different levels, covering conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interactions by allowing users to provide verbal or text-based instructions. To support speech-based reasoning, we introduce a new dataset designed to challenge models with spoken reasoning tasks for object localization. This dataset strengthens the model's ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interactions. To our knowledge, SilVar is the first open-source, speech-driven VLM. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.
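
The abstract names the three backbone components (CLIP for vision, Whisper for speech, LLaMA 3.1-8B as the language model) but not how they are wired together. The sketch below is a minimal, hypothetical illustration of one common way to combine such encoders with a decoder-only LLM: each encoder's hidden states are linearly projected into the LLM embedding space and prepended as soft prompts. The projection design, fusion order, and model checkpoints are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch of a speech-driven VLM in the spirit of SilVar:
# CLIP vision encoder + Whisper speech encoder, each projected into the
# embedding space of a LLaMA-style decoder. The wiring here is an assumption.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, WhisperModel, AutoModelForCausalLM


class SpeechVisionLLM(nn.Module):
    def __init__(
        self,
        vision_name: str = "openai/clip-vit-large-patch14",
        speech_name: str = "openai/whisper-small",
        llm_name: str = "meta-llama/Llama-3.1-8B",
    ):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.speech = WhisperModel.from_pretrained(speech_name).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)

        d_llm = self.llm.config.hidden_size
        # Linear projections mapping each encoder's features into the LLM embedding space
        # (a simple choice; the actual projector in the paper may differ).
        self.vision_proj = nn.Linear(self.vision.config.hidden_size, d_llm)
        self.speech_proj = nn.Linear(self.speech.config.d_model, d_llm)

    def forward(self, pixel_values: torch.Tensor, input_features: torch.Tensor):
        # Encode the image and the spoken instruction, project both into the
        # LLM embedding space, and feed the concatenated sequence as soft prompts.
        v = self.vision_proj(self.vision(pixel_values=pixel_values).last_hidden_state)
        s = self.speech_proj(self.speech(input_features).last_hidden_state)
        inputs_embeds = torch.cat([v, s], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In such a setup, the model is typically trained end-to-end on (image, speech instruction, text answer) triples, with the answer tokens appended after the projected multimodal prefix; whether the encoders are frozen or fine-tuned is a design choice not specified in the abstract.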

Subject: EMNLP.2025 - Main