h0LzGQq6uO@OpenReview

Total: 1

#1 Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Authors: Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye

The task of Knowledge-Based Visual Question Answering (KB-VQA) requires the model to understand visual features and retrieve external knowledge. Retrieval-Augmented Generation (RAG) has been employed to address this problem through knowledge base querying. However, existing work exhibits two limitations: insufficient interactivity during knowledge retrieval and ineffective organization of retrieved information for the Visual-Language Model (VLM). To address these challenges, we propose a three-stage Visual Language Model with Process, Retrieve and Filter (VLM-PRF) framework. For interactive retrieval, VLM-PRF uses reinforcement learning (RL) to guide the model to strategically process information via tool-driven operations. For knowledge filtering, our method trains the VLM to transform the raw retrieved information into task-specific knowledge. With a dual reward as the supervisory signal, VLM-PRF enables the model to optimize retrieval strategies and answer generation capabilities simultaneously. Experiments on two datasets demonstrate the effectiveness of our framework.
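The abstract describes a dual reward that supervises retrieval and answer generation jointly. Below is a minimal illustrative sketch of how such a combined signal might be computed; the function names, weighting scheme, and scoring heuristics are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a dual reward: one term for retrieval quality and one
# for answer correctness, combined into a single RL training signal.
# All names, weights, and heuristics here are illustrative assumptions.

def retrieval_reward(retrieved_passages, gold_evidence):
    """Fraction of gold evidence snippets covered by the retrieved passages."""
    if not gold_evidence:
        return 0.0
    hits = sum(any(ev.lower() in p.lower() for p in retrieved_passages)
               for ev in gold_evidence)
    return hits / len(gold_evidence)

def answer_reward(predicted_answer, gold_answer):
    """1.0 for an exact (case-insensitive) match, else 0.0."""
    return float(predicted_answer.strip().lower() == gold_answer.strip().lower())

def dual_reward(retrieved_passages, gold_evidence,
                predicted_answer, gold_answer, alpha=0.5):
    """Weighted sum of retrieval and answer rewards used as the RL signal."""
    r_ret = retrieval_reward(retrieved_passages, gold_evidence)
    r_ans = answer_reward(predicted_answer, gold_answer)
    return alpha * r_ret + (1.0 - alpha) * r_ans

# Toy usage with a single retrieved passage and gold answer.
print(dual_reward(
    retrieved_passages=["The Eiffel Tower is in Paris, France."],
    gold_evidence=["Eiffel Tower is in Paris"],
    predicted_answer="Paris",
    gold_answer="paris",
))  # -> 1.0
```

A weighted sum is only one way to combine the two terms; the paper's actual reward shaping may differ.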

Subject: NeurIPS.2025 - Poster