Knowledge Image Matters: Improving Knowledge-Based Visual Reasoning with Multi-Image Large Language Models

Authors: Guanghui Ye, Huan Zhao, Zhixue Zhao, Xupeng Zha, Yang Liu, Zhihua Jiang

We revisit knowledge-based visual reasoning (KB-VR) in light of modern advances in multimodal large language models (MLLMs), and make the following contributions: (i) We propose the Visual Knowledge Card (VKC), a novel image that incorporates not only internal visual knowledge (e.g., scene-aware information) detected from the raw image, but also external world knowledge (e.g., attribute or object knowledge) produced by a knowledge generator; (ii) We present VKC-based Multi-Image Reasoning (VKC-MIR), a four-stage pipeline that uses a state-of-the-art scene perception engine to construct an initial VKC (Stage-1), a powerful LLM to generate relevant domain knowledge (Stage-2), an image editing toolkit to add the generated knowledge to the updated VKC (Stage-3), and finally, a multi-image MLLM to solve the VKC-enhanced task (Stage-4). In experiments on three popular KB-VR benchmarks, our approach achieves new state-of-the-art results, outperforming previous top-performing models.

Subject: ACL.2025 - Long Papers

2025.acl-long.1063@ACL
