VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Authors: Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack an intuitive, human-like understanding of the world's underlying physical and social principles. These high-level, vision-grounded semantics, which we term visual knowledge, form a bridge between perception and reasoning, yet remain underexplored in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) categories. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-25 12:58:32 UTC

arXiv: 2511.20272
