Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Authors: Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu

Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists: "conceptualization", the ability to recognize and reason about the same concept despite variations in visual form, which is a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed entirely on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: https://vga.csail.mit.edu/

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2025-06-06 17:06:25 UTC

arXiv: 2506.06242
