ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

Authors: Yihao Wang, Zijian He, Jie Ren, Keze Wang

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.
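
As a rough illustration of the two quantities the abstract reports (the drop in percentage points from counting to action, and action accuracy restricted to items where the same model counted correctly), the following is a minimal sketch. It is not the authors' evaluation code: the `PairedResult` record, its field names, and the function names are hypothetical, and it assumes each item pairs one counting outcome with one action outcome on the same scene and region.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class PairedResult:
    """Hypothetical record: one scene/region scored on both task types."""
    count_correct: bool   # model returned the right count
    action_correct: bool  # model produced the right region-conditioned action


def accuracy(flags: Iterable[bool]) -> float:
    """Accuracy in percent; NaN when there are no items."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else float("nan")


def perception_to_action_gap(results: List[PairedResult]) -> Dict[str, float]:
    """Summarize counting vs. action accuracy on paired items."""
    count_acc = accuracy(r.count_correct for r in results)
    action_acc = accuracy(r.action_correct for r in results)
    # Action accuracy restricted to items the model counted correctly:
    # a low value here means the loss is not explained by miscounting.
    action_given_count = accuracy(
        r.action_correct for r in results if r.count_correct
    )
    return {
        "count_acc": count_acc,
        "action_acc": action_acc,
        "gap_pp": count_acc - action_acc,  # difference in percentage points
        "action_acc_given_correct_count": action_given_count,
    }
```

A gap reported this way is a difference between two percentages, not a relative change, which is why the abstract states it in percentage points.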

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2026-06-18 09:05:48 UTC

2606.19965
