Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning

Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi

Vision-Language Models (VLMs) have emerged as powerful tools for generating textual descriptions from visual data. While these models excel on web-scale datasets, their robustness to the domain shifts inherent in many real-world applications remains under-explored. This paper presents a systematic evaluation of VLM performance on a single-view object captioning task when faced with a controlled, physical domain shift. We compare captioning accuracy across two distinct object sets: a collection of multi-material, real-world tools and a set of single-material, 3D-printed items. The 3D-printed set introduces a significant domain shift in texture and material properties, challenging the models' generalization capabilities. Our quantitative results demonstrate that all tested VLMs show a marked performance degradation when describing the 3D-printed objects compared to the real-world tools. This underscores a critical limitation in the ability of current models to generalize beyond surface-level features and highlights the need for more robust architectures for real-world signal processing applications.

Subjects: Robotics, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning

Published: 2025-06-24 12:45:09 UTC

2506.19579
