On the Limitations of Vision-Language Models in Understanding Image Transforms

Authors: Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz

Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
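
To illustrate the dataset construction the abstract mentions (pairing each image with a description of the applied transformation), here is a minimal sketch. It is not the authors' pipeline: the transform names, the description wording, and the `augment` helper are all assumptions made for illustration, and tiny nested lists stand in for real Flickr8k images so the example runs without an imaging library.

```python
# Hypothetical sketch, not the paper's code. Images are nested lists of
# pixel values so that no imaging library is required.

def hflip(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the row order (top-to-bottom mirror)."""
    return img[::-1]

def rotate90_cw(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

# Assumed transform names and description templates (not taken from the paper).
TRANSFORMS = {
    "horizontal_flip": (hflip, "The image has been flipped horizontally."),
    "vertical_flip": (vflip, "The image has been flipped vertically."),
    "rotate_90": (rotate90_cw, "The image has been rotated 90 degrees clockwise."),
}

def augment(img, caption):
    """Return one record per transform: transformed pixels, the original
    caption, and a text description of the transformation applied."""
    records = []
    for name, (fn, description) in TRANSFORMS.items():
        records.append({
            "transform": name,
            "image": fn(img),
            "caption": caption,
            "transform_description": description,
        })
    return records

if __name__ == "__main__":
    for rec in augment([[1, 2], [3, 4]], "a dog running on grass"):
        print(rec["transform"], rec["image"], rec["transform_description"])
```

Under these assumptions, each record could then be used to check whether a model such as CLIP or SigLIP scores the transformed image higher against the correct transformation description than against the incorrect ones.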

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

Published: 2025-03-12 20:58:16 UTC

arXiv: 2503.09837
