GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations

#1 GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations [PDF] [Copy] [Kimi] [REL]

Authors: Yunqi Liu, Xue Ouyang, Xiaohui Cui

Vision-language pre-training (VLP) models leverage large-scale cross-modal pre-training to align vision and text modalities, achieving impressive performance on tasks like image-text retrieval and visual grounding. However, these models are highly vulnerable to adversarial attacks, raising critical concerns about their robustness and reliability in safety-critical applications. Existing black-box attack methods are limited by insufficient data augmentation mechanisms or the disruption of global semantic structures, leading to poor adversarial transferability. To address these challenges, we propose the Global-Local Enhanced Adversarial Multimodal attack (GLEAM), a unified framework for generating transferable adversarial examples in vision-language tasks. GLEAM introduces a local feature enhancement module that achieves diverse local deformations while maintaining global semantic and geometric integrity. It also incorporates a global distribution expansion module, which expands feature space coverage through dynamic transformations. Additionally, a cross-modal feature alignment module leverages intermediate adversarial states to guide text perturbations. This enhances cross-modal consistency and adversarial text transferability. Extensive experiments on Flickr30K and MSCOCO datasets show that GLEAM outperforms state-of-the-art methods, with over 10%-30% higher attack success rates in image-text retrieval tasks and over 30% improved transferability on large models like Claude 3.5 Sonnet and GPT-4o. GLEAM provides a robust tool for exposing vulnerabilities in VLP models and offers valuable insights into designing more secure and reliable vision-language systems. Our code is available at https://github.com/LuckAlex/GLEAM.

Subject: ICCV.2025 - Poster

Liu_GLEAM_Enhanced_Transferable_Adversarial_Attacks_for_Vision-Language_Pre-training_Models_via@ICCV2025@CVF

#1 GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations [PDF] [Copy] [Kimi] [REL]