Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models


Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
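The abstract's idea of unsupervised detection with an auto-encoder can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that safety-oriented feature vectors have already been extracted (here replaced by synthetic data), swaps in a linear auto-encoder (equivalent to PCA) for the Safety Pattern Auto-Encoder, and uses reconstruction error as the anomaly score. All function names, dimensions, and data below are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a rank-k linear auto-encoder (equivalent to PCA) on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions; the top k act as tied encoder/decoder weights.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(X, mean, components):
    """Squared error between each row and its rank-k reconstruction."""
    centered = X - mean
    recon = centered @ components.T @ components
    return np.sum((centered - recon) ** 2, axis=1)

def auroc(scores_neg, scores_pos):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def demo(seed=0):
    """Train on synthetic 'safe' features only, then score unseen 'attack' features."""
    rng = np.random.default_rng(seed)
    dim, k = 32, 4
    basis = rng.normal(size=(k, dim))  # hypothetical low-dimensional structure of safe inputs

    def safe(n):
        return rng.normal(size=(n, k)) @ basis + 0.1 * rng.normal(size=(n, dim))

    train_safe, test_safe = safe(500), safe(200)
    # Synthetic stand-in for attacks: points that do not lie near the safe subspace.
    test_attack = 2.0 * rng.normal(size=(200, dim))

    mean, comps = fit_linear_autoencoder(train_safe, k)
    return auroc(reconstruction_error(test_safe, mean, comps),
                 reconstruction_error(test_attack, mean, comps))

if __name__ == "__main__":
    print(f"AUROC on synthetic data: {demo():.3f}")
```

Because the model is fit only on safe inputs, no attack-specific parameters are learned; anything that reconstructs poorly is flagged, which is the general property that lets this style of detector apply to attacks it has never seen.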

Subjects: Cryptography and Security, Artificial Intelligence, Computer Vision and Pattern Recognition

Published: 2025-08-08 16:13:28 UTC

2508.09201
