Recent advances in Large Language Models (LLMs) have led to impressive alignment, in which models learn to distinguish harmful from harmless queries through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). In this paper, we reveal a subtle yet consequential weakness in these aligned models. We find that simply appending multiple end-of-sequence (eos) tokens can cause a phenomenon we call "context segmentation", which effectively shifts both "harmful" and "benign" inputs closer to the refusal boundary in the hidden space. Building on this observation, we propose a straightforward method to BOOST jailbreak attacks by appending eos tokens. Our systematic evaluation shows that this strategy significantly increases the attack success rate across 8 representative jailbreak techniques and 16 open-source LLMs ranging from 2B to 72B parameters. Moreover, we develop a novel probing mechanism for commercial APIs and find that major providers, including OpenAI, Anthropic, and Qwen, do not filter eos tokens, making them similarly vulnerable. These findings highlight a hidden yet critical blind spot in existing alignment and content-filtering approaches. We call for heightened attention to the unintended influence of eos tokens on model behavior, particularly in production systems. Our work not only motivates input-filtering-based defenses, but also points to new defenses that make refusal boundaries more robust and generalizable, as well as fundamental alignment techniques that can withstand context segmentation attacks.
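As a concrete illustration of the attack described above, the sketch below appends eos tokens (at the token-id level) to a prompt produced by an existing jailbreak technique before querying an open-source chat model. This is a minimal sketch of the idea rather than the paper's released implementation; the model name, placeholder prompt, and the number of appended eos tokens are illustrative assumptions.

```python
# Minimal, illustrative sketch of eos-token appending (not the authors' exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # assumed: any open-source aligned chat LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "..."          # a prompt produced by any existing jailbreak technique
num_eos = 5             # hypothetical number of appended eos tokens

# Tokenize the prompt, then append `num_eos` copies of the eos token id.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
eos_ids = torch.full((1, num_eos), tokenizer.eos_token_id, dtype=input_ids.dtype)
boosted_ids = torch.cat([input_ids, eos_ids], dim=-1)

# Generate from the eos-augmented input and inspect whether the model still refuses.
output_ids = model.generate(boosted_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][boosted_ids.shape[-1]:], skip_special_tokens=True))
```

Appending at the token-id level (rather than concatenating the eos string to the text) avoids any dependence on how a particular tokenizer handles special-token strings embedded in raw input.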