Large language models (LLMs) are inherently designed to support multi-turn interactions, which opens up new possibilities for jailbreak attacks that unfold gradually and can potentially bypass safety mechanisms more effectively than single-turn attacks. However, existing multi-turn jailbreak methods are still in their early stages and suffer from two key limitations. First, they all require inserting sensitive phrases into the context, which makes the dialogue appear suspicious, increases the likelihood of rejection, and undermines the effectiveness of the attack. Second, even when harmful content is generated, the response often fails to align with the malicious prompt due to semantic drift, where the conversation gradually moves away from its intended goal. To address these challenges, we propose an analogy-based black-box multi-turn jailbreak framework that constructs fully benign contexts to improve the attack success rate while preserving semantic alignment with the malicious intent. The method first guides the model through safe tasks that mirror the response structure of the malicious prompt, enabling it to internalize the format without exposure to sensitive content. A controlled semantic shift is then introduced in the final turn, substituting benign elements with malicious ones while preserving structural coherence. Experiments on six commercial and open-source LLMs and two benchmark datasets show that our method significantly improves attack performance, achieving an average attack success rate of 93.3\% and outperforming five competitive baselines. Our code is released at https://anonymous.4open.science/r/AMA-E1C4