Large language models (LLMs) remain persistently vulnerable to jailbreak attacks despite their increasing capabilities. While developers deploy alignment finetuning and safety guardrails, researchers consistently devise novel attacks that circumvent these defenses. This dynamic mirrors a strategic game of continual evolution. However, two challenges hinder jailbreak development: the high cost of querying top-tier LLMs and the short lifespan of effective attacks due to frequent safety updates. Both factors limit the cost-efficiency and practical impact of new attacks. To address this, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. MetaCipher uses reinforcement learning to adaptively select attack strategies, and its modular design supports extensibility to future strategies. Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks, outperforming prior jailbreak methods. We conduct a large-scale empirical evaluation across diverse victim models, demonstrating its robustness and adaptability.