LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds

Authors: James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah

Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts. Existing methods rely on discrete optimization or trained adversarial generators, but they are slow, compute-intensive, and often impractical. We argue that these inefficiencies stem from a mischaracterization of the problem. We instead frame jailbreaks as inference-time misalignment and introduce LIAR (Leveraging Inference-time misAlignment to jailbReak), a fast, black-box, best-of-$N$ sampling attack that requires no training. LIAR matches state-of-the-art success rates while reducing perplexity by $10\times$ and Time-to-Attack from hours to seconds. We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds. Our work offers a simple yet effective tool for evaluating LLM robustness and advancing alignment research.
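The abstract names best-of-$N$ sampling but does not spell out the procedure. The following is a minimal, generic sketch of the best-of-$N$ selection step only: draw $N$ candidates from a proposal function, score each one, and keep the highest-scoring candidate. It is not the authors' implementation. The names `best_of_n`, `toy_propose`, `toy_score`, and `WORDS` are hypothetical, and the toy proposal and scoring functions are harmless placeholders chosen only so the example runs on its own.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    propose: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Draw n candidates from `propose` and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # Sample n independent candidates from the proposal function.
    candidates: List[str] = [propose(rng) for _ in range(n)]
    # Score every candidate and keep the one with the maximum score.
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions, not from the paper): candidates are random
# words and the score is simply the number of vowels in the word.
WORDS = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]


def toy_propose(rng: random.Random) -> str:
    return rng.choice(WORDS)


def toy_score(text: str) -> float:
    return float(sum(ch in "aeiou" for ch in text))


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, best_of_n(toy_propose, toy_score, n))
```

Because the same seed is reused, the candidates drawn for a smaller $N$ are a prefix of those drawn for a larger $N$, so the best score in this sketch never decreases as $N$ grows. No gradient computation or training is involved, only sampling and scoring, which is consistent with the abstract's description of a black-box method that requires no training.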

Subject: Computation and Language

Published: 2024-12-06 18:02:59 UTC

arXiv: 2412.05232
