
Always Refuse: Steering LLMs Against Jailbreaks with Contrastive Activations (Student Abstract)

Authors: Abhilekh Borah, Niranjan Chebrolu, Kokil Jaidka

“Refusals must be resilient, not brittle.” Yet guarding refusals against adversarial phrasing and shifting user contexts remains difficult: large language models (LLMs) still yield to jailbreak prompts that evade safety filters and surface harmful content. We propose Refusal Activation Steering (RAS), a training-free, inference-time method that uses contrastive activations to shift LLM responses, biasing generation trajectories toward refusals without altering model weights. The approach is modular and domain-targetable, avoiding collateral refusals on benign queries while strengthening activation-space boundaries for unsafe content. On adversarial evaluations with an 8B instruction-tuned model, we find that steering improves refusal rate by ∼52% and reduces attack success rate by ∼40%, establishing a lightweight and interpretable safety layer for robust refusal consistency. To foster further research in this domain, we have made our implementation publicly available.
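The abstract does not give implementation details, but contrastive activation steering is commonly realized as a mean-difference vector between hidden activations on refusal-eliciting and compliance-eliciting prompts, added to the residual stream at inference time. The sketch below is a minimal, hypothetical illustration of that idea using toy NumPy activations; the function names, the scaling parameter `alpha`, and the toy data are all assumptions, not the authors' code.

```python
import numpy as np

def contrastive_steering_vector(refusal_acts: np.ndarray,
                                comply_acts: np.ndarray) -> np.ndarray:
    """Mean activation over refusal examples minus mean over compliance
    examples: a direction that (in this toy setting) points from
    'comply' toward 'refuse' in activation space."""
    return refusal_acts.mean(axis=0) - comply_acts.mean(axis=0)

def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Inference-time intervention: shift a hidden state along the
    refusal direction without touching any model weights."""
    return hidden + alpha * vector

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy activations: refusal examples cluster near +e0, compliance near -e0.
rng = np.random.default_rng(0)
refusal_acts = rng.normal(loc=[2.0, 0.0, 0.0], scale=0.1, size=(8, 3))
comply_acts = rng.normal(loc=[-2.0, 0.0, 0.0], scale=0.1, size=(8, 3))

v = contrastive_steering_vector(refusal_acts, comply_acts)
h = comply_acts[0]              # a hidden state headed toward compliance
h_steered = steer(h, v, alpha=0.5)

# Steering should move the state closer to the refusal cluster mean.
refusal_mean = refusal_acts.mean(axis=0)
print(cosine(h_steered, refusal_mean) > cosine(h, refusal_mean))
```

In practice the vector would be computed from transformer residual-stream activations at a chosen layer and applied via a forward hook; the toy 3-dimensional vectors here only demonstrate the geometry of the intervention.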

Subject: AAAI.2026 - Student Abstract and Poster Program