Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Fine-tuning language models on insecure code induces emergent misalignment whose internal structure is poorly understood. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B), all fine-tuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21 to 51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls, as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable, whereas cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
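The three linear operations named in the abstract (a difference-in-means direction, steering by subtraction, and a ridge regression map between activation spaces) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the synthetic Gaussian activations, the midpoint-threshold separation score, the steering scale `alpha`, the regularization strength `lam`, and all function names are assumptions made for illustration only.

```python
import numpy as np


def diff_in_means_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    d = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return d / np.linalg.norm(d)


def separation_accuracy(aligned, misaligned, direction):
    """Fraction of activations split correctly by a midpoint threshold
    on their projection onto `direction` (an assumed scoring rule)."""
    pa = aligned @ direction
    pm = misaligned @ direction
    threshold = 0.5 * (pa.mean() + pm.mean())
    correct = (pa < threshold).sum() + (pm >= threshold).sum()
    return correct / (len(pa) + len(pm))


def steer(acts, direction, alpha=1.0):
    """Subtract `alpha` times the direction from every activation.
    `alpha` is a hypothetical scale; the abstract does not specify one."""
    return acts - alpha * direction


def fit_ridge_map(src, tgt, lam=1.0):
    """Closed-form ridge solution W minimizing
    ||src @ W - tgt||^2 + lam * ||W||^2."""
    d = src.shape[1]
    return np.linalg.solve(src.T @ src + lam * np.eye(d), src.T @ tgt)


# Synthetic stand-ins for final-layer activations (assumption: real
# activations would come from forward passes of the fine-tuned models).
rng = np.random.default_rng(0)
dim = 16
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
aligned = rng.normal(size=(200, dim))
misaligned = rng.normal(size=(200, dim)) + 6.0 * true_dir

direction = diff_in_means_direction(aligned, misaligned)
accuracy = separation_accuracy(aligned, misaligned, direction)
steered = steer(misaligned, direction, alpha=6.0)

# Toy cross-model map: a source space related to a target space by a
# fixed linear transform, recovered from paired activations.
src = rng.normal(size=(300, dim))
w_true = rng.normal(size=(dim, 12))
tgt = src @ w_true
w_hat = fit_ridge_map(src, tgt, lam=1e-3)
mapped_direction = direction @ w_hat  # source direction expressed in target space
```

On real models, the specificity controls described in the abstract would correspond to repeating the steering step with random or orthogonal directions of matched norm and comparing the behavioral effect; the sketch above only covers the geometry.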

Subject: Computation and Language

Published: 2026-06-18 13:39:59 UTC

2606.20225
