Identifying and Tuning Safety Neurons in Large Language Models | Cool Papers

#1 Identifying and Tuning Safety Neurons in Large Language Models

Authors: Yiran Zhao, Wenxuan Zhang, Yuxi Xie, Anirudh Goyal, Kenji Kawaguchi, Michael Qizhe Shieh

Safety alignment for Large Language Models (LLMs) has become a critical issue due to their rapid progress. However, our understanding of effective safety mechanisms in LLMs remains limited, so safety alignment training mainly focuses on improving optimization, enhancing data, or adding extra structures that intentionally block harmful outputs. To address this gap, we develop a neuron detection method to identify safety neurons, those consistently crucial for handling and defending against harmful queries. Our findings reveal that these safety neurons constitute less than $1\%$ of all parameters, are language-specific, and are predominantly located in self-attention layers. Moreover, safety is collectively managed by these neurons in the first several layers. Based on these observations, we introduce a $\underline{S}$afety $\underline{N}$euron $\underline{Tun}$ing method, named $\texttt{SN-Tune}$, that exclusively tunes safety neurons without compromising models' general capabilities. $\texttt{SN-Tune}$ significantly enhances the safety of instruction-tuned models, notably reducing the harmful scores of Llama3-8B-Instruction from $65.5$ to $2.0$, Mistral-7B-Instruct-v0.2 from $70.8$ to $4.5$, and Vicuna-13B-1.5 from $93.5$ to $3.0$. Moreover, $\texttt{SN-Tune}$ can be applied to base models to establish a safety mechanism, effectively reducing their harmful scores from around $100$ to $5.3$, $13.5$, and $13.8$ for Llama2-7B-Base, Llama3-8B-Base, and Mistral-7B-v0.1, respectively. In addition, we improve LLMs' safety robustness during downstream-task fine-tuning by separating the safety neurons from the models' foundation neurons.
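As a rough illustration of what "exclusively tuning" a small subset of parameters could look like, here is a minimal sketch of a masked gradient step in plain Python. It is not the authors' implementation. The function name `masked_sgd_step`, the `trainable` mask, and all numeric values are hypothetical. In the paper, the detection method would decide which parameters count as safety neurons; that step is not reproduced here.

```python
def masked_sgd_step(weights, grads, trainable, lr=0.1):
    """Apply one SGD step only to entries flagged as trainable.

    Entries whose flag is False are returned unchanged (frozen).
    """
    return [
        w - lr * g if t else w
        for w, g, t in zip(weights, grads, trainable)
    ]


# Toy parameters and gradients (illustrative values only).
weights = [0.5, -1.0, 2.0, 0.25]
grads = [1.0, 1.0, -2.0, 4.0]

# Hypothetical mask: pretend only the first parameter was
# identified as a "safety neuron"; everything else stays frozen.
trainable = [True, False, False, False]

updated = masked_sgd_step(weights, grads, trainable)
```

Only the flagged entry moves, while the frozen entries keep their original values. That is the general idea behind updating a small set of parameters without disturbing the rest of the model.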

Subject: ICLR.2025 - Poster

yR47RmND1m@OpenReview
