Safety alignment for Large Language Models (LLMs) has become a critical issue due to their rapid progress. However, our understanding of effective safety mechanisms in LLMs remains limited, leading to safety alignment training that mainly focuses on improving optimization, data-level enhancement, or adding extra structures to intentionally block harmful outputs. To address this gap, we develop a neuron detection method to identify safety neurons, those that are consistently crucial for handling and defending against harmful queries. Our findings reveal that these safety neurons constitute less than 1% of all parameters, are language-specific, and are predominantly located in self-attention layers. Moreover, safety is collectively managed by these neurons in the first several layers. Based on these observations, we introduce a Safety Neuron Tuning method, named SN-Tune, that exclusively tunes safety neurons without compromising models' general capabilities. SN-Tune significantly enhances the safety of instruction-tuned models, notably reducing the harmful scores of Llama3-8B-Instruction from 65.5 to 2.0, Mistral-7B-Instruct-v0.2 from 70.8 to 4.5, and Vicuna-13B-1.5 from 93.5 to 3.0. Moreover, SN-Tune can be applied to base models to establish a safety mechanism, effectively reducing their harmful scores from around 100 to 5.3, 13.5, and 13.8 for Llama2-7B-Base, Llama3-8B-Base, and Mistral-7B-v0.1, respectively. In addition, we improve LLMs' safety robustness during downstream-task fine-tuning by separating the safety neurons from the models' foundation neurons.
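To make the idea of tuning only safety neurons concrete, the following is a minimal sketch of a masked gradient step, not the SN-Tune implementation itself. It assumes, purely for illustration, that a neuron corresponds to one row of a weight matrix and that the indices of the safety neurons have already been produced by some detection step. The names `masked_sgd_step` and `safety_rows`, the toy values, and the use of plain SGD are all hypothetical.

```python
from typing import List, Set

Matrix = List[List[float]]


def masked_sgd_step(
    weights: Matrix,
    grads: Matrix,
    safety_rows: Set[int],
    lr: float = 0.5,
) -> Matrix:
    """Apply one SGD step to the rows listed in `safety_rows` only.

    Assumption: each row of `weights` stands for one neuron. Rows that
    are not in `safety_rows` are copied through unchanged, which mimics
    freezing every parameter outside the detected safety neurons.
    """
    updated: Matrix = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in safety_rows:
            # Tuned neuron: ordinary gradient descent update.
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
        else:
            # Frozen neuron: the gradient is ignored.
            updated.append(list(w_row))
    return updated


# Toy example: three "neurons", of which only row 1 is treated as a safety neuron.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
safety_rows = {1}

new_weights = masked_sgd_step(weights, grads, safety_rows, lr=0.5)
print(new_weights)
```

In a real training framework, a comparable effect would typically be obtained by zeroing the gradients of all non-selected parameters before the optimizer step. Because the frozen rows never change, whatever behavior they encode is left untouched by the update.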