Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

TiYOHdK35L@OpenReview

Authors: Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, Ruoxi Jia

Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To this end, we introduce **ACTOR** (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model’s ability to handle harmful queries and preserving overall utility.

Subject: ICML.2025 - Poster
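
The abstract describes identifying and adjusting the activation components that trigger refusals, but gives no formulas. As a rough illustration only, the sketch below uses a common technique from the interpretability literature: estimate a "refusal direction" as the normalized difference between mean activations on harmful and benign prompts, then shift an activation along that direction by exactly its projection. This is an assumption made for illustration, not ACTOR's actual training objective; the function names (`refusal_direction`, `projection`, `shift_away`) and the toy 2-D activations are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_direction(harmful_acts, benign_acts):
    """Unit vector pointing from the benign mean to the harmful mean.

    Assumption: refusal behaviour is approximated by a single linear
    direction in activation space (difference of means).
    """
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(benign_acts)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide; no direction")
    return [d / norm for d in diff]

def projection(act, direction):
    """Scalar projection of an activation onto a unit direction."""
    return sum(a * d for a, d in zip(act, direction))

def shift_away(act, direction, strength=1.0):
    """Shift `act` against `direction` in proportion to its own projection.

    With strength=1.0 the component along `direction` is removed exactly;
    smaller values remove only part of it.
    """
    p = projection(act, direction)
    return [a - strength * p * d for a, d in zip(act, direction)]

# Toy activations (hypothetical, 2-D for readability).
harmful = [[2.0, 0.0], [4.0, 0.0]]
benign = [[0.0, 0.0], [0.0, 0.0]]

direction = refusal_direction(harmful, benign)
over_refused = [5.0, 2.0]  # a benign query that looks "harmful" along the direction
shifted = shift_away(over_refused, direction)
```

Scaling the shift by each query's own projection, rather than applying one fixed offset, is one possible reading of the title's "just enough shifts"; in a real model the shifted activations would serve as fine-tuning targets for a single layer instead of being edited directly at inference time.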
