2025.emnlp-main.1781@ACL

A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs

Authors: Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien

Fine-tuning large language models (LLMs) to meet evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, but its potential for precise, customizable safety adjustments remains underexplored. We propose SafeSteer, a simple and effective method to guide LLM outputs by (i) leveraging category-specific steering vectors for fine-grained control, (ii) applying a gradient-free, unsupervised approach that enhances safety while preserving text quality and topic relevance without forcing explicit refusals, and (iii) eliminating the need for contrastive safe data. Across multiple LLMs, datasets, and risk categories, SafeSteer provides precise control, avoids blanket refusals, and directs models to generate safe, relevant content, aligning with recent findings that simple activation-steering techniques often outperform more complex alternatives.
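The abstract describes steering via category-specific vectors added to latent activations at inference time. The sketch below illustrates the general activation-steering idea in that spirit; all function names, the unit-normalized mean-activation construction, and the strength coefficient are illustrative assumptions, not the authors' SafeSteer implementation.

```python
import numpy as np

def build_steering_vector(category_activations: np.ndarray) -> np.ndarray:
    """Derive a per-category steering direction from hidden activations
    of examples in one risk category (no contrastive safe data assumed).
    Returns a unit-norm vector. This construction is an assumption for
    illustration, not the paper's exact recipe."""
    v = category_activations.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden_state: np.ndarray,
          vector: np.ndarray,
          strength: float = 4.0) -> np.ndarray:
    """Shift a token's hidden state away from the unsafe direction at
    inference time; `strength` is a hypothetical tuning knob."""
    return hidden_state - strength * vector

# Toy demonstration with random activations (hidden dim 16).
rng = np.random.default_rng(0)
unsafe_acts = rng.normal(size=(32, 16))   # 32 examples in one risk category
v = build_steering_vector(unsafe_acts)
h = rng.normal(size=16)                   # one token's hidden state
h_steered = steer(h, v)                   # same shape, shifted activation
```

In a real LLM this shift would typically be applied inside a forward hook on a chosen transformer layer; the gradient-free aspect the abstract highlights corresponds to the fact that no backpropagation is needed to construct or apply such vectors.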

Subject: EMNLP.2025 - Main