wj4lM45xQR@OpenReview

LayerNavigator: Finding Promising Intervention Layers for Efficient Activation Steering in Large Language Models

Authors: Hao Sun, Huailiang Peng, Qiong Dai, Xu Bai, Yanan Cao

Activation steering is an efficient technique for aligning the behavior of large language models (LLMs) by injecting steering vectors directly into a model's residual stream during inference. A pivotal challenge in this approach lies in choosing the layers at which to intervene, as an inappropriate selection can undermine behavioral alignment and even impair the model's language fluency and other core capabilities. While single-layer steering allows straightforward evaluation on held-out data to identify the "best" layer, it offers only limited alignment improvements. Multi-layer steering promises stronger control but faces a combinatorial explosion of possible layer subsets, making exhaustive search impractical. To address these challenges, we propose LayerNavigator, a principled strategy for selecting promising layers. Its core innovation is a quantifiable criterion that evaluates each layer's steerability by jointly considering two key aspects: discriminability and consistency. By reusing the activations computed during steering vector generation, LayerNavigator requires no extra data and adds negligible overhead. Comprehensive experiments show that LayerNavigator achieves not only superior alignment but also greater scalability and interpretability compared to existing strategies. Our code is available at https://github.com/Bryson-Arrot/LayerNavigator
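
The abstract does not give the formulas behind the steerability criterion, so the following is only an illustrative sketch of how a per-layer score could be computed from activations that are already available after steering vector generation. Everything in it is an assumption rather than the authors' method: the mean-difference steering vector, the definition of discriminability as class separation along that vector, the definition of consistency as the mean cosine similarity between per-pair activation differences and the vector, the product used to combine them, and the names `steering_vector`, `layer_score`, `select_layers`, and `steer`. It also assumes that row i of `pos` and row i of `neg` come from the same contrastive prompt pair.

```python
import numpy as np


def steering_vector(pos, neg):
    """Mean-difference vector from (n, d) positive and negative activations."""
    return pos.mean(axis=0) - neg.mean(axis=0)


def layer_score(pos, neg, eps=1e-8):
    """Hypothetical steerability score for one layer (not the paper's formula)."""
    v = steering_vector(pos, neg)
    unit = v / (np.linalg.norm(v) + eps)

    # Assumed discriminability: gap between the two classes after projecting
    # onto the steering direction, scaled by the pooled standard deviation.
    p_pos, p_neg = pos @ unit, neg @ unit
    pooled_std = np.sqrt(0.5 * (p_pos.var() + p_neg.var()))
    discriminability = (p_pos.mean() - p_neg.mean()) / (pooled_std + eps)

    # Assumed consistency: how well each pair's difference vector agrees
    # in direction with the steering vector (mean cosine similarity).
    diffs = pos - neg
    cosines = (diffs @ unit) / (np.linalg.norm(diffs, axis=1) + eps)
    consistency = cosines.mean()

    return discriminability * consistency


def select_layers(acts, k):
    """Score every layer once and keep the k highest-scoring layer indices.

    acts maps a layer index to a (pos, neg) tuple of (n, d) arrays.
    """
    scores = {layer: layer_score(p, n) for layer, (p, n) in acts.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sorted(top), scores


def steer(hidden, v, alpha=1.0):
    """Add a scaled steering vector to a residual-stream activation."""
    return hidden + alpha * v
```

Scoring each layer independently in this way costs one pass over the L layers, whereas exhaustive multi-layer search would have to evaluate up to 2^L - 1 non-empty layer subsets on held-out data.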

Subject: NeurIPS.2025 - Poster