A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a promising solution. However, prior work in dense activation spaces struggles with $\textit{superposition}$, where multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations offer an untapped opportunity for more interpretable behavior modulation. In this work, we introduce $\textit{Sparse Activation Steering}$ (SAS), a novel method for steering LLM behavior in $\textit{sparse spaces}$. By isolating behavior-specific features (i.e., latent dimensions) through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable steering on par with their dense counterparts while offering interpretability advantages, such as easier compositionality of features in sparse spaces. Furthermore, our scaling studies of sparse latents reveal that SAS vectors grow sparser as the latent space is scaled up, approaching ideal $\textit{monosemanticity}$.
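To make the contrastive prompt-pairing idea concrete, the following is a minimal sketch of the SAS pipeline, not the paper's exact procedure: dense activations are mapped into a sparse autoencoder (SAE) latent space, the mean latents of positive and negative prompt sets are differenced, and only the top-$k$ most discriminative features are kept as the steering vector. The toy SAE weights, dimensions, `top_k` truncation, and steering coefficient `alpha` are all illustrative assumptions.

```python
# Illustrative SAS sketch with a stand-in linear SAE; in practice the
# encoder/decoder weights come from a pretrained sparse autoencoder.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sparse, top_k = 256, 4096, 32  # toy dimensions (assumed)

W_enc = rng.normal(scale=0.02, size=(d_sparse, d_model))
b_enc = np.zeros(d_sparse)
W_dec = rng.normal(scale=0.02, size=(d_model, d_sparse))

def encode(x):
    """Map a dense activation to sparse SAE latents (ReLU keeps them non-negative)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(z):
    """Map sparse latents back to the dense activation space."""
    return W_dec @ z

def sas_vector(pos_acts, neg_acts, k=top_k):
    """Contrastive prompt pairing: difference of mean sparse latents,
    truncated to the k largest-magnitude features (an assumed heuristic)."""
    diff = (np.mean([encode(a) for a in pos_acts], axis=0)
            - np.mean([encode(a) for a in neg_acts], axis=0))
    sas = np.zeros_like(diff)
    idx = np.argsort(np.abs(diff))[-k:]  # keep top-k behavior features
    sas[idx] = diff[idx]
    return sas  # sparse steering vector in SAE latent space

def steer(activation, sas, alpha=4.0):
    """Apply steering at inference: add the decoded SAS vector to the dense
    activation; negate alpha to suppress the behavior instead of reinforcing it."""
    return activation + alpha * decode(sas)

# Toy activations standing in for positive/negative contrastive prompt runs.
pos = rng.normal(size=(8, d_model))
neg = rng.normal(size=(8, d_model))
v = sas_vector(pos, neg)
print("nonzero SAS features:", np.count_nonzero(v))  # == top_k
steered = steer(rng.normal(size=d_model), v)
```

Because each SAS vector touches only a handful of latent dimensions, vectors for different behaviors would, under this sketch, compose by simple addition in the sparse space with little feature overlap, which is the compositionality advantage the abstract refers to.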