Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models

Authors: Ikhyun Cho, Julia Hockenmaier

Sparse autoencoders (SAEs) have emerged as a powerful analytical tool in mechanistic interpretability for large language models (LLMs), with growing success in applications beyond interpretability. Building on this momentum, we present a novel approach that leverages SAEs to enhance the general in-context learning (ICL) performance of LLMs. Specifically, we introduce Feature Detection through Prompt Variation (FDPV), which leverages the SAE’s remarkable ability to capture subtle differences between prompts, enabling efficient feature selection for downstream steering. In addition, we propose a novel steering method tailored to ICL, Selective In-Context Steering (SISTER), grounded in recent insights from ICL research that LLMs utilize label words as key anchors. Our method yields a 3.5% average performance improvement across diverse text classification tasks and exhibits greater robustness to hyperparameter variations compared to standard steering approaches. Our code is available at https://github.com/ihcho2/SAE-ICL.
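
The abstract gives no implementation details, so the following is only a rough sketch of the general idea it describes: compare SAE feature activations between two prompt variants to pick features, then add the corresponding decoder directions to the hidden states. It is not the authors' FDPV or SISTER code. The ReLU encoder form, the random weights, the function names (`sae_encode`, `select_features`, `steer`), the top-k rule, and the choice to steer only at given token positions (here standing in for label-word positions) are all assumptions made for illustration.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Assumed ReLU SAE encoder: activations (d_model,) -> features (n_feat,)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def select_features(act_a, act_b, W_enc, b_enc, k=2):
    """Rank SAE features by how much they differ between two prompt variants."""
    diff = sae_encode(act_a, W_enc, b_enc) - sae_encode(act_b, W_enc, b_enc)
    top_ids = np.argsort(-np.abs(diff))[:k]  # indices of the k largest |differences|
    return top_ids, diff

def steer(hidden, W_dec, feature_ids, positions, alpha=1.0):
    """Add alpha * (sum of selected decoder directions) at chosen positions only."""
    steered = hidden.copy()
    direction = W_dec[feature_ids].sum(axis=0)
    steered[positions] += alpha * direction
    return steered

# Toy data standing in for real model activations and a trained SAE (hypothetical).
rng = np.random.default_rng(0)
d_model, n_feat, seq_len = 8, 16, 5
W_enc = rng.normal(size=(d_model, n_feat))
b_enc = np.zeros(n_feat)
W_dec = rng.normal(size=(n_feat, d_model))

act_a = rng.normal(size=d_model)  # activation for prompt variant A
act_b = rng.normal(size=d_model)  # activation for prompt variant B
top_ids, diff = select_features(act_a, act_b, W_enc, b_enc, k=2)

hidden = rng.normal(size=(seq_len, d_model))
label_positions = [1, 3]  # assumed label-word token positions
steered = steer(hidden, W_dec, top_ids, label_positions, alpha=2.0)
```

In practice the activations and SAE weights would come from a real model and a trained SAE; the repository linked above is the place to look for the actual procedure.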

Subject: EMNLP.2025 - Main

2025.emnlp-main.1474@ACL
