LYBiatN3aJ@OpenReview

Total: 1

#1 KV Shifting Attention Enhances Language Modeling

Authors: Mingyu Xu, Bingning Wang, Weipeng Chen

Current large language models (LLMs) predominantly rely on decoder-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability is the induction heads mechanism, which requires at least two layers of attention. To harness the model's induction capabilities more effectively, we revisit the induction heads mechanism and provide theoretical proof that KV shifting attention reduces the model's dependency on the depth and width required by the induction heads mechanism. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance, yielding superior performance or accelerated convergence across models ranging from toy models to pre-trained models with more than 10 billion parameters.
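The abstract does not spell out the mechanism itself. Below is a minimal single-head sketch, assuming "KV shifting" means mixing each position's key and value with those of the preceding token via learnable scalars, so that a single layer can express the match-then-copy pattern that induction heads normally need two layers for. The class and parameter names (`KVShiftingAttention`, `alpha`, `beta`) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KVShiftingAttention(nn.Module):
    """Single-head causal attention with a sketch of KV shifting.

    Assumption: each token's key/value is a learnable mix of its own
    key/value and the previous token's key/value.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Learnable mixing weights between current and previous token.
        self.alpha = nn.Parameter(torch.tensor([1.0, 0.0]))  # for keys
        self.beta = nn.Parameter(torch.tensor([1.0, 0.0]))   # for values

    @staticmethod
    def _shift(x: torch.Tensor) -> torch.Tensor:
        # Shift the sequence right by one position, zero-padding position 0.
        return F.pad(x, (0, 0, 1, 0))[:, :-1, :]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # KV shifting: blend each position's key/value with the previous one.
        k = self.alpha[0] * k + self.alpha[1] * self._shift(k)
        v = self.beta[0] * v + self.beta[1] * self._shift(v)

        # Standard causal scaled dot-product attention.
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        causal = torch.tril(
            torch.ones(x.shape[1], x.shape[1], dtype=torch.bool, device=x.device)
        )
        scores = scores.masked_fill(~causal, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        return self.o_proj(out)


if __name__ == "__main__":
    attn = KVShiftingAttention(d_model=64)
    y = attn(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

With `alpha = beta = [1, 0]` this reduces to ordinary attention; the shifted terms let a key or value carry information about its predecessor, which is the property the abstract credits with easing the depth/width requirements of induction heads.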

Subject: ICML.2025 - Poster