Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers

Authors: Andrew Nam, Henry Conklin, Yukang Yang, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie

We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy (facilitating, interfering, or irrelevant) based on their impact on task performance. Unlike prior approaches in mechanistic interpretability, which are hypothesis-driven and require prompt templates or target labels, CHG applies directly to any dataset using standard next-token prediction. We evaluate CHG across multiple large language models (LLMs) in the Llama 3 model family and diverse tasks, including syntax, commonsense, and mathematical reasoning, and show that CHG scores yield causal, not merely correlational, insight, validated via ablation and causal mediation analyses. We also introduce contrastive CHG, a variant that isolates sub-circuits for specific task components. Our findings reveal that LLMs contain multiple sparse, sufficient sub-circuits, that individual head roles depend on interactions with others (low modularity), and that instruction following and in-context learning rely on separable mechanisms.
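The idea of a soft gate over attention heads can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the gates are parameterized, how they are trained, or how gate values map onto the facilitating/interfering/irrelevant taxonomy. The sigmoid parameterization, the function name `gated_head_mix`, and all shapes below are assumptions chosen only to show one scalar gate per head scaling that head's output before the output projection.

```python
import numpy as np

def gated_head_mix(head_outputs, gate_logits, w_out):
    """Scale each head's output by a soft gate in (0, 1), then project.

    head_outputs: (n_heads, seq_len, d_head) per-head attention outputs
    gate_logits:  (n_heads,) hypothetical learnable gate parameters
    w_out:        (n_heads * d_head, d_model) output projection
    """
    # Assumed parameterization: a sigmoid keeps every gate strictly in (0, 1).
    gates = 1.0 / (1.0 + np.exp(-gate_logits))
    # Broadcast one scalar gate over each head's full output.
    gated = head_outputs * gates[:, None, None]
    n_heads, seq_len, d_head = gated.shape
    # Concatenate heads along the feature axis, as in standard multi-head attention.
    concat = gated.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ w_out, gates

# Toy data: 4 heads, 3 tokens, head dimension 2, model dimension 5.
rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 3, 2))
w_out = rng.normal(size=(8, 5))

# A very negative logit drives a gate toward 0, which approximates ablating that head.
logits = np.array([10.0, -10.0, 10.0, 10.0])
out, gates = gated_head_mix(heads, logits, w_out)
```

In a real setting the gate logits would be optimized against the next-token prediction loss with the model weights frozen; that training loop is omitted here because the abstract gives no details about it.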

Subject: Artificial Intelligence

Publish: 2025-05-19 21:24:13 UTC

arXiv: 2505.13737
