do5vVfKEXZ@OpenReview

#1 Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models

Authors: Yinhan He, Wendy Zheng, Yushun Dong, Yaochen Zhu, Chen Chen, Jundong Li

Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits: subgraphs of model components, each with an associated functional interpretation, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high cost, since a human or an advanced LLM must interpret each computational node. To address these challenges, we propose building a "modular circuit (MC) vocabulary" of task-agnostic functional units, each consisting of a small computational subgraph together with its interpretation. This approach enables global-level interpretability by letting different language tasks share common MCs, and it reduces cost by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc's effectiveness by showing that it identifies modular circuits that perform well on a range of evaluation metrics.
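The abstract does not specify data structures, but the core idea (a shared vocabulary of subgraph-plus-interpretation units whose interpretations are reused across tasks) can be sketched minimally. The following Python sketch is purely illustrative; all class and component names (ModularCircuit, MCVocabulary, "L3.attn_head_7", etc.) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModularCircuit:
    """A task-agnostic functional unit: a small computational subgraph
    over model components, paired with a functional interpretation."""
    nodes: frozenset[str]               # model components, e.g. "L3.attn_head_7"
    edges: frozenset[tuple[str, str]]   # directed edges within the subgraph
    interpretation: str                 # human- or LLM-provided description


class MCVocabulary:
    """A shared vocabulary of modular circuits. An interpretation is
    established once and reused whenever the same MC recurs in a new task,
    avoiding repeated per-node interpretation cost."""

    def __init__(self) -> None:
        self._entries: dict[frozenset[str], ModularCircuit] = {}

    def add(self, mc: ModularCircuit) -> None:
        # Keep the first established interpretation for a given subgraph.
        self._entries.setdefault(mc.nodes, mc)

    def lookup(self, nodes: frozenset[str]) -> ModularCircuit | None:
        # Reuse an established interpretation instead of re-interpreting.
        return self._entries.get(nodes)


# Usage: a task-level circuit decomposes into MCs drawn from the vocabulary.
vocab = MCVocabulary()
mc = ModularCircuit(
    nodes=frozenset({"L3.attn_head_7", "L5.mlp"}),
    edges=frozenset({("L3.attn_head_7", "L5.mlp")}),
    interpretation="moves subject-token information to the final position",
)
vocab.add(mc)
assert vocab.lookup(mc.nodes) is mc  # interpretation reused on a new task
```

This mirrors the cost argument in the abstract: interpreting a node is expensive, so keying the vocabulary by subgraph lets later tasks retrieve an existing interpretation rather than paying that cost again.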

Subject: ICML.2025 - Poster