
#1 Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers

Authors: Thiziri Nait Saada, Alireza Naderi, Jared Tanner

Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even *at initialisation*, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse *in depth*, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications---a common pattern across various architectures---we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse *in width*, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
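The sketch below (not the authors' code; all names and sizes are illustrative assumptions) shows the phenomenon numerically: a random softmax attention matrix is row-stochastic, so it carries an outlier singular value near 1 well separated from the bulk, and subtracting the rank-one uniform component is one simple way to remove that outlier, in the spirit of the fix described in the abstract.

```python
# Minimal numerical sketch of the spectral gap in a random softmax attention
# matrix and a rank-one deflation that removes the outlier direction.
# Assumptions: Gaussian initialisation, single head, illustrative sizes.
import numpy as np

rng = np.random.default_rng(0)
T, d = 256, 64  # context length (width of the attention matrix) and model dimension

# Random token representations and query/key weights at initialisation.
X = rng.standard_normal((T, d)) / np.sqrt(d)
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)

# Softmax attention matrix: each row sums to 1 (row-stochastic).
logits = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# The singular spectrum shows one outlier (~1) far above the bulk: the gap.
s = np.linalg.svd(A, compute_uv=False)
print("top two singular values:", s[0], s[1])

# Since A @ ones = ones, subtracting the uniform rank-one matrix (1/T) * 11^T
# removes the outlier eigenvalue while leaving the bulk essentially unchanged.
A_deflated = A - np.ones((T, T)) / T
s_def = np.linalg.svd(A_deflated, compute_uv=False)
print("top two singular values after deflation:", s_def[0], s_def[1])
```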

Subject: ICML.2025 - Poster