Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear **sharpness disparity** across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel **Blockwise Learning Rate (LR)** strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and a nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
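
As a minimal sketch of how a blockwise LR could be wired into AdamW, the snippet below groups a model's parameters by block type and assigns each group its own learning rate via PyTorch parameter groups. The block tags, name-matching heuristic, and multipliers are illustrative assumptions for a GPT-2-style module naming, not the paper's actual implementation or its sharpness-derived ratios.

```python
import torch
from torch import nn

def blockwise_param_groups(model: nn.Module, base_lr: float, lr_scale: dict):
    """Group parameters by block type and attach a scaled learning rate.

    `lr_scale` maps a block tag (e.g. "embed", "ln", "attn", "mlp") to a
    multiplier on `base_lr`; parameters matching no tag keep `base_lr`.
    Tags are matched against parameter names, a heuristic that assumes
    GPT-2-style naming conventions.
    """
    groups = {tag: [] for tag in lr_scale}
    default = []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        for tag in lr_scale:
            if tag in name:  # crude block-type detection by substring match
                groups[tag].append(p)
                break
        else:
            default.append(p)
    param_groups = [
        {"params": ps, "lr": base_lr * lr_scale[tag]}
        for tag, ps in groups.items() if ps
    ]
    if default:
        param_groups.append({"params": default, "lr": base_lr})
    return param_groups

# Illustrative usage; the multipliers here are placeholders, not the
# sharpness-informed values used in the paper:
# optimizer = torch.optim.AdamW(
#     blockwise_param_groups(
#         model, base_lr=3e-4,
#         lr_scale={"embed": 1.0, "ln": 1.0, "attn": 1.0, "mlp": 1.0},
#     ),
#     weight_decay=0.1,
# )
```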