Gradient Multi-Normalization for Efficient LLM Training

Authors: Meyer Scetbon, Chao Ma, Wenbo Gong, Edward Meeds

Training large language models (LLMs) commonly relies on adaptive optimizers such as Adam (Kingma & Ba, 2015), which accelerate convergence through moment estimates but incur substantial memory overhead. Recent stateless approaches such as SWAN (Ma et al., 2024) have shown that appropriate preprocessing of instantaneous gradient matrices can match the performance of adaptive methods without storing optimizer states. Building on this insight, we introduce gradient multi-normalization, a principled framework for designing stateless optimizers that normalize gradients with respect to multiple norms simultaneously. Whereas standard first-order methods can be viewed as gradient normalization under a single norm (Bernstein & Newhouse, 2024), our formulation generalizes this perspective to a multi-norm setting. We derive an efficient alternating scheme that enforces these normalization constraints and show that the procedure produces, to arbitrary precision, a fixed point of the underlying problem. This unifies and extends prior stateless optimizers, showing that SWAN arises as a specific instance with particular norm choices. Leveraging this principle, we develop SinkGD, a lightweight matrix optimizer that retains the memory footprint of SGD while substantially reducing computation relative to whitening-based methods. On the memory-efficient LLaMA training benchmark (Zhao et al., 2024), SinkGD achieves state-of-the-art performance, matching Adam's evaluation perplexity with only 40% of the training tokens.
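
To make the alternating scheme concrete, here is a minimal PyTorch sketch of the general idea: alternately normalizing the rows and columns of a gradient matrix, Sinkhorn-style, so that repeated passes approach a fixed point satisfying both norm constraints at once. The function name `sinkhorn_normalize`, the choice of row/column l2 norms, the iteration count, and the `eps` stabilizer are illustrative assumptions, not the paper's exact specification of SinkGD.

```python
import torch

def sinkhorn_normalize(grad: torch.Tensor, n_iters: int = 2, eps: float = 1e-8) -> torch.Tensor:
    """Alternately rescale rows then columns of a gradient matrix to unit l2 norm.

    A sketch of alternating multi-normalization: each pass enforces one norm
    constraint at a time, and iterating drives the matrix toward a fixed point
    that (approximately) satisfies both. Exact norms and scaling in SinkGD may differ.
    """
    g = grad.clone()
    for _ in range(n_iters):
        g = g / (g.norm(dim=1, keepdim=True) + eps)  # normalize each row
        g = g / (g.norm(dim=0, keepdim=True) + eps)  # normalize each column
    return g

# Hypothetical usage in a stateless, SGD-style update (no optimizer state kept):
W = torch.randn(256, 512, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()
with torch.no_grad():
    W -= 1e-2 * sinkhorn_normalize(W.grad)
```

Because the update is computed from the instantaneous gradient alone, the memory footprint stays at SGD's level, which is the stateless property the abstract emphasizes.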

Subject: NeurIPS.2025 - Poster