Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle

Authors: Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as gradient scale invariance. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
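As a rough illustration of the claim in the abstract, the sketch below measures how much the Adam direction $m_t/\sqrt{v_t}$ changes when only the latest gradient is rescaled by a factor close to 1. This is one assumed reading of "first-order" scale invariance, not the paper's formal definition, and the helper names `adam_direction` and `scale_sensitivity`, the scalar constant-gradient history, and the finite-difference step are all hypothetical choices made for illustration.

```python
import math


def adam_direction(grads, beta1, beta2):
    """Bias-corrected Adam direction m_hat / sqrt(v_hat) for a scalar gradient sequence."""
    m = 0.0
    v = 0.0
    t = 0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment EMA
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / math.sqrt(v_hat)


def scale_sensitivity(beta1, beta2, steps=2000, g=1.0, h=1e-4):
    """Central finite difference of the direction with respect to a scale
    factor applied to the most recent gradient only (assumed toy setup)."""
    history = [g] * (steps - 1)                  # constant gradient history
    up = adam_direction(history + [(1.0 + h) * g], beta1, beta2)
    down = adam_direction(history + [(1.0 - h) * g], beta1, beta2)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    for b1, b2 in [(0.9, 0.999), (0.95, 0.95), (0.99, 0.99)]:
        print(f"beta1={b1}, beta2={b2}: d(direction)/d(scale) = "
              f"{scale_sensitivity(b1, b2):.3e}")
```

In this toy setting the derivative works out analytically to $(1-β_1)/(1-β_1^t) - (1-β_2)/(1-β_2^t)$, which vanishes when $β_{1}=β_{2}$ and is otherwise of order $β_2-β_1$ for large $t$, so the balanced settings should print values near zero while the unbalanced one should not.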

Subjects: Machine Learning, Artificial Intelligence

Published: 2026-01-29 13:56:11 UTC

2601.21739
