Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

Authors: Arseniy Andreyev, Pierfrancesco Beneventano

Recent findings by Cohen et al., 2021, demonstrate that when training neural networks with full-batch gradient descent with a step size of $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes at $\lambda_{\max} = 2/\eta$. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch stochastic gradient descent (SGD), which limits the broader applicability of those consequences. We show that SGD trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/\eta$ is *Batch Sharpness*: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, $\lambda_{\max}$, which is generally smaller than Batch Sharpness, is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.
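To make the definition of Batch Sharpness concrete, the sketch below estimates it by Monte Carlo on a toy linear least-squares problem. It reads "expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients" as $\mathbb{E}_B\left[ g_B^\top H_B\, g_B / \|g_B\|^2 \right]$, where $g_B$ and $H_B$ are the gradient and Hessian of the loss on mini-batch $B$; this formula is an interpretation of the abstract's wording, not a quotation from the paper. The loss, the synthetic data, the step size, and all function names are illustrative assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def batch_gradient(w, batch):
    """Gradient of L_B(w) = 1/(2|B|) * sum_i (x_i . w - y_i)^2 (assumed toy loss)."""
    g = [0.0] * len(w)
    for x, y in batch:
        residual = dot(x, w) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / len(batch)
    return g

def directional_curvature(g, batch):
    """Rayleigh quotient g^T H_B g / ||g||^2, with H_B = (1/|B|) * sum_i x_i x_i^T.

    For this loss, g^T H_B g = (1/|B|) * sum_i (x_i . g)^2, so the Hessian
    never has to be formed explicitly.
    """
    g_h_g = sum(dot(x, g) ** 2 for x, _ in batch) / len(batch)
    return g_h_g / dot(g, g)

def batch_sharpness(w, data, batch_size, n_batches, rng):
    """Monte Carlo average of the directional curvature over sampled mini-batches."""
    total = 0.0
    for _ in range(n_batches):
        batch = rng.sample(data, batch_size)  # sampled without replacement
        total += directional_curvature(batch_gradient(w, batch), batch)
    return total / n_batches

# Synthetic data and parameters (all hypothetical).
rng = random.Random(0)
dim = 5
data = [([rng.gauss(0, 1) for _ in range(dim)], rng.gauss(0, 1)) for _ in range(200)]
w = [rng.gauss(0, 1) for _ in range(dim)]

eta = 0.1                    # illustrative step size
threshold = 2 / eta          # the 2/eta level referred to in the abstract
bs_small = batch_sharpness(w, data, batch_size=4, n_batches=500, rng=rng)
bs_full = batch_sharpness(w, data, batch_size=len(data), n_batches=1, rng=rng)
print(f"2/eta = {threshold:.1f}, small-batch = {bs_small:.3f}, full-batch = {bs_full:.3f}")
```

With `batch_size=len(data)` the estimate reduces to the Rayleigh quotient of the full-batch Hessian along the full-batch gradient, which can be no larger than $\lambda_{\max}$. Comparing it with the small-batch estimate gives a rough feel for the gap the abstract describes, although a static toy problem like this says nothing about how either quantity evolves during training.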

Subjects: Machine Learning, Optimization and Control, Machine Learning

Publish: 2024-12-29 18:59:01 UTC

2412.20553
