
faw23a@v195@PMLR


#1 Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

Authors: Matthew Faw, Litu Rout, Constantine Caramanis, Sanjay Shakkottai

This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0, L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions more closely capture practical machine learning problems than the pervasive $L_0$-smoothness. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$, which is $(0, O(L_1))$-smooth. Despite this richness, an emerging line of works achieves the $\tilde{O}(1/\sqrt{T})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings it is either not satisfied or results in weaker convergence rates with respect to the noise scaling of the convergence rate. We develop a technique that allows us to prove $O(\mathrm{poly}\log(T)/\sqrt{T})$ convergence rates for $(L_0, L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0, L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 < 1$. For a broad subclass of $(L_0, L_1)$-smooth functions, our convergence rate continues to hold when $\sigma_1 \ge 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0, L_1)$-smooth optimization diverge with constant probability even for smooth and strongly-convex functions when $\sigma_1 > 1$.
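
As a worked check of the $\exp(L_1 x)$ example above, using the $(L_0, L_1)$-smoothness condition of Zhang et al. (which the abstract assumes but does not restate), here is a short derivation sketch:

```latex
% (L0, L1)-smoothness (Zhang et al., ICLR'20): for all x,
%   \|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|.
% Worked check that f(x) = exp(L_1 x) is (0, O(L_1))-smooth:
\[
  f(x) = e^{L_1 x}, \qquad
  f'(x) = L_1 e^{L_1 x}, \qquad
  f''(x) = L_1^2 e^{L_1 x} = L_1\, f'(x),
\]
\[
  \text{so } |f''(x)| \le 0 + L_1\,|f'(x)|,
  \quad\text{i.e. } f \text{ is } (0, L_1)\text{-smooth,}
\]
% while |f''(x)| is unbounded over x, so no finite uniform smoothness
% constant L_0 exists for f.
```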
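
The abstract refers to adaptive SGD with adaptive step sizes but does not spell out the update rule. The sketch below illustrates an AdaGrad-Norm-style step size, a common member of the adaptive SGD family studied in this line of work; the function names, hyperparameters, and toy objective are assumptions made here for illustration, not the paper's algorithm.

```python
import numpy as np

def adaptive_sgd(grad_oracle, x0, T, eta=1.0, b0=1.0, rng=None):
    """AdaGrad-Norm-style adaptive SGD (illustrative sketch, not the paper's exact method).

    Step size at iteration t: eta / sqrt(b0^2 + sum of squared stochastic gradient norms so far).
    """
    x = np.array(x0, dtype=float)
    grad_norm_sq_sum = 0.0
    for _ in range(T):
        g = grad_oracle(x, rng)                           # stochastic gradient estimate
        grad_norm_sq_sum += float(np.dot(g, g))           # accumulate squared gradient norms
        step = eta / np.sqrt(b0 ** 2 + grad_norm_sq_sum)  # adaptive step size
        x -= step * g
    return x

# Toy usage: a smooth quadratic objective with additive Gaussian gradient noise (assumed example).
def noisy_quadratic_grad(x, rng):
    return x + 0.1 * rng.standard_normal(x.shape)

x_T = adaptive_sgd(noisy_quadratic_grad, x0=np.ones(5), T=1000,
                   rng=np.random.default_rng(0))
print("final iterate norm:", np.linalg.norm(x_T))
```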

Subject: COLT.2023 - Accept