Total: 1
In large-scale applications, such as machine learning, it is desirable to design non-convex optimization algorithms with a high degree of parallelization. In this work, we study the adaptive complexity of finding a stationary point, which is the minimal number of sequential rounds required to achieve stationarity given polynomially many queries executed in parallel at each round. For the high-dimensional case, i.e., d=˜Ω(ε−(2+2p)/p), we show that for any (potentially randomized) algorithm, there exists a function with Lipschitz p-th order derivatives such that the algorithm requires at least ε−(p+1)/p iterations to find an ε-stationary point. Our lower bounds are tight and show that even with poly(d) queries per iteration, no algorithm has better convergence rate than those achievable with one-query-per-round algorithms. In other words, gradient descent, the cubic-regularized Newton's method, and the p-th order adaptive regularization method are adaptively optimal. Our proof relies upon novel analysis with the characterization of the output for the hardness potentials based on a chain-like structure with random partition. For the constant-dimensional case, i.e., d=Θ(1), we propose an algorithm that bridges grid search and gradient flow trapping, finding an approximate stationary point in constant iterations. Its asymptotic tightness is verified by a new lower bound on the required queries per iteration. We show there exists a smooth function such that any algorithm running with Θ(log(1/ε)) rounds requires at least ˜Ω((1/ε)(d−1)/2) queries per round. This lower bound is tight up to a logarithmic factor, and implies that the gradient flow trapping is adaptively optimal.