Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, little is known about the mathematical and statistical properties of this family of methods. This work provides a series of theoretical analyses of their statistical properties, corroborated by experiments. In particular, we show that when the underlying gradient follows a normal distribution, the variance of the magnitude of the *update* is an increasing and bounded function of time and does not diverge. This suggests that, contrary to what is commonly believed in the current literature, the divergence of this variance is not the reason the Adam optimizer requires warm-up.
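The claim can be probed with a minimal Monte Carlo sketch (an illustration, not the paper's derivation): feed standard Adam with default hyperparameters i.i.d. N(0, 1) gradients across many independent trajectories and track the empirical variance of the update magnitude over time. The number of runs, steps, and all hyperparameter values below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: simulate Adam on i.i.d. N(0, 1) gradients and track the
# empirical variance of |update| over time across many independent trajectories.
rng = np.random.default_rng(0)
runs, steps = 10_000, 200                      # illustrative sizes, not from the paper
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8  # standard Adam defaults

m = np.zeros(runs)
v = np.zeros(runs)
var_of_update = []

for t in range(1, steps + 1):
    g = rng.standard_normal(runs)              # gradient ~ N(0, 1)
    m = beta1 * m + (1 - beta1) * g            # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2         # second-moment EMA
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    var_of_update.append(np.abs(update).var())  # variance of the update magnitude

# In this simulation the variance grows with t and then levels off rather than diverging.
print(f"t=1:   {var_of_update[0]:.3e}")
print(f"t=10:  {var_of_update[9]:.3e}")
print(f"t=200: {var_of_update[-1]:.3e}")
```

In such a run the variance starts near zero (at t = 1 the bias-corrected update is roughly lr times the sign of the gradient) and rises toward a plateau, consistent with the abstract's description of an increasing, bounded function of time.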