VUbwLjLkws@OpenReview


#1 Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf’s Law

Authors: Frederik Kunstner, Francis Bach

Recent works have highlighted the optimization difficulties encountered by gradient descent when training the first and last layers of transformer-based language models, difficulties which are overcome by optimizers such as Adam. The problem appears linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word, $\pi_k$, is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha$. We derive optimization scaling laws for deterministic gradient descent and for sign descent, used as a proxy for Adam, as a function of the exponent $\alpha \geq 0$. This setting differs from existing theoretical investigations of scaling laws, which assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. That assumption effectively makes the problem "finite dimensional", as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is harder when the data have heavier tails. The case $\alpha = 1$, as found in text, is the worst case for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with the dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square root of the dimension, leading to a large improvement over gradient descent for large vocabularies.
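
The setting can be reproduced numerically. The sketch below is a minimal illustration, not the paper's exact construction: it runs full-batch gradient descent and sign descent on a linear (softmax) bigram model whose context frequencies follow a power law $\pi_k \propto 1/k^\alpha$. The vocabulary size `d`, the randomly drawn target conditionals `P`, the step sizes, and the iteration budget are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# compare full-batch gradient descent and sign descent on a linear softmax
# bigram model whose context frequencies follow pi_k proportional to 1/k^alpha.
import numpy as np

rng = np.random.default_rng(0)
d = 200                                   # vocabulary size (illustrative)
alpha = 1.0                               # Zipf exponent
pi = 1.0 / np.arange(1, d + 1) ** alpha
pi /= pi.sum()                            # marginal context frequencies

# Hypothetical "true" conditional next-token distributions p(. | k), one per
# context; the population loss is the pi-weighted cross-entropy of the model.
P = rng.dirichlet(np.ones(d), size=d)
loss_opt = (pi * -(P * np.log(P)).sum(axis=1)).sum()   # best achievable loss

def loss_and_grad(W):
    """Population cross-entropy and its exact gradient for logits W (d x d)."""
    logits = W - W.max(axis=1, keepdims=True)
    Q = np.exp(logits)
    Q /= Q.sum(axis=1, keepdims=True)                  # model conditionals
    loss = (pi * -(P * np.log(Q)).sum(axis=1)).sum()
    grad = pi[:, None] * (Q - P)                       # d x d gradient
    return loss, grad

def run(direction, lr, steps=2000):
    """Deterministic updates W <- W - lr * direction(grad(W))."""
    W = np.zeros((d, d))
    for _ in range(steps):
        _, g = loss_and_grad(W)
        W -= lr * direction(g)
    return loss_and_grad(W)[0]

gd   = run(lambda g: g,          lr=2.0)    # gradient descent (illustrative step size)
sign = run(lambda g: np.sign(g), lr=5e-3)   # sign descent, a common proxy for Adam

print(f"suboptimality after 2000 steps:  GD {gd - loss_opt:.3e}   "
      f"sign {sign - loss_opt:.3e}")
```

In this toy run the gradient for context $k$ is scaled by $\pi_k$, so gradient descent makes little progress on the long tail of rare contexts, while sign descent updates every row at the same rate; sweeping `d` (and the number of steps needed to reach a fixed relative error) is one way to probe the dimension dependence discussed in the abstract.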

Subject: NeurIPS.2025 - Poster