Several recently introduced deep learning optimizers inspired by second-order methods have shown promising speedups over the currently dominant optimizer, AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results, with some finding a quickly diminishing advantage over AdamW as scale increases. In this work, we investigate \emph{how to scale} second-order optimizers to achieve optimal performance at scale. Through theoretical and empirical analysis, we derive scaling rules for hyperparameters such as learning rate and weight decay as model width and depth grow, for a wide range of optimizers including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. For compute-optimal scaling, we find that scaling the independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers, and that second-order optimizers have a substantially larger optimal model size than AdamW for a fixed compute budget. Applying these scaling rules, we show that Muon achieves a speedup of close to $1.4\times$ or more over AdamW in training transformer language models, whereas incorrect scaling decreases the speedup from $1.4\times$ to below $1.1\times$ as models grow from $190$M to $640$M parameters.
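As an illustrative reading of the weight-decay rule (the base width $d_{\mathrm{base}}$, target width $d$, and base value $\lambda_{\mathrm{base}}$ below are hypothetical, and ``independent'' weight decay is taken to mean decay applied directly to the weights rather than scaled by the learning rate), one would set
\[
\lambda(d) \;=\; \lambda_{\mathrm{base}}\,\frac{d_{\mathrm{base}}}{d},
\qquad \text{e.g.}\quad
\lambda(4096) \;=\; \lambda_{\mathrm{base}}\cdot\frac{1024}{4096} \;=\; \tfrac{1}{4}\,\lambda_{\mathrm{base}},
\]
so that widening the model by $4\times$ shrinks the independent weight decay by $4\times$.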