Why Gradients Rapidly Increase Near the End of Training

2506.02285

During long-duration Large Language Model (LLM) training runs, the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.
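The interaction named in the abstract can be illustrated with a small numerical sketch. It rests on assumptions that are not stated in the abstract: plain SGD with weight decay, a scale-invariant (normalized) layer whose gradient is orthogonal to its weights, and the standard small-step equilibrium in which the gradient-to-weight norm ratio settles near sqrt(2 * wd / lr). The function names, the cosine schedule, and the hyperparameter values are hypothetical, and scaling the weight decay in proportion to the learning rate is shown only as one way to hold that ratio fixed, not necessarily the exact correction the paper proposes.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine learning rate schedule decaying from lr_max to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def steady_state_ratio(lr, wd):
    """Approximate equilibrium ||g|| / ||w|| for a scale-invariant layer.

    Assumes SGD with weight decay and a gradient orthogonal to the weights,
    so the weight norm is stationary when ||g||^2 / ||w||^2 is about 2 * wd / lr.
    """
    return math.sqrt(2.0 * wd / lr)


def scaled_wd(lr, lr_max, wd):
    """Hypothetical correction: shrink weight decay in proportion to the learning rate."""
    return wd * lr / lr_max


if __name__ == "__main__":
    lr_max, wd, total_steps = 1e-3, 0.1, 1000  # illustrative values only
    for step in (0, 500, 900, 990):
        lr = cosine_lr(step, total_steps, lr_max)
        plain = steady_state_ratio(lr, wd)
        fixed = steady_state_ratio(lr, scaled_wd(lr, lr_max, wd))
        print(f"step={step:4d} lr={lr:.3e} ratio={plain:.2f} scaled_wd_ratio={fixed:.2f}")
```

Under these assumptions, the uncorrected ratio grows without bound as the schedule drives the learning rate toward zero, while the scaled weight decay keeps it at its initial value.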

Subjects: Machine Learning, Artificial Intelligence

Published: 2025-06-02 21:51:04 UTC
