
#1 Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Authors: Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire $\textit{loss dynamics}$ obey similar laws and, crucially, how the $\textit{learning rate schedule}$ (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel $\textbf{intrinsic-time}$ viewpoint, which tracks training progress more faithfully than the iteration count. We then establish a $\textbf{Functional Scaling Law (FSL)}$ that captures the full loss trajectory under arbitrary LRSs, with the schedule’s influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs, namely constant, exponential decay, and warmup–stable–decay (WSD), and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
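
The abstract does not spell out the FSL formula itself, so the following NumPy sketch only illustrates the setting it describes: one-sample SGD on a toy power-law kernel regression model run under the three named schedules (constant, exponential decay, and WSD), with "intrinsic time" tracked as the accumulated learning rate (an assumed reading). The exponents, step counts, noise level, and schedule shapes are hypothetical choices for illustration, not values or results from the paper.

```python
# Illustrative sketch only: toy power-law kernel regression trained with one-sample SGD
# under the three schedules named in the abstract. All constants below (alpha, beta, d,
# T, eta0, warmup/decay fractions) are hypothetical, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

d, T, eta0 = 2000, 20_000, 0.1                       # feature count, SGD steps, peak learning rate
alpha, beta = 1.2, 1.5                               # assumed power-law exponents
lams = np.arange(1, d + 1, dtype=float) ** (-alpha)  # kernel eigenvalues lambda_k ~ k^{-alpha}
theta_star = np.arange(1, d + 1, dtype=float) ** (-beta / 2)  # target coefficients in the eigenbasis

def schedule(name, t, warmup=0.05, decay_start=0.8):
    """Learning rate at step t for the three schedules named in the abstract."""
    if name == "constant":
        return eta0
    if name == "exp_decay":
        return eta0 * 0.9999 ** t                    # smooth exponential decay
    if name == "wsd":                                # warmup -> stable -> linear decay to 0
        if t < warmup * T:
            return eta0 * t / (warmup * T)
        if t < decay_start * T:
            return eta0
        return eta0 * (T - t) / ((1 - decay_start) * T)
    raise ValueError(name)

def run(name):
    theta = np.zeros(d)
    intrinsic_time, taus, losses = 0.0, [], []
    for t in range(T):
        eta = schedule(name, t)
        # one-sample SGD step in the eigenbasis: x has independent components with variance lams
        x = rng.normal(size=d) * np.sqrt(lams)
        y = x @ theta_star + 0.1 * rng.normal()      # noisy target observation
        theta += eta * (y - x @ theta) * x           # gradient step on the squared loss
        intrinsic_time += eta                        # assumed intrinsic time: accumulated learning rate
        if t % 200 == 0:
            losses.append(0.5 * np.sum(lams * (theta - theta_star) ** 2))  # excess population risk
            taus.append(intrinsic_time)
    return np.array(taus), np.array(losses)

for name in ["constant", "exp_decay", "wsd"]:
    taus, losses = run(name)
    print(f"{name:>9}: final excess risk {losses[-1]:.3e} at intrinsic time {taus[-1]:.0f}")
```

In this toy run, decaying the learning rate toward the end typically lowers the final excess risk relative to the constant schedule, qualitatively echoing phenomena (ii) and (iii) above; the sketch makes no claim about the paper's quantitative FSL predictions.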

Subject: NeurIPS.2025 - Spotlight