Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that, all else equal, estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model's behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
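The abstract does not specify the functional form of the scaling laws being estimated. As a minimal sketch of what "estimating a scaling law and extrapolating to a target model" involves, the example below fits a standard saturating power law in parameter count, L(N) = E + A·N^(−α), to a handful of hypothetical (parameter count, loss) pairs and predicts the loss of a larger model. The data, the functional form, and the starting values are illustrative assumptions, not the paper's method.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical observations from small models in one family:
# parameter counts and their final training losses (illustrative values only).
params = np.array([1.2e8, 2.4e8, 4.8e8, 9.6e8, 1.9e9, 3.8e9])
losses = np.array([3.42, 3.18, 2.96, 2.80, 2.66, 2.54])

def scaling_law(n, E, A, alpha):
    """Saturating power law in parameter count: L(N) = E + A * N^(-alpha)."""
    return E + A * n ** (-alpha)

# Fit the three coefficients to the observed losses.
popt, _ = curve_fit(scaling_law, params, losses, p0=[2.0, 100.0, 0.3], maxfev=20000)
E_hat, A_hat, alpha_hat = popt
print(f"fitted law: L(N) = {E_hat:.2f} + {A_hat:.1f} * N^(-{alpha_hat:.2f})")

# Extrapolate to a larger target model (hypothetical 13B-parameter model).
n_target = 1.3e10
print("predicted loss at 13B params:", scaling_law(n_target, *popt))
```

A richer fit of the kind the paper describes would also include the training-set size (or intermediate checkpoints along a run) as a second variable, but the extrapolation step, fit on cheap models, predict an expensive one, is the same.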