Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws, which relate losses across pretraining datasets and downstream tasks, have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, the tokenizer, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
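To make the notion of a loss-to-loss scaling law concrete, the sketch below fits a curve relating pretraining loss to downstream loss across runs on the same pretraining dataset. The shifted power-law functional form, the helper name shifted_power_law, and the example loss values are illustrative assumptions for this sketch, not results or methods from the paper.

```python
# Minimal sketch: fitting a loss-to-loss relationship between pretraining loss
# and downstream loss. The shifted power-law form and the loss values below
# are illustrative assumptions, not data or fits from the paper.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(train_loss, k, kappa, e_down):
    """Assumed form: downstream_loss = k * train_loss**kappa + e_down."""
    return k * train_loss**kappa + e_down

# Hypothetical (pretraining loss, downstream loss) pairs from models of
# different sizes trained on the same pretraining dataset.
train_losses = np.array([3.20, 2.90, 2.70, 2.50, 2.35])
downstream_losses = np.array([4.10, 3.70, 3.45, 3.25, 3.10])

params, _ = curve_fit(shifted_power_law, train_losses, downstream_losses,
                      p0=[1.0, 1.0, 0.5], maxfev=10000)
k, kappa, e_down = params
print(f"fit: downstream ≈ {k:.2f} * train^{kappa:.2f} + {e_down:.2f}")

# Under the abstract's claim, refitting on runs that differ only in model size,
# architecture, tokenizer, or hyperparameters should recover a similar curve,
# whereas changing the pretraining dataset would shift the fitted trend.
```

In this framing, the abstract's finding corresponds to the fitted curve being (approximately) invariant to model and training choices but sensitive to the choice of pretraining data.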