hLGJ1qZPdu@OpenReview

Total: 1

#1 On the Generalization Ability of Next-Token-Prediction Pretraining

Authors: Zhihao Li, Xue Jiang, Liyuan Liu, Xuelin Zhang, Hong Chen, Feng Zheng

Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs are typically transformer decoder-only models (DOMs) that use Next-Token-Prediction (NTP) as their pre-training objective. Despite their tremendous empirical successes, a theoretical understanding of how NTP pre-training affects a model's generalization behavior is still lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, which also accounts for the dependence between tokens. Technically, a novel decomposition of the Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, pioneering the theoretical incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is quantitatively affected by the number of token sequences $N$, the maximum sequence length $m$, and the number of parameters $\Theta$ in the transformer model. Additionally, experiments on public datasets verify our theoretical findings.
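For concreteness, the sketch below illustrates the pre-training setup the analysis targets: a decoder-only transformer whose self-attention is restricted by a causal mask matrix, trained with the NTP cross-entropy objective over $N$ sequences of maximum length $m$. This is a minimal illustration, not the authors' implementation; the `TinyDecoder` module, its sizes, and the use of PyTorch's `nn.TransformerEncoderLayer` with a causal mask as a stand-in for a decoder block are assumptions made here for exposition.

```python
# Minimal sketch (not the authors' code): NTP pre-training of a decoder-only
# model whose self-attention uses a causal mask matrix. All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, m)
        b, m = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(m, device=tokens.device))
        # Causal mask matrix: position i may attend only to positions <= i
        # (True entries are blocked in PyTorch's boolean attention masks).
        mask = torch.triu(torch.ones(m, m, dtype=torch.bool, device=tokens.device), 1)
        h = self.blocks(x, mask=mask)               # masked self-attention
        return self.lm_head(h)                      # next-token logits

def ntp_loss(model, tokens):
    # NTP objective: predict token t+1 from the prefix up to token t.
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

model = TinyDecoder()
batch = torch.randint(0, 100, (8, 16))              # N=8 sequences, length m=16
print(ntp_loss(model, batch).item())
```

In this toy setup, $N$ corresponds to the number of training sequences, $m$ to the (maximum) sequence length, and $\Theta$ to the parameter count of the decoder; these are the quantities the paper's generalization bound depends on.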

Subject: ICML.2025 - Poster