Learning In-context $n$-grams with Transformers: Sub-$n$-grams Are Near-Stationary Points

OMwdvGDeHL@OpenReview

Authors: Aditya Vardhan Varre, Gizem Yüce, Nicolas Flammarion

In this article, we explore the loss landscape of next-token prediction with transformers. Specifically, we focus on learning in-context $n$-gram language models with cross-entropy loss using a simplified two-layer transformer. We design a series of transformers that represent $k$-grams (for $k \leq n$) and for which the gradient of the population loss approaches zero in the limit of both infinite sequence length and infinite parameter norm. This construction reveals a key property of the loss landscape: $k$-grams are near-stationary points of the population cross-entropy loss. This property offers theoretical insight into widely observed empirical phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by comprehensive numerical experiments that illustrate the dynamics of learning $n$-grams, characterized by jumps between these near-stationary points.
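
For readers unfamiliar with the term, the following is a minimal sketch of what an in-context $k$-gram predictor computes: it estimates the next-token distribution from the sequence itself by counting which tokens followed earlier occurrences of the current $(k-1)$-token context. The function name `in_context_kgram`, the add-one smoothing, and the toy sequence are illustrative assumptions and are not taken from the paper, whose exact setup is not given in this abstract.

```python
from collections import Counter


def in_context_kgram(tokens, k, vocab_size, smoothing=1.0):
    """Next-token distribution from in-context k-gram counts.

    Hypothetical helper for illustration only: it counts the tokens that
    followed earlier occurrences of the last (k - 1) tokens of `tokens`,
    then applies additive smoothing (an assumption made here so that
    unseen contexts still give a valid distribution).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # The conditioning context is the last k - 1 tokens (empty when k == 1).
    context = tuple(tokens[len(tokens) - (k - 1):])
    counts = Counter()
    for i in range(k - 1, len(tokens)):
        # Compare the k - 1 tokens preceding position i with the context.
        if tuple(tokens[i - (k - 1):i]) == context:
            counts[tokens[i]] += 1
    total = sum(counts.values()) + smoothing * vocab_size
    return [(counts[v] + smoothing) / total for v in range(vocab_size)]


if __name__ == "__main__":
    seq = [0, 1, 0, 1, 0]
    print(in_context_kgram(seq, k=1, vocab_size=2))  # unigram estimate
    print(in_context_kgram(seq, k=2, vocab_size=2))  # bigram estimate
```

On the alternating toy sequence, the unigram estimate ($k=1$) stays close to uniform, while the bigram estimate ($k=2$) puts most of its mass on token 1. This gap between lower-order and higher-order predictors is the kind of difference that the stage-wise dynamics described in the abstract concern.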

Subject: ICML.2025 - Poster
