The Recurrent Transformer: Greater Effective Depth and Efficient Decoding


Authors: Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, Sham Kakade

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed from the previous layer's activations, so the effective depth is capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and have historically underutilized modern accelerators. We introduce the Recurrent Transformer, a simple architectural change in which each layer attends to key-value pairs computed from its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound, with effective arithmetic intensity near $1$, because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from $\Theta(N^2)$ to $\Theta(N\log N)$, increasing effective arithmetic intensity to $\Theta(N/\log N)$ for sequence length $N$. On 150M- and 300M-parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers (at a fixed parameter count), suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency.
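The distinction the abstract draws between the two key-value sources can be illustrated with a toy single-head layer. The sketch below is only one possible reading of the abstract, not the authors' implementation: the function names `standard_layer` and `recurrent_layer`, the residual form, the absence of normalization and MLP blocks, and the choice to attend only to strictly earlier positions in the recurrent variant are all assumptions made for illustration. It is the naive sequential loop, not the tiling-based algorithm the abstract mentions.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    e = np.exp(shifted)
    return e / e.sum()


def standard_layer(x, Wq, Wk, Wv):
    """Causal attention where keys/values come from the layer INPUT x,
    i.e. from the previous layer's activations."""
    N, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv  # all positions computed in parallel
    out = np.empty_like(x)
    for t in range(N):
        w = softmax(q[t] @ k[: t + 1].T / np.sqrt(d))
        out[t] = x[t] + w @ v[: t + 1]
    return out


def recurrent_layer(x, Wq, Wk, Wv):
    """Hypothetical recurrent variant: keys/values come from this layer's
    OWN outputs, so position t can only see positions < t (assumption)."""
    N, d = x.shape
    out = np.empty_like(x)
    keys = np.empty((N, d))
    vals = np.empty((N, d))
    for t in range(N):
        q = x[t] @ Wq
        if t == 0:
            h = x[t]  # no memory yet; pass the input through (assumption)
        else:
            w = softmax(q @ keys[:t].T / np.sqrt(d))
            h = x[t] + w @ vals[:t]
        out[t] = h
        # The key/value for position t is only known after h is computed,
        # which is why keys and values are revealed one position at a time.
        keys[t] = h @ Wk
        vals[t] = h @ Wv
    return out
```

In this sketch, decoding one new token costs the same in both variants (one query against a cache of keys and values), while the recurrent variant's prefill loop cannot compute all keys up front, which matches the sequential dependence the abstract describes.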

Subject: Machine Learning

Publish: 2026-04-23 02:12:58 UTC

2604.21215
