Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

Authors: Fitsum Reda, John Kamalu, Roger Waleffe, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Diffusion language models offer a promising alternative to autoregressive (AR) models because of their potential for parallel and iterative generation. However, existing approaches use a single network for both context representation and iterative denoising, which limits the capacity available for each role. We propose TwoTower, a block-wise autoregressive diffusion model that decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1T tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's quality while offering 2.42× higher wall-clock generation throughput. We release the code and model weights at https://huggingface.co/collections/nvidia/nemotron-twotower.
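
To make the two attention patterns in the abstract concrete, here is a minimal sketch of the masks they suggest. The abstract does not give the exact masking rules, so everything here is an assumption for illustration: the function names are hypothetical, the denoiser is assumed to attend only within its own block, and the cross-attention is assumed to expose only clean tokens from strictly earlier blocks. It is not the released implementation.

```python
# Illustrative attention masks (True = attention allowed).
# All names and masking rules are assumptions, not taken from the paper's code.

def causal_mask(n):
    """Context tower (assumed): clean token i attends to clean tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Denoiser self-attention (assumed): full attention inside a block,
    no attention between different blocks."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


def cross_attention_mask(n, block_size):
    """Denoiser-to-context cross-attention (assumed): a noisy token in
    block b sees clean context tokens from blocks strictly before b."""
    return [[j // block_size < i // block_size for j in range(n)]
            for i in range(n)]


def render(mask):
    """Print a mask as rows of 1/0 for quick visual inspection."""
    for row in mask:
        print(" ".join("1" if allowed else "0" for allowed in row))


if __name__ == "__main__":
    render(block_bidirectional_mask(4, 2))
    print()
    render(cross_attention_mask(4, 2))
```

Under these assumptions the first block has no context to attend to. A real system would presumably supply a prompt or a start token there, which this sketch does not model.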

Subject: Computation and Language

Published: 2026-06-25 00:52:44 UTC

2606.26493
