In this work, we present a novel and largely underexplored direction: building an image tokenizer directly on top of a frozen vision foundation model. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, \textbf{\ours}, achieves substantial improvements in image reconstruction and generation quality while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of \textbf{1.36} on the ImageNet benchmark, accelerating model convergence \textbf{threefold}, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at \href{https://github.com/CVMI-Lab/VFMTok}{https://github.com/CVMI-Lab/VFMTok}.
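To make the two components concrete, the following is a minimal PyTorch sketch of the training signal described above. All names are hypothetical and this is not the official \ours implementation: the region-adaptive quantizer is simplified to plain nearest-neighbor vector quantization over grid features, and the semantic reconstruction objective is approximated as a mean-squared error between a prediction head's output and the frozen encoder's features.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchTokenizer(nn.Module):
    """Hypothetical sketch: frozen VFM encoder + VQ + dual reconstruction."""

    def __init__(self, encoder: nn.Module, dim: int = 768,
                 codebook_size: int = 8192, patch_pixels: int = 3 * 16 * 16):
        super().__init__()
        self.encoder = encoder.eval()          # frozen vision foundation model
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Sequential(          # stand-in pixel decoder
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_pixels))
        self.sem_head = nn.Linear(dim, dim)    # semantic reconstruction head

    def quantize(self, z):
        # Plain nearest-neighbor VQ; the region-adaptive scheme in the paper
        # would instead pool features over irregular regions before
        # quantizing, reducing redundancy across the regular 2D grid.
        dists = torch.cdist(z, self.codebook.weight)       # (P, K)
        idx = dists.argmin(dim=-1)
        e = self.codebook(idx)
        vq_loss = F.mse_loss(e, z.detach())  # pull codebook toward frozen feats
        z_q = z + (e - z).detach()           # straight-through estimator
        return z_q, idx, vq_loss

    def forward(self, images, target_patches):
        with torch.no_grad():
            feats = self.encoder(images)     # (B, N, dim) patch features
        B, N, D = feats.shape
        z_q, idx, vq_loss = self.quantize(feats.reshape(B * N, D))
        z_q = z_q.reshape(B, N, D)
        pixels = self.decoder(z_q)           # per-patch pixel reconstruction
        semantic = self.sem_head(z_q)        # aligned with encoder features
        loss = (F.mse_loss(pixels, target_patches)
                + F.mse_loss(semantic, feats)  # semantic reconstruction term
                + vq_loss)
        return loss, idx.reshape(B, N)
\end{verbatim}

In this sketch, the semantic term keeps the discrete tokens faithful to the foundation model's representations, while the pixel term preserves reconstruction quality; a downstream AR model would then be trained over the returned token indices. The specific module shapes and loss weighting here are illustrative assumptions, not those of the released code.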